From abfc267f0cc38d792f68923946a83877d07dee27 Mon Sep 17 00:00:00 2001
From: chenliang
Date: Wed, 18 Dec 2019 15:12:32 -0800
Subject: [PATCH] [SPARK-30262][SQL] Avoid NumberFormatException when totalSize is empty
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What changes were proposed in this pull request?

We can read partition statistics from the metastore, but in some special cases the values of `totalSize`, `rawDataSize`, or `rowCount` may be empty. When we run DDL such as `desc formatted ... partition`, a `NumberFormatException` is thrown:

```
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
	at java.math.BigInteger.<init>(BigInteger.java:411)
	at java.math.BigInteger.<init>(BigInteger.java:597)
	at scala.math.BigInt$.apply(BigInt.scala:77)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
```

Although we can use `ANALYZE TABLE ... PARTITION` to update `totalSize`, `rawDataSize`, or `rowCount`, it is unreasonable for normal SQL to throw `NumberFormatException` for an empty `totalSize`. We should handle the empty case in `readHiveStats`.

### Why are the changes needed?

This improves the robustness of the code, which could otherwise fail with an unexpected exception in some unpredictable situations.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manual.

Closes #26892 from southernriver/SPARK-30262.
Authored-by: chenliang
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/hive/client/HiveClientImpl.scala | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
index 700c0884dd..f196e94a83 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -1189,9 +1189,10 @@ private[hive] object HiveClientImpl {
    * Note that this statistics could be overridden by Spark's statistics if that's available.
    */
   private def readHiveStats(properties: Map[String, String]): Option[CatalogStatistics] = {
-    val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
-    val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
-    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
+    val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).filter(_.nonEmpty).map(BigInt(_))
+    val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).filter(_.nonEmpty)
+      .map(BigInt(_))
+    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).filter(_.nonEmpty).map(BigInt(_))
     // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
     // relatively cheap if parameters for the table are populated into the metastore.
     // Currently, only totalSize, rawDataSize, and rowCount are used to build the field `stats`
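The pattern the patch applies can be sketched in isolation: a minimal, standalone example (not the actual `HiveClientImpl` code) showing why `BigInt("")` throws and how a `.filter(_.nonEmpty)` guard turns an empty statistic into `None`. `parseStat` and the plain `Map` of parameters are hypothetical stand-ins for the real metastore properties.

```scala
object StatsParsing {
  // Mirrors the fixed readHiveStats logic: an absent OR empty property yields
  // None instead of throwing "NumberFormatException: Zero length BigInteger".
  def parseStat(properties: Map[String, String], key: String): Option[BigInt] =
    properties.get(key).filter(_.nonEmpty).map(BigInt(_))

  def main(args: Array[String]): Unit = {
    // Simulated partition parameters where totalSize came back empty.
    val params = Map("totalSize" -> "", "rawDataSize" -> "1024", "rowCount" -> "42")

    assert(parseStat(params, "totalSize").isEmpty)   // empty string: no exception
    assert(parseStat(params, "missingKey").isEmpty)  // absent key: also None
    assert(parseStat(params, "rawDataSize").contains(BigInt(1024)))
    assert(parseStat(params, "rowCount").contains(BigInt(42)))
  }
}
```

Without the `.filter(_.nonEmpty)` step, `params.get("totalSize").map(BigInt(_))` would throw the same `NumberFormatException` shown in the stack trace above.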