36ea55e97e
## What changes were proposed in this pull request? Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition(n) or coalesce(n, shuffle) that require them to write and deploy Scala/Java/Python code. We propose adding the following Hive-style Coalesce and Repartition Hint to Spark SQL: ``` ... SELECT /*+ COALESCE(numPartitions) */ ... ... SELECT /*+ REPARTITION(numPartitions) */ ... ``` Multiple such hints are allowed. Multiple nodes are inserted into the logical plan, and the optimizer will pick the leftmost hint. ``` INSERT INTO s SELECT /*+ REPARTITION(100), COALESCE(500), COALESCE(10) */ * FROM t == Logical Plan == 'InsertIntoTable 'UnresolvedRelation `s`, false, false +- 'UnresolvedHint REPARTITION, [100] +- 'UnresolvedHint COALESCE, [500] +- 'UnresolvedHint COALESCE, [10] +- 'Project [*] +- 'UnresolvedRelation `t` == Optimized Logical Plan == InsertIntoHadoopFsRelationCommand ... +- Repartition 100, true +- HiveTableRelation ... ``` ## How was this patch tested? All unit tests. Manual tests using explain. Author: John Zhuge <jzhuge@apache.org> Closes #21911 from jzhuge/SPARK-24940. |
||
---|---|---|
.. | ||
java/org/apache/spark/sql | ||
resources | ||
scala/org/apache/spark/sql |