ODIn/spark-instrumented-optimizer


Author	SHA1	Message	Date
beliefer	50e535c431	[SPARK-31295][DOC][FOLLOWUP] Supplement version for configuration appear in doc ### What changes were proposed in this pull request? This PR supplements version for configuration appear in docs. I sorted out some information show below. docs/sql-performance-tuning.md Item name \| Since version \| JIRA ID \| Commit ID \| Note -- \| -- \| -- \| -- \| -- spark.sql.inMemoryColumnarStorage.compressed \| 1.0.1 \| SPARK-2631 \| 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d \| spark.sql.inMemoryColumnarStorage.batchSize \| 1.1.1 \| SPARK-2650 \| 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d \| spark.sql.files.maxPartitionBytes \| 2.0.0 \| SPARK-13664 \| 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab \| spark.sql.files.openCostInBytes \| 2.0.0 \| SPARK-14259 \| 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab \| spark.sql.broadcastTimeout \| 1.3.0 \| SPARK-4269 \| fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d \| spark.sql.autoBroadcastJoinThreshold \| 1.1.0 \| SPARK-2393 \| c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d \| spark.sql.shuffle.partitions \| 1.1.0 \| SPARK-1508 \| 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d \| spark.sql.adaptive.coalescePartitions.enabled \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.coalescePartitions.minPartitionNum \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.coalescePartitions.initialPartitionNum \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.advisoryPartitionSizeInBytes \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.skewJoin.enabled \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.skewJoin.skewedPartitionFactor \| 3.0.0 \| SPARK-31037 \| 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes \| 3.0.0 \| SPARK-31201 \| 8d0800a0803d3c47938bddefa15328d654739bc5#diff-9a6b543db706f1a90f790783d6930a13 \| docs/sql-ref-ansi-compliance.md Item name \| Since version \| JIRA ID \| Commit ID \| Note -- \| -- \| -- \| -- \| -- spark.sql.ansi.enabled \| 3.0.0 \| SPARK-30125 \| d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 \| spark.sql.storeAssignmentPolicy \| 3.0.0 \| SPARK-28730 \| 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 \| ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Jenkins test Closes #28096 from beliefer/supplement-version-of-performance. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-02 16:01:54 +09:00
Wenchen Fan	05498af72e	[SPARK-31201][SQL] Add an individual config for skewed partition threshold ### What changes were proposed in this pull request? Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so. Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold. This PR adds a new config for the threshold and set default value as 256 MB. ### Why are the changes needed? Avoid skew join handling that may introduce a perf regression. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27967 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 22:57:01 +09:00
jiake	21c02ee5d0	[SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution ### What changes were proposed in this pull request? This PR will add the user guide for AQE and the detailed configurations about the three mainly features in AQE. ### Why are the changes needed? Add the detailed configurations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? only add doc no need ut. Closes #27616 from JkSelf/aqeuserguide. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-16 23:33:56 +08:00
turbofei	8b1839728a	[SPARK-29542][FOLLOW-UP] Keep the description of spark.sql.files.* in tuning guide be consistent with that in SQLConf ### What changes were proposed in this pull request? This pr is a follow up of https://github.com/apache/spark/pull/26200. In this PR, I modify the description of spark.sql.files.* in sql-performance-tuning.md to keep consistent with that in SQLConf. ### Why are the changes needed? To keep consistent with the description in SQLConf. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existed UT. Closes #27545 from turboFei/SPARK-29542-follow-up. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 20:21:52 +09:00
Yuanjian Li	2acae975aa	[SPARK-30278][SQL][DOC] Update Spark SQL document menu for new changes ### What changes were proposed in this pull request? Update the Spark SQL document menu and join strategy hints. ### Why are the changes needed? - Several new changes in the Spark SQL document didn't change the menu-sql.yaml correspondingly. - Update the demo code for join strategy hints. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Document change only. Closes #26917 from xuanyuanking/SPARK-30278. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-27 13:22:26 +08:00
maryannxue	43da473c1c	[SPARK-27225][SQL] Implement join strategy hints ## What changes were proposed in this pull request? This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better. The hinted strategy will be used for the join with which it is associated if it is applicable/doable. Conflict resolving rules in case of multiple hints: 1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint. 2. Conflicts between two sides of the join: a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```. b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size. ## How was this patch tested? Added new UTs. Closes #24164 from maryannxue/join-hints. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 00:14:37 +08:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
Kazuaki Ishizaki	c391dc65ef	[SPARK-24499][SQL][DOC][FOLLOW-UP] Fix spelling in doc ## What changes were proposed in this pull request? This PR replaces `turing` with `tuning` in files and a file name. Currently, in the left side menu, `Turing` is shown. [This page](https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/_site/sql-performance-turing.html) is one of examples. ![image](https://user-images.githubusercontent.com/1315079/47332714-20a96180-d6bb-11e8-9a5a-0a8dad292626.png) ## How was this patch tested? `grep -rin turing docs` && `find docs -name "turing"` Closes #22800 from kiszk/SPARK-24499-follow. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-23 12:19:31 +08:00

Renamed from docs/sql-performance-turing.md

8 commits
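
Note on the SPARK-27225 entry above: rule 2a (different strategy hints on the two sides of a join) can be illustrated with a small standalone sketch. This is not Spark code. The function name `resolve_join_hint` and the list `HINT_PRIORITY` are hypothetical names chosen for illustration; only the hint names and their relative order are taken from the commit message.

```python
from typing import Optional

# Priority order as stated in the SPARK-27225 commit message, highest first.
HINT_PRIORITY = ["BROADCAST", "SHUFFLE_MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL"]


def resolve_join_hint(left_hint: Optional[str], right_hint: Optional[str]) -> Optional[str]:
    """Pick the winning strategy hint when the two join sides may disagree.

    Hypothetical helper: each argument is the hint attached to one side of
    the join, or None when that side carries no hint.
    """
    candidates = [h for h in (left_hint, right_hint) if h is not None]
    if not candidates:
        return None  # no hint on either side: nothing to resolve
    # A lower index in HINT_PRIORITY means a higher priority.
    # An unknown hint name raises ValueError from list.index.
    return min(candidates, key=HINT_PRIORITY.index)
```

Under these assumptions, `resolve_join_hint("SHUFFLE_HASH", "BROADCAST")` selects `BROADCAST`. Rule 1 (conflicts within one side) and rule 2b (build-side choice by join type and size) are not modeled here.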