Commit graph

28489 commits

Author SHA1 Message Date
Kent Yao cdd8e51742 [SPARK-33419][SQL] Unexpected behavior when using SET commands before a query in SparkSession.sql
### What changes were proposed in this pull request?

SparkSession.sql converts a string value to a DataFrame; the string should be a single SQL statement, optionally ending with one or more semicolons, e.g.

```sql
scala> spark.sql(" select 2").show
+---+
|  2|
+---+
|  2|
+---+
scala> spark.sql(" select 2;").show
+---+
|  2|
+---+
|  2|
+---+

scala> spark.sql(" select 2;;;;").show
+---+
|  2|
+---+
|  2|
+---+
```
If we pass in two or more statements, the parser fails as expected, e.g.

```sql
scala> spark.sql(" select 2; select 1;").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'select' expecting {<EOF>, ';'}(line 1, pos 11)

== SQL ==
 select 2; select 1;
-----------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

In a very common scenario, users want to change some settings before they execute
their queries. They may pass a string like `set spark.sql.abc=2; select 1;` into this API, which creates a confusing gap between the actual effect and the user's expectations.

The user wants the query to be executed with `spark.sql.abc=2`, but Spark actually treats everything after the `=`, i.e. `2; select 1;`, as the value of the property `spark.sql.abc`,
e.g.

```
scala> spark.sql("set spark.sql.abc=2; select 1;").show
+-------------+------------+
|          key|       value|
+-------------+------------+
|spark.sql.abc|2; select 1;|
+-------------+------------+
```

What's more, the SET command swallows everything that follows it, which makes its behavior unstable from version to version, e.g.

#### 3.1
```sql
scala> spark.sql("set;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided

scala> spark.sql("set a;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set a;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

#### 2.4

```sql
scala> spark.sql("set;").show
+---+-----------+
|key|      value|
+---+-----------+
|  ;|<undefined>|
+---+-----------+

scala> spark.sql("set a;").show
+---+-----------+
|key|      value|
+---+-----------+
| a;|<undefined>|
+---+-----------+
```

In this PR,
1. `set spark.sql.abc=2; select 1;` now fails directly in `SparkSession.sql`; users should call `.sql` separately for each statement.
2. the semicolon becomes the separator of statements, and users who want a semicolon as part of a property value must quote the value.

### Why are the changes needed?

1. Disambiguate `SparkSession.sql`.
2. Make the semicolon behave the same with `SET` as with other statements.

### Does this PR introduce _any_ user-facing change?

Yes.
The semicolon now works as a statement separator: it is trimmed when it appears at the end of a statement and fails the statement when it appears in the middle. Quote the property value if you want a semicolon to be part of it.
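
For example, a minimal spark-shell sketch of the intended usage after this change (the config key `spark.sql.abc` is just the placeholder used above):

```scala
// Run each statement through its own .sql call instead of chaining them.
spark.sql("set spark.sql.abc=2")   // a trailing semicolon is still accepted and trimmed
spark.sql("select 1").show()

// Passing two statements in one call is expected to fail after this change:
// spark.sql("set spark.sql.abc=2; select 1;")   // ParseException
```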

### How was this patch tested?

new tests

Closes #30332 from yaooqinn/SPARK-33419.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 06:58:16 +00:00
ulysses 82a21d2a3e [SPARK-33433][SQL] Change Aggregate max rows to 1 if grouping is empty
### What changes were proposed in this pull request?

Change `Aggregate` max rows to 1 if grouping is empty.

### Why are the changes needed?

If `Aggregate` grouping is empty, the result is always one row.

Then we don't need to push down the limit in `LimitPushDown` for a case such as:
```
select count(*) from t1
union
select count(*) from t2
limit 1
```
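
A small, self-contained Scala sketch (not the actual Spark classes) of the idea that makes the limit redundant: an aggregate with no grouping expressions can report a maximum of one output row.

```scala
// Toy plan hierarchy; the names are illustrative and do not match Spark's internals.
sealed trait Plan { def maxRows: Option[Long] }

case class Relation(rowCount: Long) extends Plan {
  def maxRows: Option[Long] = Some(rowCount)
}

case class Aggregate(groupingExprs: Seq[String], child: Plan) extends Plan {
  // A global aggregate (empty grouping) always produces exactly one row.
  def maxRows: Option[Long] =
    if (groupingExprs.isEmpty) Some(1L) else child.maxRows
}

println(Aggregate(Nil, Relation(1000000L)).maxRows)      // Some(1)
println(Aggregate(Seq("k"), Relation(1000000L)).maxRows) // Some(1000000)
```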

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30356 from ulysses-you/SPARK-33433.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-13 15:57:07 +09:00
Dongjoon Hyun a70a2b02ce
[SPARK-33439][INFRA] Use SERIAL_SBT_TESTS=1 for SQL modules
### What changes were proposed in this pull request?

This PR aims to decrease the parallelism of `SQL` module like `Hive` module.

### Why are the changes needed?

The GitHub Actions job `sql - slow tests` became flaky.
- https://github.com/apache/spark/runs/1393670291
- https://github.com/apache/spark/runs/1393088031

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.
Although it will increase the running time, that is better than flakiness.

### How was this patch tested?

Pass the GitHub Action stably.

Closes #30365 from dongjoon-hyun/SPARK-33439.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 21:19:51 -08:00
Max Gekk 539c2deb89 [SPARK-33426][SQL][TESTS] Unify Hive SHOW TABLES tests
### What changes were proposed in this pull request?
1. Create a separate test suite, `org.apache.spark.sql.hive.execution.command.ShowTablesSuite`.
2. Re-use the V1 SHOW TABLES tests added by https://github.com/apache/spark/pull/30287 in the Hive test suites.
3. Add a new test case for the pattern `'table_name_1*|table_name_2*'` to the common test suite (see the example below).
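
For reference, a spark-shell sketch of the kind of pattern the new test case covers (the table names are hypothetical; `|` acts as an alternation between the two prefixes):

```scala
spark.sql("CREATE TABLE table_name_1a (id INT) USING parquet")
spark.sql("CREATE TABLE table_name_2b (id INT) USING parquet")
spark.sql("SHOW TABLES LIKE 'table_name_1*|table_name_2*'").show()
```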

### Why are the changes needed?
To run the V1 and common SHOW TABLES tests against Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running v1/v2 and Hive v1 `ShowTablesSuite`:
```
$  build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30340 from MaxGekk/show-tables-hive-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 05:15:13 +00:00
Liang-Chi Hsieh 2c64b731ae
[SPARK-33259][SS] Disable streaming query with possible correctness issue by default
### What changes were proposed in this pull request?

This patch proposes to disable the streaming query with possible correctness issue in chained stateful operators. The behavior can be controlled by a SQL config, so if users understand the risk and still want to run the query, they can disable the check.

### Why are the changes needed?

The possible correctness issue in chained stateful operators in a streaming query is not obvious to users. From the user's perspective, it will be seen as a Spark bug. In the worst case, users are not even aware of the correctness issue and consume wrong results.

A better approach is to disable such queries and let users choose to run them only if they understand the risk, instead of implicitly running the query and letting users discover the correctness issue themselves and report it to the Spark community.

### Does this PR introduce _any_ user-facing change?

Yes. A streaming query with a possible correctness issue will be blocked from running unless users explicitly disable the check via the SQL config.

### How was this patch tested?

Unit test.

Closes #30210 from viirya/SPARK-33259.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 15:31:57 -08:00
Chao Sun cf3b6551ce
[SPARK-33435][SQL] DSv2: REFRESH TABLE should invalidate caches referencing the table
### What changes were proposed in this pull request?

This changes `RefreshTableExec` in DSv2 to also invalidate caches with references to the target table to be refreshed. The change itself is similar to what's done in #30211. Note, though, that since we currently don't support caching a DSv2 table directly, this doesn't add the recache logic present in the DSv1 implementation; I marked it as a TODO for now.

### Why are the changes needed?

Currently the behavior of DSv1 and DSv2 is inconsistent w.r.t. refreshing a table: in DSv1 we invalidate both the metadata cache and all table caches related to the table, but in DSv2 we only do the former. This addresses the issue and makes the behavior consistent.

### Does this PR introduce _any_ user-facing change?

Yes, refreshing a v2 table now also invalidates all the related caches.
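
A rough spark-shell sketch of the new behavior (the catalog and table names are hypothetical):

```scala
// Cache a query that references a v2 table, then refresh that table.
spark.sql("CACHE TABLE cached_q AS SELECT * FROM testcat.ns.tbl")
spark.sql("REFRESH TABLE testcat.ns.tbl")   // with this change, cached_q is invalidated too
```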

### How was this patch tested?

Added a new UT.

Closes #30359 from sunchao/SPARK-33435.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 15:22:56 -08:00
Linhong Liu 1baf0d5c9b [SPARK-33140][SQL][FOLLOW-UP] change val to def in object rule
### What changes were proposed in this pull request?
In #30097, many rules were changed from case classes to objects, but if a rule
is stateful there will be a problem. For example, if an object rule uses a
`val` to refer to a config, the value stays unchanged after initialization even if
another Spark session uses a different config value.
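
A minimal, self-contained Scala sketch of the problem (not the actual rule code): a `val` in an object is evaluated once when the object is initialized, so a later config change is never observed, while a `def` re-reads the value on every access.

```scala
object ConfHolder { var current: Int = 1 }   // stand-in for a mutable SQL config

object BadRule  { val limit: Int = ConfHolder.current }  // captured once at initialization
object GoodRule { def limit: Int = ConfHolder.current }  // re-evaluated on every access

println(BadRule.limit)    // 1 — forces BadRule's initialization with the old value
ConfHolder.current = 2    // e.g. another session uses a different config value
println(BadRule.limit)    // still 1 (stale)
println(GoodRule.limit)   // 2
```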

### Why are the changes needed?
Avoid potential bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #30354 from linhongliu-db/SPARK-33140-followup-2.

Lead-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Co-authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-13 01:10:28 +09:00
gengjiaan 2f07c56810 [SPARK-33278][SQL] Improve the performance for FIRST_VALUE
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29800 provides a performance improvement for `NTH_VALUE`.
`FIRST_VALUE` can also use `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame`.
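
For context, a spark-shell sketch of the shape of query that can use these offset window frames (assuming any `first_value` over an unbounded frame qualifies):

```scala
spark.range(100).withColumnRenamed("id", "k").createOrReplaceTempView("t")
spark.sql("""
  SELECT k,
         first_value(k) OVER (ORDER BY k
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS fv
  FROM t
""").show(5)
```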

### Why are the changes needed?
Improve the performance for `FIRST_VALUE`.

### Does this PR introduce _any_ user-facing change?
 'No'.

### How was this patch tested?
Jenkins test.

Closes #30178 from beliefer/SPARK-33278.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 14:59:22 +00:00
ulysses a3d2954662 [SPARK-33421][SQL] Support Greatest and Least in Expression Canonicalize
### What changes were proposed in this pull request?

Add `Greatest` and `Least` check in `Canonicalize`.

### Why are the changes needed?

The order of the children of both `Greatest` and `Least` is irrelevant.

Let's say we have `greatest(1, 2)` and `greatest(2, 1)`: with this change, both canonicalize to the same expression, as sketched below.
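
A quick spark-shell sketch of how to check that claim (assuming `Column.expr` and `Expression.canonicalized` behave as in current Spark):

```scala
import org.apache.spark.sql.functions.{col, greatest}

val e1 = greatest(col("a"), col("b")).expr.canonicalized
val e2 = greatest(col("b"), col("a")).expr.canonicalized
println(e1 == e2)   // expected to be true once Greatest/Least are handled by Canonicalize
```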

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30330 from ulysses-you/SPARK-33421.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 20:26:33 +09:00
zhengruifeng a2887164bc [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
### What changes were proposed in this pull request?
1. Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors;
2. Infer an appropriate `maxBlockSizeInMB` if it is set to 0.

### Why are the changes needed?
The performance gain is mainly related to the number of non-zeros (nnz) per block.

f2jBLAS |   |   |   |   |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Duration(millisecond) | branch 3.0 Impl | blockSizeInMB=0.0625 | blockSizeInMB=0.125 | blockSizeInMB=0.25 | blockSizeInMB=0.5 | blockSizeInMB=1 | blockSizeInMB=2 | blockSizeInMB=4 | blockSizeInMB=8 | blockSizeInMB=16 | blockSizeInMB=32 | blockSizeInMB=64 | blockSizeInMB=128
epsilon(100%) | 326481 | 26143 | 25710 | 24726 | 25395 | 25840 | 26846 | 25927 | 27431 | 26190 | 26056 | 26347 | 27204
epsilon3000(67%) | 455247 | 35893 | 34366 | 34985 | 38387 | 38901 | 40426 | 40044 | 39161 | 38767 | 39965 | 39523 | 39108
epsilon4000(50%) | 306390 | 42256 | 41164 | 43748 | 48638 | 50892 | 50986 | 51091 | 51072 | 51289 | 51652 | 53312 | 52146
epsilon5000(40%) | 307619 | 43639 | 42992 | 44743 | 50800 | 51939 | 51871 | 52190 | 53850 | 52607 | 51062 | 52509 | 51570
epsilon10000(20%) | 310070 | 58371 | 55921 | 56317 | 56618 | 53694 | 52131 | 51768 | 51728 | 52233 | 51881 | 51653 | 52440
epsilon20000(10%) | 316565 | 109193 | 95121 | 82764 | 69653 | 60764 | 56066 | 53371 | 52822 | 52872 | 52769 | 52527 | 53508
epsilon200000(1%) | 336181 | 1569721 | 1069355 | 673718 | 375043 | 218230 | 145393 | 110926 | 94327 | 87039 | 83926 | 81890 | 81787
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  | Speedup |   |   |   |   |   |   |   |   |   |   |   |  
epsilon(100%) | 1 | 12.48827602 | 12.69859977 | **13.20395535** | 12.85611341 | 12.63471362 | 12.16125307 | 12.59231689 | 11.90189931 | 12.46586483 | 12.5299739 | 12.39158158 | 12.00121306
epsilon3000(67%) | 1 | 12.68344803 | **13.2470174** | 13.01263399 | 11.85940553 | 11.70270687 | 11.26124276 | 11.36866946 | 11.62500958 | 11.74315784 | 11.39114225 | 11.51853351 | 11.64076404
epsilon4000(50%) | 1 | 7.250804619 | **7.443154212** | 7.003520161 | 6.299395534 | 6.020396133 | 6.00929667 | 5.996946625 | 5.999177632 | 5.973795551 | 5.931812902 | 5.747111345 | 5.875618456
epsilon5000(40%) | 1 | 7.049176196 | **7.155261444** | 6.875243055 | 6.055492126 | 5.92269778 | 5.930462108 | 5.894213451 | 5.712516249 | 5.847491779 | 6.024421292 | 5.858405226 | 5.965076595
epsilon10000(20%) | 1 | 5.312055644 | 5.544786395 | 5.505797539 | 5.4765269 | 5.774760681 | 5.947900481 | 5.98960748 | 5.994239097 | 5.93628549 | 5.976561747 | **6.002942714** | 5.912852784
epsilon20000(10%) | 1 | 2.899132728 | 3.328024306 | 3.824911797 | 4.544886796 | 5.209745902 | 5.64629187 | 5.931404695 | 5.993052137 | 5.987384627 | 5.999071425 | **6.026710073** | 5.916218136
epsilon200000(1%) | 1 | 0.214166084 | 0.314377358 | 0.498993644 | 0.896379882 | 1.540489392 | 2.312222734 | 3.03067811 | 3.563995463 | 3.862417997 | 4.005683578 | 4.105275369 | **4.110445425**

OpenBLAS |   |   |   |   |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Duration(millisecond) | branch 3.0 Impl | blockSizeInMB=0.0625 | blockSizeInMB=0.125 | blockSizeInMB=0.25 | blockSizeInMB=0.5 | blockSizeInMB=1 | blockSizeInMB=2 | blockSizeInMB=4 | blockSizeInMB=8 | blockSizeInMB=16 | blockSizeInMB=32 | blockSizeInMB=64 | blockSizeInMB=128
epsilon(100%) | 299119 | 26047 | 25049 | 25239 | 28001 | 35138 | 36438 | 36279 | 36114 | 35111 | 35428 | 36295 | 35197
epsilon3000(67%) | 439798 | 33321 | 34423 | 34336 | 38906 | 51756 | 54138 | 54085 | 53412 | 54766 | 54425 | 54221 | 54842
epsilon4000(50%) | 302963 | 42960 | 40678 | 43483 | 48254 | 50888 | 54990 | 52647 | 51947 | 51843 | 52891 | 53410 | 52020
epsilon5000(40%) | 303569 | 44225 | 44961 | 45065 | 51768 | 52776 | 51930 | 53587 | 53104 | 51833 | 52138 | 52574 | 53756
epsilon10000(20%) | 307403 | 58447 | 55993 | 56757 | 56694 | 54038 | 52734 | 52073 | 52051 | 52150 | 51986 | 52407 | 52390
epsilon20000(10%) | 313344 | 107580 | 94679 | 83329 | 70226 | 60996 | 57130 | 55461 | 54641 | 52712 | 52541 | 53101 | 53312
epsilon200000(1%) | 334679 | 1642726 | 1073148 | 654481 | 364974 | 213881 | 140248 | 107579 | 91757 | 85090 | 81940 | 80492 | 80250
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  | Speedup |   |   |   |   |   |   |   |   |   |   |   |  
epsilon(100%) | 1 | 11.48381771 | **11.94135494** | 11.85146004 | 10.68243991 | 8.512692811 | 8.208985125 | 8.244962651 | 8.282632774 | 8.519238985 | 8.443011178 | 8.241328007 | 8.498423161
epsilon3000(67%) | 1 | 13.19882356 | 12.7762833 | **12.80865564** | 11.30411762 | 8.497526857 | 8.123646976 | 8.131607655 | 8.234067251 | 8.030493372 | 8.080808452 | 8.111211523 | 8.01936472
epsilon4000(50%) | 1 | 7.052211359 | **7.44783421** | 6.967389555 | 6.278505409 | 5.953525389 | 5.509419895 | 5.754610899 | 5.832155851 | 5.843855487 | 5.728063376 | 5.672402172 | 5.823971549
epsilon5000(40%) | 1 | **6.86419446** | 6.751829363 | 6.736247642 | 5.864027971 | 5.752027437 | 5.845734643 | 5.664974714 | 5.716499699 | 5.856674319 | 5.822413595 | 5.774127896 | 5.647164968
epsilon10000(20%) | 1 | 5.259517169 | 5.490025539 | 5.416124883 | 5.422143437 | 5.688645028 | 5.829313157 | 5.903308816 | 5.905803923 | 5.894592522 | **5.913188166** | 5.865685882 | 5.867589235
epsilon20000(10%) | 1 | 2.912660346 | 3.309540658 | 3.760323537 | 4.461937174 | 5.137123746 | 5.48475407 | 5.649807973 | 5.734594901 | 5.944452876 | **5.963799699** | 5.900905821 | 5.87755102
epsilon200000(1%) | 1 | 0.203733915 | 0.311866583 | 0.511365494 | 0.916994087 | 1.564790701 | 2.38633706 | 3.111006795 | 3.647449241 | 3.933235398 | 4.084439834 | 4.157916315 | **4.170454829**

### Does this PR introduce _any_ user-facing change?
Yes, the param `blockSize` is replaced by `blockSizeInMB` in master.
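
A usage sketch of the new parameter (assuming the setter follows the `maxBlockSizeInMB` param name mentioned above; 0 asks the algorithm to infer a block size):

```scala
import org.apache.spark.ml.classification.LinearSVC

val svc = new LinearSVC()
  .setMaxIter(10)
  .setMaxBlockSizeInMB(0.0)   // 0.0 = infer an appropriate block size automatically
```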

### How was this patch tested?
added testsuites and performance test (result attached in [ticket](https://issues.apache.org/jira/browse/SPARK-32907))

Closes #30009 from zhengruifeng/adaptively_blockify_linear_svc_II.

Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-11-12 19:14:07 +08:00
Kent Yao 4335af075a [MINOR][DOC] spark.executor.memoryOverhead is not cluster-mode only
### What changes were proposed in this pull request?

Remove "in cluster mode" from the description of `spark.executor.memoryOverhead`

### Why are the changes needed?

Fix a correctness issue in the documentation.

### Does this PR introduce _any_ user-facing change?

Yes; users will no longer be confused by the description of `spark.executor.memoryOverhead`.

### How was this patch tested?

pass GA doc generation

Closes #30311 from yaooqinn/minordoc.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-12 18:53:06 +09:00
xuewei.linxuewei 6d31daeb6a [SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound
### What changes were proposed in this pull request?

Instead of returning NULL, throw a runtime ArrayIndexOutOfBoundsException when ANSI mode is enabled for the `element_at`, `elt`, and `GetArrayItem` functions.

### Why are the changes needed?

For ansiMode.

### Does this PR introduce any user-facing change?

When `spark.sql.ansi.enabled` is true, Spark throws `ArrayIndexOutOfBoundsException` for an out-of-range index when accessing array elements.
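
A spark-shell sketch of the new ANSI-mode behavior:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Index 5 is out of range for a 3-element array: with ANSI mode on, this is now
// expected to throw ArrayIndexOutOfBoundsException instead of returning NULL.
spark.sql("SELECT element_at(array(1, 2, 3), 5)").show()
```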

### How was this patch tested?

Added UT and existing UT.

Closes #30297 from leanken/leanken-SPARK-33386.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 08:50:32 +00:00
Dongjoon Hyun 22baf05a9e [SPARK-33408][SPARK-32354][K8S][R] Use R 3.6.3 in K8s R image and re-enable RTestsSuite
### What changes were proposed in this pull request?

This PR aims to use R 3.6.3 in K8s R image and re-enable `RTestsSuite`.

### Why are the changes needed?

Jenkins Server is using `R 3.6.3`.
```
+ SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
+ /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
* using log directory ‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
* using R version 3.6.3 (2020-02-29)
```

The OpenJDK docker image uses `R 3.5.2 (2018-12-20)`, which is old, and currently `spark-3.0.1` fails to run SparkR.
```
$ cd spark-3.0.1-bin-hadoop3.2

$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
...
	 exit code: 1
	 termination reason: Error
...

$ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R

$ k logs dataframe-r-b1c14b75b0c09eeb-driver
...
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.RRunner local:///opt/spark/examples/src/main/r/dataframe.R
20/11/10 06:03:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: package or namespace load failed for ‘SparkR’ in rbind(info, getNamespaceInfo(env, "S3methods")):
 number of columns of matrices must match (see arg 2)
In addition: Warning message:
package ‘SparkR’ was built under R version 4.0.2
Execution halted
```

In addition, this PR aims to recover the test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass K8S IT Jenkins job.

Closes #30130 from dongjoon-hyun/SPARK-32354.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 15:36:31 +09:00
Yuanjian Li 9f983a68f1 [SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods
### Why are the changes needed?
Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test and Mima test.

Closes #30344 from xuanyuanking/SPARK-30294-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 12:22:25 +09:00
Ruifeng Zheng 6244407ce6 Revert "[WIP] Test (#30327)"
This reverts commit 61ee5d8a4e.

### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 11:32:12 +09:00
WeichenXu 61ee5d8a4e
[WIP] Test (#30327)
* resend

* address comments

* directly gen new Iter

* directly gen new Iter

* update blockify strategy

* address comments

* try to fix 2.13

* try to fix scala 2.13

* use 1.0 as the default value for gemv

* update

Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
2020-11-12 10:20:33 +08:00
Josh Soref 9d58a2f0f0 [MINOR][GRAPHX] Correct typos in the sub-modules: graphx, external, and examples
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules: graphx, external, and examples.
Split per holdenk https://github.com/apache/spark/pull/30323#issuecomment-725159710

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No testing was performed

Closes #30326 from jsoref/spelling-graphx.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-12 08:29:22 +09:00
Steve Loughran 318a173fce
[SPARK-33402][CORE] Jobs launched in same second have duplicate MapReduce JobIDs
### What changes were proposed in this pull request?

1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes in a unique job UUID in `spark.sql.sources.writeJobUUID`
1. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp to ensure the probability of a collision is near-zero.
1. With tests of uniqueness, round trips and negative jobID rejection.

### Why are the changes needed?

Without this, any job started in the same second as another job, *when the committer expects application attempt IDs to be unique*, is at risk of clashing with that other job.

With the fix,

* those committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
* committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID. Assuming all clocks in a cluster are in sync, the probability of a collision between two jobs launched in the same second drops from 1 to 1/(2^63) (see the sketch below).
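
A minimal, self-contained sketch of the ID-generation idea (not the actual `SparkHadoopWriterUtils` code): append a random non-negative long to the second-granularity timestamp so that two jobs started in the same second still get distinct tracker IDs.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import scala.util.Random

def createJobTrackerID(time: Date): String = {
  val timestamp = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time)
  // Mask the sign bit so the suffix is always non-negative; the collision
  // probability within one second is then roughly 1/(2^63).
  val suffix = Random.nextLong() & Long.MaxValue
  s"$timestamp$suffix"
}

println(createJobTrackerID(new Date()))
```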

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests

There's a new test suite, SparkHadoopWriterUtilsSuite, which creates job IDs, verifies that they are unique even for the same timestamp, and verifies that they can be marshalled to a string and parsed back by the Hadoop code, which contains some (brittle) assumptions about the format of job IDs.

Functional Integration Tests

1. Hadoop-trunk built with [HADOOP-17318], publishing to local maven repository
1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
1. Spark + Object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local spark version
1. And executed against AWS london.

The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It again uses the current Date value for an app attempt, which is not guaranteed to be unique.

With the change applied to spark, the relevant tests work, therefore the committers are getting unique job IDs.

Closes #30319 from steveloughran/BUG/SPARK-33402-jobuuid.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-11 14:27:48 -08:00
Max Gekk 7e867298fe
[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for date_trunc
### What changes were proposed in this pull request?
Updated results of `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

### Why are the changes needed?
The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By regenerating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark"
```

Closes #30338 from MaxGekk/fix-trunc_date-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-11 08:50:43 -08:00
zero323 4b76a74f1c [SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__
### What changes were proposed in this pull request?

Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`.

### Why are the changes needed?

API consistency and improved readability of the expressions.

### Does this PR introduce _any_ user-facing change?

Before this change

    col("abc")
    col("wąż")

result in

    Column<b'abc'>
    Column<b'w\xc4\x85\xc5\xbc'>

After this change we'll get

    Column<'abc'>
    Column<'wąż'>

### How was this patch tested?

Existing tests and manual inspection.

Closes #30322 from zero323/SPARK-33415.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 00:13:17 +09:00
stczwd 1eb236b936 [SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2
### What changes were proposed in this pull request?
This patch adds `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` based on the new table partition API defined in #28617.

### Does this PR introduce _any_ user-facing change?
Yes. Users can use `ALTER TABLE ... ADD PARTITION` or `ALTER TABLE ... DROP PARTITION` to create or drop partitions on a v2 table.
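
A usage sketch (the catalog and table names are hypothetical, and the table must live in a catalog that implements the new partition API):

```scala
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-11-11')")
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (dt = '2020-11-11')")
```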

### How was this patch tested?
Run suites and fix old tests.

Closes #29339 from stczwd/SPARK-32512-new.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 09:30:42 +00:00
Wenchen Fan 8760032f4f [SPARK-33412][SQL] OverwriteByExpression should resolve its delete condition based on the table relation not the input query
### What changes were proposed in this pull request?

Make a special case in `ResolveReferences`, which resolves `OverwriteByExpression`'s condition expression based on the table relation instead of the input query.

### Why are the changes needed?

The condition expression is eventually passed to the table implementation, so we should resolve it against the table schema. Previously this worked because of a hack in `ResolveReferences` that delays the resolution if `outputResolved == false`. However, this hack doesn't work for tables accepting any schema, like https://github.com/delta-io/delta/pull/521 : we may wrongly resolve the delete condition using the input query's output columns, which don't match the table column names.
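
For reference, a spark-shell sketch of the v2 write API whose delete condition is affected (the catalog/table name is hypothetical):

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(10).withColumn("day", col("id") % 2)
// The overwrite condition should be resolved against the target table's schema,
// not against df's output attributes.
df.writeTo("testcat.ns.tbl").overwrite(col("day") === 0)
```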

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests and updated test in v2 write.

Closes #30318 from cloud-fan/v2-write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 16:13:21 +09:00
Takeshi Yamamuro 4b367976a8 [SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use an option `--query-filter` for selecting TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But, the current master has a weird behaviour about the option. For example, if we pass `--query-filter q6` so as to run the TPCDS q6 only, `TPCDSQueryBenchmark` runs `q6` and `q6-v2.7` because the `filterQueries` method does not respect the name suffix. So, there is no way now to run the TPCDS q6 only.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #30324 from maropu/FilterBugInTPCDSQueryBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 15:24:05 +09:00
Terry Kim 6d5d030957 [SPARK-33414][SQL] Migrate SHOW CREATE TABLE command to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `SHOW CREATE TABLE` works only with a v1 table or a permanent view, and is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("SHOW CREATE TABLE t AS SERDE") // Succeeds
```
With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883)
  at scala.Option.map(Option.scala:230)
```
, which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail.

### Does this PR introduce _any_ user-facing change?

After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`.

### How was this patch tested?

Updated existing tests.

Closes #30321 from imback82/show_create_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:54:27 +00:00
Max Gekk 1e2eeda20e [SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests
### What changes were proposed in this pull request?
In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource specific tests to the `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`.

### Why are the changes needed?
- The unification allows running the common `SHOW TABLES` tests against both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`

Closes #30287 from MaxGekk/unify-dsv1_v2-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:26:46 +00:00
ulysses 5197c5d2e7 [SPARK-33390][SQL] Make Literal support char array
### What changes were proposed in this pull request?

Make Literal support char array.

### Why are the changes needed?

We often use `Literal()` to create a foldable value, and `char[]` is a common data type. We can make it easy to create a string Literal from a `char[]`.

### Does this PR introduce _any_ user-facing change?

Yes, users can call `Literal()` with a `char[]`.
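
A sketch of the new capability (assuming `Literal.apply` is the entry point, as in current Spark):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// After this change, a char[] is accepted and turned into a string literal,
// the same as passing "abc" directly.
val lit = Literal("abc".toCharArray)
println(lit)
```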

### How was this patch tested?

Add test.

Closes #30295 from ulysses-you/SPARK-33390.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 11:39:11 +09:00
Utkarsh 46346943bb [SPARK-33404][SQL] Fix incorrect results in date_trunc expression
### What changes were proposed in this pull request?
The following query produces incorrect results:
```
SELECT date_trunc('minute', '1769-10-17 17:10:02')
```
Spark currently incorrectly returns
```
1769-10-17 17:10:02
```
against the expected return value of
```
1769-10-17 17:10:00
```
**Steps to repro**
Run the following commands in spark-shell:
```
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
```
This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.

This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.

Closes #30303 from utkarsh39/trunc-timestamp-fix.

Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 09:28:59 +09:00
Liang-Chi Hsieh 6fa80ed1dd [SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions
### What changes were proposed in this pull request?

Currently we skip subexpression elimination in branches of conditional expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions.

### Why are the changes needed?

We may miss subexpression elimination chances in the branches of conditional expressions. This kind of subexpression is frequently seen; it may be written manually by users or come from the query optimizer. For example, project collapsing can embed expressions between two `Project`s and produce a conditional expression like:

```
CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END
```

If `jsonToStruct(json)` is an expensive expression, we currently don't eliminate the duplication and waste time running it repeatedly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30245 from viirya/SPARK-33337.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-10 16:17:00 -08:00
zero323 122c8999cb [SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns PrefixSpan.findFrequentSequentialPatterns
### What changes were proposed in this pull request?

Changes

    pyspark.sql.dataframe.DataFrame

to

    :py:class:`pyspark.sql.DataFrame`

### Why are the changes needed?

Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).

### Does this PR introduce _any_ user-facing change?

User will see shorter reference with a link.

### How was this patch tested?

`dev/lint-python` and manual check of the rendered docs.

Closes #30313 from zero323/SPARK-33251-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-11-10 09:17:00 -08:00
Chao Sun 3165ca742a [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader
### What changes were proposed in this pull request?

This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader` in Hive module.

### Why are the changes needed?

Currently, when initializing `IsolatedClientLoader`, users can set the `sharesHadoopClasses` flag to decide whether the `HiveClient` created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.

There are two reasons to remove this:
1. this feature is currently used in two cases: 1) unit tests, 2) when the Hadoop version defined in Maven can not be found when `spark.sql.hive.metastore.jars` is equal to "maven", which could be very rare.
2. when `sharesHadoopClasses` is false, Spark doesn't really use only Hadoop classes from the Hive jars: we also download the `hadoop-client` jar and put all the sub-module jars (e.g., `hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the Hadoop version used by `hadoop-client` is the same version used by Spark itself. As a result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the `IsolatedClientLoader` are still using Hadoop 2.x or even lower.

### Does this PR introduce _any_ user-facing change?

This affects Spark users in one scenario: when `spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in the pom file cannot be downloaded. Currently the behavior is to switch to _not_ sharing Hadoop classes; with this PR it will share Hadoop classes with Spark.

### How was this patch tested?

Existing UTs.

Closes #30284 from sunchao/SPARK-33376.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 15:41:04 +00:00
angerszhu 34f5e7ce77 [SPARK-33302][SQL] Push down filters through Expand
### What changes were proposed in this pull request?
Push down filters through Expand. For the case below:
```
create table t1(pid int, uid int, sid int, dt date, suid int) using parquet;
create table t2(pid int, vs int, uid int, csid int) using parquet;

SELECT
       years,
       appversion,
       SUM(uusers) AS users
FROM   (SELECT
               Date_trunc('year', dt)          AS years,
               CASE
                 WHEN h.pid = 3 THEN 'iOS'
                 WHEN h.pid = 4 THEN 'Android'
                 ELSE 'Other'
               END                             AS viewport,
               h.vs                            AS appversion,
               Count(DISTINCT u.uid)           AS uusers
               ,Count(DISTINCT u.suid)         AS srcusers
        FROM   t1 u
               join t2 h
                 ON h.uid = u.uid
        GROUP  BY 1,
                  2,
                  3) AS a
WHERE  viewport = 'iOS'
GROUP  BY 1,
          2
```

Plan. before this pr:
```
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
   +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
      +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
         +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
            +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
               +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
                     +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                        +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
                           +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
                              +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
                                 +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
                                    :- *(2) Project [uid#7, dt#9, suid#10]
                                    :  +- *(2) Filter isnotnull(uid#7)
                                    :     +- *(2) ColumnarToRow
                                    :        +- FileScan parquet default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date,suid:int>
                                    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233]
                                       +- *(1) Project [pid#11, vs#12, uid#13]
                                          +- *(1) Filter isnotnull(uid#13)
                                             +- *(1) ColumnarToRow
                                                +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [isnotnull(uid#13)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>
```

Plan. after. this pr. :
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[years#0, appversion#2], functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L])
   +- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71]
      +- HashAggregate(keys=[years#0, appversion#2], functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2, sum#22L])
         +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[count(distinct uid#7)], output=[years#0, appversion#2, uusers#3L])
            +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true, [id=#67]
               +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[partial_count(distinct uid#7)], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, count#27L])
                  +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                     +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7, 5), true, [id=#63]
                        +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                           +- Project [uid#7, dt#9, pid#11, vs#12]
                              +- BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight, false
                                 :- Filter isnotnull(uid#7)
                                 :  +- FileScan parquet default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date>
                                 +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, false] as bigint)),false), [id=#58]
                                    +- Filter ((CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND isnotnull(uid#13))
                                       +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS), isnotnull..., Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>

```

### Why are the changes needed?
Improve performance by filtering more data earlier.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30278 from AngersZhuuuu/SPARK-33302.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:40:24 +00:00
Chao Sun 4934da56bc [SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache
### What changes were proposed in this pull request?

This changes `DropTableExec` to also invalidate caches referencing the table to be dropped, in a cascading manner.

### Why are the changes needed?

In DSv1, the `DROP TABLE` command also invalidates caches, as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). However, in DSv2 the same command only drops the table and doesn't handle the caches. This could lead to correctness issues.

### Does this PR introduce _any_ user-facing change?

Yes. Now the DSv2 `DROP TABLE` command also invalidates caches.
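
A rough spark-shell sketch (the catalog and table names are hypothetical):

```scala
spark.sql("CACHE TABLE cached_q AS SELECT * FROM testcat.ns.tbl")
spark.sql("DROP TABLE testcat.ns.tbl")   // with this change, cached_q is invalidated as well
```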

### How was this patch tested?

Added a new UT

Closes #30211 from sunchao/SPARK-33305.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:37:42 +00:00
lrz 27bb40b629 [SPARK-33339][PYTHON] Pyspark application will hang due to non Exception error
### What changes were proposed in this pull request?

When a SystemExit exception occurs in the process, the Python worker exits abnormally, and the executor task then keeps waiting to read from the worker's socket, causing it to hang.
The SystemExit may be caused by the user's own code, but Spark should at least throw an error to inform the user rather than get stuck.
We can run a simple test to reproduce this case:

```
from pyspark.sql import SparkSession
def err(line):
  raise SystemExit
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
spark.stop()
```

### Why are the changes needed?

To make sure a PySpark application won't hang if a non-Exception error occurs in the Python worker.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added a new test and also manually tested the case above

Closes #30248 from li36909/pyspark.

Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 19:39:18 +09:00
xuewei.linxuewei e3a768dd79 [SPARK-33391][SQL] element_at with CreateArray not respect one based index
### What changes were proposed in this pull request?

`element_at` with `CreateArray` does not respect the one-based index when computing nullability.

Repro steps:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

The correct nullability should be:
index 0: true (out of bounds, so it falls back to the default nullable = true)
index 1: false
index 2: false
index 3: false

```

For expression evaluation it respects the one-based index, but when checking nullability it computes with a zero-based index in `computeNullabilityFromArray`.
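
A minimal sketch of the intended nullability computation, assuming a literal array and a literal one-based index (the helper below is hypothetical, not the actual Catalyst code):

```scala
// Hypothetical helper: element nullabilities are kept in array order (zero-based),
// while the user-facing element_at index is one-based and must be shifted before lookup.
def elementAtNullable(elementNullable: Seq[Boolean], oneBasedIndex: Int): Boolean = {
  if (oneBasedIndex < 1 || oneBasedIndex > elementNullable.length) {
    true // out of bounds: element_at returns null, so the result is nullable
  } else {
    elementNullable(oneBasedIndex - 1) // convert the one-based index to zero-based
  }
}

// elementAtNullable(Seq(false, false, false), 0) == true  (out of bounds)
// elementAtNullable(Seq(false, false, false), 1) == false
// elementAtNullable(Seq(false, false, false), 3) == false
```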

### Why are the changes needed?

Correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and existing UT.

Closes #30296 from leanken/leanken-SPARK-33391.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 07:23:47 +00:00
Yuanjian Li ad02ceda29 [SPARK-33244][SQL] Unify the code paths for spark.table and spark.read.table
### What changes were proposed in this pull request?

- Call `spark.read.table` in `spark.table`.
- Add comments for `spark.table` to emphasize that it also supports streaming temp view reading.

### Why are the changes needed?
The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken in SPARK-32592 because we needed to respect options in the `spark.read.table` API.
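
A brief illustration of the unified entry points (the table name is a placeholder):

```scala
// After this change both calls go through the same code path, with
// spark.table delegating to spark.read.table internally.
val df1 = spark.table("some_table")
val df2 = spark.read.table("some_table")
```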

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #30148 from xuanyuanking/SPARK-33244.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:46:45 +00:00
Terry Kim 90f6f39e42 [SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `LOAD DATA` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds
```
With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
    at scala.Option.foreach(Option.scala:407)
```
This is expected, since the temporary view is resolved first and `LOAD DATA` doesn't support temporary views.

### Does this PR introduce _any_ user-facing change?

After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

### How was this patch tested?

Updated existing tests.

Closes #30270 from imback82/load_data_cmd.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:28:06 +00:00
Gengliang Wang a1f84d8714 [SPARK-33369][SQL] DSV2: Skip schema inference in write if table provider supports external metadata
### What changes were proposed in this pull request?

When `TableProvider.supportsExternalMetadata()` is true, Spark will use the input DataFrame's schema in `DataFrameWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.

### Why are the changes needed?

For all the v2 data sources that are not `FileDataSourceV2`, Spark always infers the table schema/partitioning on `DataFrameWriter.save()`/`DataStreamWriter.start()`.
Such inference can be expensive. However, there is no trait or flag indicating that a V2 source can use the input DataFrame's schema on `DataFrameWriter.save()`/`DataStreamWriter.start()`. We can resolve the problem by giving the method `TableProvider.supportsExternalMetadata()` a new expected behavior.
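
A hedged sketch of a custom v2 source opting in to external metadata (the class name and the trivial method bodies are illustrative; only the `TableProvider` API itself comes from Spark):

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class MyExternalMetadataSource extends TableProvider {
  // Returning true tells Spark it may pass the input DataFrame's schema/partitioning
  // straight to getTable instead of calling inferSchema/inferPartitioning on
  // DataFrameWriter.save()/DataStreamWriter.start().
  override def supportsExternalMetadata(): Boolean = true

  // Still required by the interface; only used when Spark does need to infer.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = new StructType()

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: JMap[String, String]): Table = {
    // Build a Table that trusts the externally supplied schema (omitted here).
    ???
  }
}
```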

### Does this PR introduce _any_ user-facing change?

Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true.

### How was this patch tested?

Unit test

Closes #30273 from gengliangwang/supportsExternalMetadata.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 04:43:32 +00:00
Chao Sun c2caf2522b [SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0
### What changes were proposed in this pull request?

This upgrades Apache Arrow from version 1.0.1 to 2.0.0.

### Why are the changes needed?

Apache Arrow 2.0.0 was released with some improvements on the Java side, so it's better to upgrade Spark to the new version.
Note that the format version in Arrow 2.0.0 is still 1.0.0, so the API should remain compatible between 1.0.1 and 2.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #30306 from sunchao/SPARK-33213.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-09 19:07:16 -08:00
Gabor Somogyi 4ac8133866 [SPARK-33223][SS][UI] Structured Streaming Web UI state information
### What changes were proposed in this pull request?
The Structured Streaming UI does not contain state information. This PR adds it.

### Why are the changes needed?
Missing state information.

### Does this PR introduce _any_ user-facing change?
Additional UI elements appear.

### How was this patch tested?
Existing unit tests + manual test.
<img width="1044" alt="Screenshot 2020-10-30 at 15 14 21" src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png">

Closes #30151 from gaborgsomogyi/SPARK-33223.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-10 11:22:35 +09:00
neko 4360c6f12a [SPARK-33363] Add prompt information related to the current task when pyspark/sparkR starts
### What changes were proposed in this pull request?
Add prompt information about the current applicationId, current URL, and master info when pyspark/sparkR starts.

### Why are the changes needed?
The information printed when pyspark/sparkR starts does not include the basic information of the current application, which is inconvenient when using pyspark/sparkR from the command prompt.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manual test results are shown below:
![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)

Closes #30266 from akiyamaneko/pyspark-hint-info.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:12:19 +09:00
Dongjoon Hyun 35ac314181 [SPARK-33405][BUILD] Upgrade commons-compress to 1.20
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-compress` from 1.8 to 1.20.

### Why are the changes needed?

- https://commons.apache.org/proper/commons-compress/security-reports.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30304 from dongjoon-hyun/SPARK-33405.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:08:55 +09:00
Kent Yao 036c11b0d4 [SPARK-33397][YARN][DOC] Fix generating md to html for available-patterns-for-shs-custom-executor-log-url
### What changes were proposed in this pull request?

1. Replace `{{}}` with `&#123;&#123;&#125;&#125;`.
2. Use `<code></code>` inside the td tags.

### Why are the changes needed?

To fix the broken rendering shown below:
![image](https://user-images.githubusercontent.com/8326978/98544155-8c74bc00-22ce-11eb-8889-8dacb726b762.png)

### Does this PR introduce _any_ user-facing change?

Yes, the online doc renders correctly with this change:

![image](https://user-images.githubusercontent.com/8326978/98545256-2e48d880-22d0-11eb-9dd9-b8cae3df8659.png)

### How was this patch tested?

Verified via `jekyll serve`, as shown in the screenshot above.

Closes #30298 from yaooqinn/SPARK-33397.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-10 10:15:55 +09:00
zero323 090962cd42 [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*)
### What changes were proposed in this pull request?

This PR proposes migration of `pyspark.ml` to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

`dev/lint-python` and manual inspection.

Closes #30285 from zero323/SPARK-33251.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 09:33:48 +09:00
huangtianhua 83a80796aa [SPARK-32691][BUILD] Update commons-crypto to v1.1.0
### What changes were proposed in this pull request?
Update the commons-crypto package to v1.1.0 to support the aarch64 platform:
- https://issues.apache.org/jira/browse/CRYPTO-139

### Why are the changes needed?

The commons-crypto 1.0.0 package available in the Maven repository
doesn't support the aarch64 platform. `CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv)` takes a long time when `NettyBlockRpcServer`
receives block data from a client; if it exceeds the default timeout of 120s, an IOException is raised and the client
retries replicating the block data to other executors. But the replication has in fact completed,
so the replication count becomes incorrect.
This change makes the DistributedSuite tests pass.
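
For reference, a minimal sketch of the call path mentioned above (the IV size is an arbitrary choice for illustration):

```scala
import java.util.Properties

import org.apache.commons.crypto.random.CryptoRandomFactory

val properties = new Properties()
val iv = new Array[Byte](16) // illustrative IV size
// With commons-crypto 1.0.0 on aarch64 this call can take far longer than the
// 120s replication timeout described above; 1.1.0 (CRYPTO-139) adds aarch64 support.
CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv)
```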

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Pass the CIs.

Closes #30275 from huangtianhua/SPARK-32691.

Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-09 14:33:27 -08:00
Chandni Singh 8113c88542 [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
### What changes were proposed in this pull request?
This is one of the patches for SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602) which is needed for push-based shuffle.
Summary of changes:
- Adds an implementation of `MergedShuffleFileManager`, which was introduced in [SPARK-32915](https://issues.apache.org/jira/browse/SPARK-32915).
- Integrated the push-based shuffle service with `YarnShuffleService`.

### Why are the changes needed?
Refer to the SPIP in  [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Ye Zhou yezhou@linkedin.com

Closes #30062 from otterc/SPARK-32916.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-11-09 11:00:52 -06:00
Peter Toth 84dc374611 [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
### What changes were proposed in this pull request?
This PR modifies the `ExtractPythonUDFs` rule to deduplicate deterministic PythonUDF calls.

Before this PR, the DataFrame `df.withColumn("c", batchedPythonUDF(col("a"))).withColumn("d", col("c"))` has the plan:
```
*(1) Project [value#1 AS a#4, pythonUDF1#15 AS c#7, pythonUDF1#15 AS d#10]
+- BatchEvalPython [dummyUDF(value#1), dummyUDF(value#1)], [pythonUDF0#14, pythonUDF1#15]
   +- LocalTableScan [value#1]
```
After this PR the deterministic PythonUDF calls are deduplicated:
```
*(1) Project [value#1 AS a#4, pythonUDF0#14 AS c#7, pythonUDF0#14 AS d#10]
+- BatchEvalPython [dummyUDF(value#1)], [pythonUDF0#14]
   +- LocalTableScan [value#1]
```

### Why are the changes needed?
To fix a performance issue.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New and existing UTs.

Closes #30203 from peter-toth/SPARK-33303-deduplicate-deterministic-udf-calls.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-09 19:27:36 +09:00
Linhong Liu 4e1c89400d [SPARK-33140][SQL][FOLLOW-UP] Use sparkSession in AQE context when applying rules
### What changes were proposed in this pull request?
After #30097, all rules use `SparkSession.active` to get the `SQLConf`
and the `SparkSession`. But in AQE, when applying the rules to the initial plan,
we should use the Spark session in the AQE context.

### Why are the changes needed?
Fix a potential problem caused by using the wrong Spark session.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #30294 from linhongliu-db/SPARK-33140-followup.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:44:58 +00:00
Yuming Wang 7a5647a93a [SPARK-33385][SQL] Support bucket pruning for IsNaN
### What changes were proposed in this pull request?

This PR adds support for bucket pruning on the `IsNaN` predicate.
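
A small illustration of a query that can benefit, assuming a table bucketed on a double column (table and column names are made up):

```scala
// Bucket a table on a double column, then filter with isnan: with this change the
// scan can prune down to the single bucket that NaN values hash into.
spark.range(10000)
  .selectExpr("cast(id as double) as d")
  .write.bucketBy(8, "d").saveAsTable("bucketed_doubles")

// The physical scan in the explain output should now report only one selected bucket.
spark.sql("SELECT * FROM bucketed_doubles WHERE isnan(d)").explain()
```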

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30291 from wangyum/SPARK-33385.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:20:31 +00:00
Yuming Wang 69799c514f [SPARK-33372][SQL] Fix InSet bucket pruning
### What changes were proposed in this pull request?

This PR fixes `InSet` bucket pruning, since its values are not `Literal`s:
cbd3fdea62/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L253-L255)

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual test:

```scala
spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t")
spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) | ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png)

Closes #30279 from wangyum/SPARK-33372.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:32:51 +00:00
Wenchen Fan 98730b7ee2 [SPARK-33087][SQL] DataFrameWriterV2 should delegate table resolution to the analyzer
### What changes were proposed in this pull request?

This PR makes `DataFrameWriterV2` create query plans with `UnresolvedRelation` and leaves the table resolution work to the analyzer.
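
For context, a tiny usage sketch (catalog and table names are placeholders); the `writeTo` call below now produces a plan containing an `UnresolvedRelation` that the analyzer resolves with the same rules as SQL INSERT and `DataFrameWriter`:

```scala
// DataFrameWriterV2 entry point: "testcat.ns.t" is left unresolved in the plan
// and resolved later by the analyzer.
val df = spark.range(5).toDF("id")
df.writeTo("testcat.ns.t").append()
```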

### Why are the changes needed?

Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes.

### Does this PR introduce _any_ user-facing change?

Yes.
1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT.
2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed.
3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed.
4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command)

### How was this patch tested?

new tests

Closes #29970 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:08:00 +00:00