Commit graph

27767 commits

Max Gekk b2180c0950 [SPARK-32471][SQL][DOCS][TESTS][PYTHON][SS] Describe JSON option allowNonNumericNumbers
### What changes were proposed in this pull request?
1. Describe the JSON option `allowNonNumericNumbers`, which is used in reads (a usage sketch follows this list)
2. Add new test cases for allowed JSON field values: NaN, +INF, +Infinity, Infinity, -INF and -Infinity
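
A minimal Scala sketch (not taken from the PR; the JSON literals and schema are illustrative assumptions) showing the documented option in use:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("allowNonNumericNumbers").getOrCreate()
import spark.implicits._

// Non-numeric tokens such as NaN and Infinity are accepted when the option is enabled.
val ds = Seq("""{"value": NaN}""", """{"value": Infinity}""", """{"value": -INF}""").toDS()
val df = spark.read
  .option("allowNonNumericNumbers", "true")
  .schema("value DOUBLE")
  .json(ds)
df.show()
```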

### Why are the changes needed?
To improve the UX of Spark SQL and to provide users with full information about the supported option.

### Does this PR introduce _any_ user-facing change?
Yes, in PySpark.

### How was this patch tested?
Added new test to `JsonParsingOptionsSuite`

Closes #29275 from MaxGekk/allowNonNumericNumbers-doc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-29 12:14:13 +09:00
HyukjinKwon 5491c08bf1 Revert "[SPARK-31525][SQL] Return an empty list for df.head() when df is empty"
This reverts commit 44a5258ac2.
2020-07-29 12:07:35 +09:00
Michael Munday a3d80564ad [SPARK-32458][SQL][TESTS] Fix incorrectly sized row value reads
### What changes were proposed in this pull request?
Updates to tests to use correctly sized `getInt` or `getLong` calls.

### Why are the changes needed?
The reads were incorrectly sized (i.e. `putLong` paired with `getInt` and `putInt` paired with `getLong`). This causes test failures on big-endian systems.
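
An illustrative Scala sketch (an assumption, not the actual test code) of why a mismatched read only appears to work on little-endian platforms:

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// One field: 8-byte null bitset followed by an 8-byte value region.
val row = new UnsafeRow(1)
row.pointTo(new Array[Byte](16), 16)

row.setLong(0, 42L)     // 64-bit write
row.getLong(0)          // correctly sized read: 42 on any platform
row.getInt(0)           // mismatched 32-bit read: 42 on little-endian, 0 on big-endian
```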

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests were run on a big-endian system (s390x). This change is unlikely to have any practical effect on little-endian systems.

Closes #29258 from mundaym/fix-row.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-28 10:36:20 -07:00
Xiaochang Wu 44c868b73a [SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration docs
### What changes were proposed in this pull request?
Rewrite the guide for enabling BLAS native acceleration so that it is clearer and more complete.

### Why are the changes needed?
The documentation for enabling BLAS native acceleration in the ML guide (https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete and unclear to users.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

Closes #29139 from xwu99/blas-doc.

Lead-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-07-28 08:36:11 -07:00
Max Gekk c28da672f8 [SPARK-32382][SQL] Override table renaming in JDBC dialects
### What changes were proposed in this pull request?
Override the default implementation of `JdbcDialect.renameTable()`:
```scala
s"ALTER TABLE $oldTable RENAME TO $newTable"
```
in the following JDBC dialects according to official documentation:
- DB2
- Derby
- MS SQL Server
- Teradata

Other dialects follow the default implementation:
- MySQL: https://dev.mysql.com/doc/refman/8.0/en/alter-table.html
- Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/ALTER-TABLE.html#GUID-552E7373-BF93-477D-9DA3-B2C9386F2877
- PostgreSQL: https://www.postgresql.org/docs/12/sql-altertable.html
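
A hypothetical sketch of the kind of dialect override this PR describes (the object name is made up, and the exact SQL differs per database; DB2, for example, uses `RENAME TABLE`):

```scala
import org.apache.spark.sql.jdbc.JdbcDialect

object ExampleDB2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
  // Override the default "ALTER TABLE ... RENAME TO ..." statement.
  override def renameTable(oldTable: String, newTable: String): String =
    s"RENAME TABLE $oldTable TO $newTable"
}
```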

### Why are the changes needed?
To have a correct implementation of table renaming for all supported JDBC dialects.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Manually

Closes #29237 from MaxGekk/jdbc-rename-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-28 12:34:10 +00:00
yi.wu ca1ecf7f9f [SPARK-32459][SQL] Support WrappedArray as customCollectionCls in MapObjects
### What changes were proposed in this pull request?

This PR supports `WrappedArray` as `customCollectionCls` in `MapObjects`.

### Why are the changes needed?

This helps fix the regression caused by SPARK-31826. The following test passes in branch-3.0 but fails in the master branch:

```scala
test("WrappedArray") {
    val myUdf = udf((a: WrappedArray[Int]) =>
      WrappedArray.make[Int](Array(a.head + 99)))
    checkAnswer(Seq(Array(1))
      .toDF("col")
      .select(myUdf(Column("col"))),
      Row(ArrayBuffer(100)))
  }
```

In SPARK-31826, we've changed the catalyst-to-scala converter from `CatalystTypeConverters` to `ExpressionEncoder.deserializer`. However, `CatalystTypeConverters` supports `WrappedArray` while `ExpressionEncoder.deserializer` doesn't.

### Does this PR introduce _any_ user-facing change?

No. SPARK-31826 is only merged into master and branch-3.1, which haven't been released.

### How was this patch tested?

Added a new test for `WrappedArray` in `UDFSuite`; Also updated `ObjectExpressionsSuite` for `MapObjects`.

Closes #29261 from Ngone51/fix-wrappedarray.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-28 12:24:15 +00:00
xuewei.linxuewei 12b9787a7f [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize
### What changes were proposed in this pull request?
Normally, a null-aware anti join is planned into a BroadcastNestedLoopJoin, which is very time consuming, for instance in TPCH Query 16.

```
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#45'
    and p_type not like 'MEDIUM POLISHED%'
    and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size
```

The above query is planned into:

LeftAnti
    condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))

BroadcastNestedLoopJoinExec performs O(M\*N) comparisons. But if the NAAJ has only a single column, we can always build the buildSide into a HashSet, and the streamedSide then just needs to look up in that HashSet, so the computation is optimized to O(M).

This optimization only targets the single-column null-aware anti join case; multi-column support is much more complicated, and we might be able to support it in the future.
After applying this patch, TPCH Query 16 goes from 41 minutes down to 30 seconds.

The semantic of null-aware anti join is:

![image](https://user-images.githubusercontent.com/17242071/88077041-66a39a00-cbad-11ea-8fb6-c235c4d219b4.png)
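
An illustrative Scala sketch (not from the PR) of the single-column NOT IN shape that this optimization targets, using the `spark.sql.optimizeNullAwareAntiJoin` config referenced in the test section below:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("naaj").getOrCreate()
spark.conf.set("spark.sql.optimizeNullAwareAntiJoin", "true")

spark.range(10).createOrReplaceTempView("t1")
spark.range(3).createOrReplaceTempView("t2")

// Single-column NOT IN subquery => null-aware anti join on one key.
spark.sql("SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)").explain()
```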

### Why are the changes needed?
TPCH is a common benchmark for distributed compute engines. All of the other 21 queries work fine on Spark, except for Query 16; applying this patch will make Spark more competitive among these popular engines. Note that this patch has restricted rules and only applies to the single-column NAAJ case, which is safe enough.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. SQLQueryTestSuite with NOT IN keyword SQL, adding a CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off.
2. Added a case in org.apache.spark.sql.JoinSuite.
3. Added a case in org.apache.spark.sql.SubquerySuite.
4. Compared performance before and after applying this patch against TPCH Query 16.
5. Ran e2e tests against the following config combinations:

```
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "sspark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "true"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "true"
)
```

Closes #29104 from leanken/leanken-SPARK-32290.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-28 04:42:15 +00:00
Tianshi Zhu 44a5258ac2 [SPARK-31525][SQL] Return an empty list for df.head() when df is empty
### What changes were proposed in this pull request?

Return an empty list instead of None when calling `df.head()` on an empty DataFrame.

### Why are the changes needed?

`df.head()` and `df.head(1)` are inconsistent when df is empty.

### Does this PR introduce _any_ user-facing change?

Yes. If a user relies on `df.head()` to return None, things like `if df.head() is None:` will be broken.

### How was this patch tested?

Closes #29214 from tianshizz/SPARK-31525.

Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-28 12:32:19 +09:00
Shantanu 77f2ca6cce [MINOR][PYTHON] Fix spacing in error message
### What changes were proposed in this pull request?
Fixes spacing in an error message

### Why are the changes needed?
Makes error messages easier to read

### Does this PR introduce _any_ user-facing change?
Yes, it changes the error message

### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it

Closes #29264 from hauntsaninja/patch-1.

Authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-28 11:22:18 +09:00
Frank Yin 8323c8eb56 [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
### What changes were proposed in this pull request?
This PR is intended to solve schema pruning not working with window functions, as described in SPARK-32059. It also solves schema pruning not working with `Sort`, and generalizes the handling to `Project->Filter->[any node that can be pruned]`.
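
An illustrative Scala sketch (assumed schema, not from the PR) of a query where pruning through a window and a sort matters; with pruning, only `name.first` needs to be read from the nested struct (on a file-based source such as Parquet the pruned schema would show up in the scan):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("pruning").getOrCreate()
import spark.implicits._

case class Name(first: String, last: String)
case class Person(id: Long, name: Name)
val df = Seq(Person(1L, Name("Ada", "Lovelace")), Person(2L, Name("Alan", "Turing"))).toDF()

// Window + sort over a single nested field.
val w = Window.partitionBy($"name.first").orderBy($"id")
df.select($"name.first", row_number().over(w).as("rn"))
  .orderBy($"name.first")
  .explain()
```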

### Why are the changes needed?
This is needed because of performance issues when querying nested structures with window functions as well as sorting.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Introduced two tests: 1) optimizer planning level 2) end-to-end tests with SQL queries.

Closes #28898 from frankyin-factual/master.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-28 10:00:21 +09:00
GuoPhilipse 8de43338be [SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs
### What changes were proposed in this pull request?
Update the sql-ref docs; the following keywords are added in this PR:

CASE/ELSE
WHEN/THEN
MAP KEYS TERMINATED BY
NULL DEFINED AS
LINES TERMINATED BY
ESCAPED BY
COLLECTION ITEMS TERMINATED BY
PIVOT
LATERAL VIEW OUTER?
ROW FORMAT SERDE
ROW FORMAT DELIMITED
FIELDS TERMINATED BY
IGNORE NULLS
FIRST
LAST

### Why are the changes needed?
Help more users understand how to use these SQL keywords.

### Does this PR introduce _any_ user-facing change?
![image](https://user-images.githubusercontent.com/46367746/88148830-c6dc1f80-cc31-11ea-81ea-13bc9dc34550.png)
![image](https://user-images.githubusercontent.com/46367746/88148968-fb4fdb80-cc31-11ea-8649-e8297cf5813e.png)
![image](https://user-images.githubusercontent.com/46367746/88149000-073b9d80-cc32-11ea-9aa4-f914ecd72663.png)
![image](https://user-images.githubusercontent.com/46367746/88149021-0f93d880-cc32-11ea-86ed-7db8672b5aac.png)

### How was this patch tested?
No

Closes #29056 from GuoPhilipse/add-missing-keywords.

Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-28 09:41:53 +09:00
zhengruifeng f7542d3b61 [SPARK-32457][ML] logParam thresholds in DT/GBT/FM/LR/MLP
### What changes were proposed in this pull request?
logParam `thresholds` in DT/GBT/FM/LR/MLP

### Why are the changes needed?
The param `thresholds` is logged in NB/RF, but not in the other ProbabilisticClassifiers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #29257 from zhengruifeng/instr.logParams_add_thresholds.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-07-27 12:05:29 -07:00
HyukjinKwon c1140661bf [SPARK-32443][CORE] Use POSIX-compatible command -v in testCommandAvailable
### What changes were proposed in this pull request?

This PR aims to use `command -v` on non-Windows operating systems instead of executing the given command directly.

### Why are the changes needed?

1. `command` is POSIX-compatible
    - **POSIX.1-2017**:  https://pubs.opengroup.org/onlinepubs/9699919799/utilities/command.html
2. `command` is faster and safer than the direct execution
    - `command` doesn't invoke another process.
```scala
scala> sys.process.Process("ls").run().exitValue()
LICENSE
NOTICE
bin
doc
lib
man
res1: Int = 0
```

3. The existing way behaves inconsistently.
    - `rm` cannot be checked.

**AS-IS**
```scala
scala> sys.process.Process("rm").run().exitValue()
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file
res0: Int = 64
```

**TO-BE**
```
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.
scala> sys.process.Process(Seq("sh", "-c", s"command -v ls")).run().exitValue()
/bin/ls
val res1: Int = 0
```

4. The existing logic is already broken in the Scala 2.13 environment because it hangs, as shown below.
```scala
$ bin/scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.

scala> sys.process.Process("cat").run().exitValue() // hang here.
```
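
A hedged sketch of the approach described above (not the exact `TestUtils` code): probe availability with the POSIX `command -v` built-in instead of executing the command itself.

```scala
import scala.sys.process._

// Returns true if `cmd` resolves to something executable, without running it.
def isCommandAvailable(cmd: String): Boolean =
  Process(Seq("sh", "-c", s"command -v $cmd")).run().exitValue() == 0

isCommandAvailable("ls")                       // true on most systems
isCommandAvailable("some_nonexistent_command") // false
```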

### Does this PR introduce _any_ user-facing change?

No. Although this is inside the `main` source directory, it is used only for testing purposes.

```
$ git grep testCommandAvailable | grep -v 'def testCommandAvailable'
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("wc"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable(envCommand))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(!TestUtils.testCommandAvailable("some_nonexistent_command"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala:    assume(TestUtils.testCommandAvailable(envCommand))
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:  private lazy val isPythonAvailable: Boolean = TestUtils.testCommandAvailable(pythonExec)
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:    if (TestUtils.testCommandAvailable(pythonExec)) {
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala:    skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("python"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala:    assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("echo | sed"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:      assume(TestUtils.testCommandAvailable("/bin/bash"))
```

### How was this patch tested?

- **Scala 2.12**: Pass the Jenkins with the existing tests and one modified test.
- **Scala 2.13**: Do the following manually. It should pass instead of `hang`.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.PipedRDDSuite
...
Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29241 from dongjoon-hyun/SPARK-32443.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-27 12:02:43 -07:00
Kent Yao d315ebf3a7 [SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens
### What changes were proposed in this pull request?

When using the `Seconds.toMicros` API to convert epoch seconds to microseconds, overflow silently saturates to `Long.MIN_VALUE`/`Long.MAX_VALUE` (see its documentation):

```scala
/**
 * Equivalent to
 * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}.
 * @param duration the duration
 * @return the converted duration,
 * or {@code Long.MIN_VALUE} if conversion would negatively
 * overflow, or {@code Long.MAX_VALUE} if it would positively overflow.
 */
```
This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)`.
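
A small Scala sketch of the difference (names simplified; the constant matches Spark's `MICROS_PER_SECOND`, and the input is an arbitrary value large enough to overflow):

```scala
import java.util.concurrent.TimeUnit

val MICROS_PER_SECOND = 1000000L
val epochSeconds = 9300000000000000L  // seconds * 1e6 does not fit in a Long

TimeUnit.SECONDS.toMicros(epochSeconds)             // saturates to Long.MaxValue (silent wrong answer)
Math.multiplyExact(epochSeconds, MICROS_PER_SECOND) // throws ArithmeticException instead
```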

### Why are the changes needed?

Fix a silent data change between 3.x and 2.x:
```
 ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722   bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
+294247-01-10 12:00:54.775807
```
```
 kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S  -e "select to_timestamp('300000', 'y');"
284550-10-19 15:58:1010.448384
```

### Does this PR introduce _any_ user-facing change?

Yes, we will raise an `ArithmeticException` instead of giving a wrong answer on overflow.

### How was this patch tested?

add unit test

Closes #29220 from yaooqinn/SPARK-32424.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-27 17:03:14 +00:00
Cheng Su 548b7db345 [SPARK-32420][SQL] Add handling for unique key in non-codegen hash join
### What changes were proposed in this pull request?

`HashedRelation` has two separate code paths for unique key look-up and non-unique key look-up. E.g. in its subclass [`UnsafeHashedRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177), unique key look-up is more efficient as it does not have, e.g., the extra `Iterator[UnsafeRow].hasNext()/next()` overhead per row.

`BroadcastHashJoinExec` already handles unique key vs non-unique key separately in the [code-gen path](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321). But the non-codegen paths for broadcast hash join and shuffled hash join do not make this distinction yet, so this PR adds the support here.

### Why are the changes needed?

Shuffled hash join and non-codegen broadcast hash join still rely on this code path for execution. So this PR will help save CPU when executing these two types of join. Adding codegen for shuffled hash join is a different topic and I will add it in https://issues.apache.org/jira/browse/SPARK-32421 .

Ran the same query as [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153-L167) with this feature enabled and disabled. Verified a 20% wall clock time improvement (the control and test group order was also switched to verify the improvement is not noise).

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join unique key SHJ off
  Stopped after 5 iterations, 4039 ms
  Running case: shuffle hash join unique key SHJ on
  Stopped after 5 iterations, 2898 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ off                707            808          81          5.9         168.6       1.0X
shuffle hash join unique key SHJ on                 547            580          50          7.7         130.4       1.3X
```

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join unique key SHJ on
  Stopped after 5 iterations, 3333 ms
  Running case: shuffle hash join unique key SHJ off
  Stopped after 5 iterations, 4268 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ on                 565            667          60          7.4         134.8       1.0X
shuffle hash join unique key SHJ off                774            854          85          5.4         184.4       0.7X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Added test in `OuterJoinSuite` to cover left outer and right outer join.
* Added test in `ExistenceJoinSuite` to cover left semi join, and existence join.
* [Existing `joinSuite` already covered inner join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala#L182)
* [Existing `ExistenceJoinSuite` already covered left anti join, and existence join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala#L228)

Closes #29216 from c21/unique-key.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-27 17:01:03 +00:00
HyukjinKwon ea58e52823 [SPARK-32434][CORE][FOLLOW-UP] Fix load-spark-env.cmd to be able to run in Windows properly
### What changes were proposed in this pull request?

This PR is basically a followup of SPARK-26132 and SPARK-32434. In a Windows batch file you can't define an environment variable inside an `if` block and then use it within that same block. See also https://superuser.com/questions/78496/variables-in-batch-file-not-being-set-when-inside-if

### Why are the changes needed?

For Windows users to use Spark and fix the build in AppVeyor.

### Does this PR introduce _any_ user-facing change?

No, it's only in unreleased branches.

### How was this patch tested?

Manually tested on a local Windows machine, and AppVeyor build at https://github.com/HyukjinKwon/spark/pull/13. See https://ci.appveyor.com/project/HyukjinKwon/spark/builds/34316409

Closes #29254 from HyukjinKwon/SPARK-32434.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 22:37:08 +09:00
Warren Zhu 998086c9a1 [SPARK-30794][CORE] Stage Level scheduling: Add ability to set off heap memory
### What changes were proposed in this pull request?
Support setting off-heap memory in `ExecutorResourceRequests`.
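
A hedged sketch of how this could be used (the `offHeapMemory` method name is assumed from this PR; the sizes are arbitrary):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder}

val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("4g")
  .offHeapMemory("2g")  // the new capability described here (assumed API name)

val profile = new ResourceProfileBuilder().require(execReqs).build()
// rdd.withResources(profile) would then apply the profile to a stage.
```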

### Why are the changes needed?
Support stage level scheduling

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT in `ResourceProfileSuite` and `DAGSchedulerSuite`

Closes #28972 from warrenzhu25/30794.

Authored-by: Warren Zhu <zhonzh@microsoft.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-27 08:16:13 -05:00
HyukjinKwon a82aee0441 [SPARK-32435][PYTHON] Remove heapq3 port from Python 3
### What changes were proposed in this pull request?

This PR removes the manual port of `heapq3.py` introduced in SPARK-3073. The main reason for the port was to support Python 2.6 and 2.7, because Python 2's `heapq.merge()` doesn't support `key` and `reverse`.

See
- https://docs.python.org/2/library/heapq.html#heapq.merge in Python 2
- https://docs.python.org/3.8/library/heapq.html#heapq.merge in Python 3

Since we dropped Python 2 in SPARK-32138, we can remove this.

### Why are the changes needed?

To remove unnecessary code. Also, we can leverage bug fixes made to `heapq` in Python 3.x.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Existing tests should cover. I locally ran and verified:

```bash
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle"
./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter"
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key"
```

Closes #29229 from HyukjinKwon/SPARK-32435.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 20:10:13 +09:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to explicitly list the APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed only classes, instead of functions, in ML and MLlib to make it relatively easier to manage.

### Why are the changes needed?

I often hear complaints from users that the current PySpark documentation - https://spark.apache.org/docs/latest/api/python/index.html - is pretty messy to read compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00
SaurabhChawla 99f33ec30f [SPARK-32234][FOLLOWUP][SQL] Update the description of utility method
### What changes were proposed in this pull request?
As part of this work, PR https://github.com/apache/spark/pull/29045 added a helper method. This PR is the follow-up to update the description of that helper method.

### Why are the changes needed?
For better readability and understanding of the code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Since the only change is updating the description, I just ran the Spark shell.

Closes #29232 from SaurabhChawla100/SPARK-32234-Desc.

Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-27 08:14:02 +00:00
HyukjinKwon bfa5d57bbd [SPARK-32452][R][SQL] Bump up the minimum Arrow version as 1.0.0 in SparkR
### What changes were proposed in this pull request?

This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date.

Other required changes to support 1.0.0 were already made in SPARK-32451.

### Why are the changes needed?

On the R side, people are rather aggressively encouraged to use the latest version, and SparkR vectorization, which was added in Spark 3.0, is still very experimental.

Also, we're technically not testing old Arrow versions in SparkR for now.

### Does this PR introduce _any_ user-facing change?

Yes, users wouldn't be able to use SparkR with old Arrow.

### How was this patch tested?

GitHub Actions and AppVeyor are already testing them.

Closes #29253 from HyukjinKwon/SPARK-32452.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 14:21:15 +09:00
Cheng Su 01cf8a4ce8 [SPARK-32383][SQL] Preserve hash join (BHJ and SHJ) stream side ordering
### What changes were proposed in this pull request?

Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve their children's output ordering information (they inherit from `SparkPlan.outputOrdering`, which is Nil). This can add an unnecessary sort in complex queries involving multiple joins.

Example:

```
withSQLConf(
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
      val df1 = spark.range(100).select($"id".as("k1"))
      val df2 = spark.range(100).select($"id".as("k2"))
      val df3 = spark.range(3).select($"id".as("k3"))
      val df4 = spark.range(100).select($"id".as("k4"))
      val plan = df1.join(df2, $"k1" === $"k2")
        .join(df3, $"k1" === $"k3")
        .join(df4, $"k1" === $"k4")
        .queryExecution
        .executedPlan
}
```

Current physical plan (extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
:  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:     :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:     :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
:     :  :     +- *(1) Project [id#218L AS k1#220L]
:     :  :        +- *(1) Range (0, 100, step=1, splits=2)
:     :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
:     :        +- *(3) Project [id#222L AS k2#224L]
:     :           +- *(3) Range (0, 100, step=1, splits=2)
:     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141]
:        +- *(5) Project [id#226L AS k3#228L]
:           +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```

Ideal physical plan (no extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
:  :  :     +- *(1) Project [id#218L AS k1#220L]
:  :  :        +- *(1) Range (0, 100, step=1, splits=2)
:  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:  :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
:  :        +- *(3) Project [id#222L AS k2#224L]
:  :           +- *(3) Range (0, 100, step=1, splits=2)
:  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140]
:     +- *(5) Project [id#226L AS k3#228L]
:        +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```

### Why are the changes needed?

To avoid an unnecessary sort in the query; this has the most impact when users read sorted bucketed tables.
Though the unnecessary sort operates on already-sorted data, it would have an obvious negative impact on IO and query run time if the data is large and external sorting happens.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29181 from c21/ordering.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-27 04:51:32 +00:00
Dongjoon Hyun 13c64c2980 [SPARK-32448][K8S][TESTS] Use single version for exec-maven-plugin/scalatest-maven-plugin
### What changes were proposed in this pull request?

Two different versions are used for the same artifacts, `exec-maven-plugin` and `scalatest-maven-plugin`. This PR aims to use the same versions for `exec-maven-plugin` and `scalatest-maven-plugin`. In addition, this PR removes `scala-maven-plugin.version` from `K8s` integration suite because it's unused.

### Why are the changes needed?

This will prevent the mistake which upgrades only one place and forgets the others.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins K8S IT.

Closes #29248 from dongjoon-hyun/SPARK-32448.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-26 19:25:41 -07:00
Dongjoon Hyun 8153f56286 [SPARK-32451][R] Support Apache Arrow 1.0.0
### What changes were proposed in this pull request?

Currently, `GitHub Action` is broken due to `SparkR UT failure` from new Apache Arrow 1.0.0.

![Screen Shot 2020-07-26 at 5 12 08 PM](https://user-images.githubusercontent.com/9700541/88492923-3409f080-cf63-11ea-8fea-6051298c2dd0.png)

This PR aims to update R code according to Apache Arrow 1.0.0 recommendation to pass R unit tests.

An alternative is pinning Apache Arrow version at 0.17.1 and I also created a PR to compare with this.
- https://github.com/apache/spark/pull/29251

### Why are the changes needed?

- Apache Spark 3.1 supports Apache Arrow 0.15.1+.
- Apache Arrow released 1.0.0 a few days ago and this causes GitHub Action SparkR test failures due to warnings.
    - https://github.com/apache/spark/commits/master

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Pass the Jenkins (https://github.com/apache/spark/pull/29252#issuecomment-664067492)
- [x] Pass the GitHub (https://github.com/apache/spark/runs/912656867)

Closes #29252 from dongjoon-hyun/SPARK-ARROW.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-26 18:51:25 -07:00
Liang-Chi Hsieh 70ac594bb3 [SPARK-32450][PYTHON] Upgrade pycodestyle to v2.6.0
### What changes were proposed in this pull request?

This patch upgrades pycodestyle from v2.4.0 to v2.6.0. The changes at each release:

2.5.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id3
2.6.0a1: https://pycodestyle.pycqa.org/en/latest/developer.html#a1-2020-04-23
2.6.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id2

Changes: Dropped Python 2.6 and 3.3 support, added Python 3.7 and 3.8 support...

### Why are the changes needed?

Including bug fixes and newer Python version support.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Ran `dev/lint-python` locally.

Closes #29249 from viirya/upgrade-pycodestyle.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 10:43:32 +09:00
Dongjoon Hyun 4f79b9fffd [SPARK-32447][CORE] Use python3 by default in pyspark and find-spark-home scripts
### What changes were proposed in this pull request?

This PR aims to use `python3` instead of `python` inside `bin/pyspark`, `bin/find-spark-home` and `bin/find-spark-home.cmd` script.
```
$ git diff master --stat
 bin/find-spark-home     | 4 ++--
 bin/find-spark-home.cmd | 4 ++--
 bin/pyspark             | 4 ++--
```

### Why are the changes needed?

According to [PEP 394](https://www.python.org/dev/peps/pep-0394/), we have four different cases for `python` while `python3` will always be there.
```
- Distributors may choose to set the behavior of the python command as follows:
      python2,
      python3,
      not provide python command,
      allow python to be configurable by an end user or a system administrator.
```

Moreover, these scripts already depend on `find_spark_home.py` which is using `#!/usr/bin/env python3`.
```
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
```

### Does this PR introduce _any_ user-facing change?

No. Apache Spark 3.1 already drops Python 2.7 via SPARK-32138 .

### How was this patch tested?

Pass the Jenkins or GitHub Action.

Closes #29246 from dongjoon-hyun/SPARK-FIND-SPARK-HOME.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-26 15:55:48 -07:00
Dongjoon Hyun 7e0c5b3b53 [SPARK-32442][CORE][TESTS] Fix TaskSetManagerSuite by hiding o.a.s.FakeSchedulerBackend
### What changes were proposed in this pull request?

There exists two `FakeSchedulerBackend` classes.
```
$ git grep "class FakeSchedulerBackend"
core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala:private class FakeSchedulerBackend(
core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala:class FakeSchedulerBackend extends SchedulerBackend {
```

This PR aims to hide the following at `TaskSetManagerSuite`.
```scala
import org.apache.spark.{FakeSchedulerBackend => _, _}
```

### Why are the changes needed?

Although `TaskSetManagerSuite` is inside the `org.apache.spark.scheduler` package, `import org.apache.spark._` confuses Scala 2.13 and causes 4 UT failures.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.TaskSetManagerSuite
...
Tests: succeeded 48, failed 4, canceled 0, ignored 0, pending 0
*** 4 TESTS FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- **Scala 2.12**: Pass the Jenkins or GitHub Action
- **Scala 2.13**: Pass the following manually.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.TaskSetManagerSuite
...
Tests: succeeded 52, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29240 from dongjoon-hyun/SPARK-32442.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-26 07:54:30 -07:00
Itsuki Toyota 86ead044e3 [SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample consistently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call collect on the RDD before calling foreach so that the result is sent to the driver node and printed on that node's stdout.
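
A small Scala sketch of the pattern described (illustrative data; not the example's exact code):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("metrics").getOrCreate()
val scoreAndLabels = spark.sparkContext.parallelize(
  Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0), (0.4, 0.0)))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Before: foreach runs on executors, so in cluster mode the output lands in executor logs.
// metrics.precisionByThreshold().foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }

// After: collect first, so the println happens on the driver.
metrics.precisionByThreshold().collect().foreach {
  case (t, p) => println(s"Threshold: $t, Precision: $p")
}
```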

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job runs in local mode, the data stays on the driver node and the metrics are printed on the driver's stdout.
However, if the job runs in cluster mode, the metrics are printed on the executors' stdout.
This is inconsistent compared to the metrics that are not RDDs (e.g., auPRC, auROC), since those always output their result on the driver's stdout.
All of the metrics should output their result on the driver's stdout.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <titsuki@cpan.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-26 09:12:43 -05:00
Dongjoon Hyun 83ffef7ffb [SPARK-32441][BUILD][CORE] Update json4s to 3.7.0-M5 for Scala 2.13
### What changes were proposed in this pull request?

This PR aims to upgrade `json4s` from 3.6.6 to 3.7.0-M5 for Scala 2.13 support in Apache Spark 3.1.0, due in December. We will upgrade to the latest `json4s` around November.

### Why are the changes needed?

`json4s` starts to support Scala 2.13 since v3.7.0-M4.
- https://github.com/json4s/json4s/issues/660
- b013af8e75

Old `json4s` causes many UT failures with `NoSuchMethodException`.
```scala
 Cause: java.lang.NoSuchMethodException: scala.collection.immutable.Seq$.apply(scala.collection.Seq)
  at java.lang.Class.getMethod(Class.java:1786)
```

The following is one example.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 4, failed 9, canceled 0, ignored 0, pending 0
*** 9 TESTS FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. **Scala 2.12**: Pass the Jenkins or GitHub Action with the existing tests.
2. **Scala 2.13**: Do the following manually at least.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29239 from dongjoon-hyun/SPARK-32441.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 20:34:31 -07:00
Dongjoon Hyun 147022a5c6 [SPARK-32440][CORE][TESTS] Make BlockManagerSuite robust from Scala object size difference
### What changes were proposed in this pull request?

This PR aims to increase the memory parameter in `BlockManagerSuite`'s worker decommission test cases.

### Why are the changes needed?

Scala 2.13 generates different Java objects and this affects Spark's `SizeEstimator/SizeTracker/SizeTrackingVector`. This causes UT failures like the following. If we decrease the values, those test cases fail in Scala 2.12, too.

```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.storage.BlockManagerSuite
...
- test decommission block manager should not be part of peers *** FAILED ***
  0 did not equal 2 (BlockManagerSuite.scala:1869)
- test decommissionRddCacheBlocks should offload all cached blocks *** FAILED ***
  0 did not equal 2 (BlockManagerSuite.scala:1884)
...
Tests: succeeded 81, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.storage.BlockManagerSuite
...
Tests: succeeded 83, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29238 from dongjoon-hyun/SPARK-32440.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 15:54:21 -07:00
Dongjoon Hyun d1301af4eb [SPARK-32437][CORE][FOLLOWUP] Update dependency manifest for RoaringBitmap 0.9.0 2020-07-25 10:58:25 -07:00
Dongjoon Hyun 80e8898158 [SPARK-32438][CORE][TESTS] Use HashMap.withDefaultValue in RDDSuite
### What changes were proposed in this pull request?

In Scala 2.13, `HashMap` is deprecated for inheritance (it will be made final in the future) and `.withDefault` is recommended instead. This PR aims to use `HashMap.withDefaultValue` instead of overriding it manually in the test case.

- https://www.scala-lang.org/api/current/scala/collection/mutable/HashMap.html

```scala
deprecatedInheritance(message =
"HashMap wil be made final; use .withDefault for the common use case of computing a default value",
since = "2.13.0")
```
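
A minimal illustration of the Scala 2.13-safe pattern, instead of manually overriding the default:

```scala
import scala.collection.mutable

// Missing keys resolve to 0 without subclassing HashMap.
val counts = mutable.HashMap.empty[String, Int].withDefaultValue(0)
counts("a") += 1
counts("a")  // 1
counts("b")  // 0, via the default value
```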

### Why are the changes needed?

In Scala 2.13, the existing code causes a failure because the default value function doesn't work correctly.

```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite
- aggregate *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 61.0 failed 1 times, most recent failure: Lost task 0.0 in stage 61.0 (TID 198, localhost, executor driver):
java.util.NoSuchElementException: key not found: a
```

### Does this PR introduce _any_ user-facing change?

No. This is a test case change.

### How was this patch tested?

1. **Scala 2.12:** Pass the Jenkins or GitHub with the existing tests.
2. **Scala 2.13**: Manually do the following.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite
...
Tests: succeeded 72, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29235 from dongjoon-hyun/SPARK-32438.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 10:52:55 -07:00
Dongjoon Hyun f9f18673dc [SPARK-32436][CORE] Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal
### What changes were proposed in this pull request?

This PR aims to initialize `numNonEmptyBlocks` in `HighlyCompressedMapStatus.readExternal`.

In Scala 2.12, this is initialized to `-1` via the following.
```scala
protected def this() = this(null, -1, null, -1, null, -1)  // For deserialization only
```

### Why are the changes needed?

In Scala 2.13, this causes several UT failures because `HighlyCompressedMapStatus.readExternal` doesn't initialize this field. The following is one example.

- org.apache.spark.scheduler.MapStatusSuite
```
MapStatusSuite:
- compressSize
- decompressSize
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: numNonEmptyBlocks
  at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:181)
  at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:281)
  at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:73)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$8(MapStatusSuite.scala:64)
  at scala.runtime.java8.JFunction1$mcVD$sp.apply(JFunction1$mcVD$sp.scala:18)
  at scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$7(MapStatusSuite.scala:61)
  at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.scala:18)
  at scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$6(MapStatusSuite.scala:60)
  ...
```

### Does this PR introduce _any_ user-facing change?

No. This is a private class.

### How was this patch tested?

1. Pass the GitHub Action or Jenkins with the existing tests.
2. Test with Scala-2.13 with `MapStatusSuite`.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.MapStatusSuite
...
MapStatusSuite:
- compressSize
- decompressSize
- MapStatus should never report non-empty blocks' sizes as 0
- large tasks should use org.apache.spark.scheduler.HighlyCompressedMapStatus
- HighlyCompressedMapStatus: estimated size should be the average non-empty block size
- SPARK-22540: ensure HighlyCompressedMapStatus calculates correct avgSize
- RoaringBitmap: runOptimize succeeded
- RoaringBitmap: runOptimize failed
- Blocks which are bigger than SHUFFLE_ACCURATE_BLOCK_THRESHOLD should not be underestimated.
- SPARK-21133 HighlyCompressedMapStatus#writeExternal throws NPE
Run completed in 7 seconds, 971 milliseconds.
Total number of tests run: 10
Suites: completed 2, aborted 0
Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #29231 from dongjoon-hyun/SPARK-32436.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 10:16:01 -07:00
Dongjoon Hyun aab1e09f1c [SPARK-32434][CORE] Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts
### What changes were proposed in this pull request?

This PR aims to support Scala 2.13 in `AbstractCommandBuilder.java` and the `load-spark-env` scripts.

### Why are the changes needed?

Currently, only Scala 2.12 is supported and the following fails.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -DwildcardSuites=none -Dtest=org.apache.spark.launcher.SparkLauncherSuite
...
[ERROR] Failures:
[ERROR]   SparkLauncherSuite.testChildProcLauncher:123 expected:<0> but was:<1>
[ERROR]   SparkLauncherSuite.testSparkLauncherGetError:274
[ERROR] Tests run: 6, Failures: 2, Errors: 0, Skipped: 0
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This should be tested manually with the above command.
```
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  2.186 s]
[INFO] Spark Project Tags ................................. SUCCESS [  4.400 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  1.744 s]
[INFO] Spark Project Networking ........................... SUCCESS [  2.233 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  1.527 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  5.564 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  1.946 s]
[INFO] Spark Project Core ................................. SUCCESS [01:21 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:41 min
[INFO] Finished at: 2020-07-24T20:04:34-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #29227 from dongjoon-hyun/SPARK-32434.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 08:19:02 -07:00
Dongjoon Hyun f642234d85 [SPARK-32437][CORE] Improve MapStatus deserialization speed with RoaringBitmap 0.9.0
### What changes were proposed in this pull request?

This PR aims to speed up `MapStatus` deserialization by 5~18% with the latest RoaringBitmap `0.9.0` and new APIs. Note that we focus on `deserialization` time because `serialization` occurs once while `deserialization` occurs many times.

### Why are the changes needed?

The current version is too old. We had better upgrade it to get the performance improvement and bug fixes.
Although `MapStatusesSerDeserBenchmark` is synthetic, the benchmark result is updated with this patch.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins or GitHub Action.

Closes #29233 from dongjoon-hyun/SPARK-ROAR.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 08:07:28 -07:00
sychen be9f03dc71 [SPARK-32426][SQL] ui shows sql after variable substitution
### What changes were proposed in this pull request?
When submitting SQL that contains variables, the SQL displayed in the UI does not have the variables substituted.

### Why are the changes needed?
So users can see the final executed SQL in the UI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual test

Closes #29221 from cxzl25/SPARK-32426.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 03:30:01 -07:00
HyukjinKwon 277a4063ef [SPARK-32422][SQL][TESTS] Use python3 executable instead of python3.6 in IntegratedUDFTestUtils
### What changes were proposed in this pull request?

This PR uses `python3` instead of `python3.6` executable as a fallback in `IntegratedUDFTestUtils`.

### Why are the changes needed?

Currently, GitHub Actions skips pandas UDFs. Python 3.8 is installed explicitly, but somehow `python3.6` appears to be available in the GitHub Actions build environment by default.

```
[info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF is skipped because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED !!!
...
[info] - udf/postgreSQL/udf-select_having.sql - Scalar Pandas UDF is skipped because pyspark,pandas and/or pyarrow were not available in [python3.6]. !!! IGNORED !!!
...
```

`python3.6` was chosen so that Jenkins would pick one Python explicitly; however, it looks like we're already using `python3` here and there.

It will also reduce the overhead to fix when we deprecate or drop Python versions.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It should be tested in Jenkins and GitHub Actions environments here.

Closes #29217 from HyukjinKwon/SPARK-32422.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-25 03:06:45 -07:00
HyukjinKwon 8e36a8f33f [SPARK-32419][PYTHON][BUILD] Avoid using subshell for Conda env (de)activation in pip packaging test
### What changes were proposed in this pull request?

This PR proposes to avoid using a subshell when activating the Conda environment. It looks like the env ends up being activated only within the subshell even if you use the `conda` command.

### Why are the changes needed?

If you take a close look for GitHub Actions log:

```
 Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Using legacy setup.py install for pyspark, since package 'wheel' is not installed.
Installing collected packages: py4j, pyspark
 Running setup.py install for pyspark: started
 Running setup.py install for pyspark: finished with status 'done'
Successfully installed py4j-0.10.9 pyspark-3.1.0.dev0

...

Installing dist into virtual env
Obtaining file:///home/runner/work/spark/spark/python
Collecting py4j==0.10.9
 Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Installing collected packages: py4j, pyspark
 Attempting uninstall: py4j
 Found existing installation: py4j 0.10.9
 Uninstalling py4j-0.10.9:
 Successfully uninstalled py4j-0.10.9
 Attempting uninstall: pyspark
 Found existing installation: pyspark 3.1.0.dev0
 Uninstalling pyspark-3.1.0.dev0:
 Successfully uninstalled pyspark-3.1.0.dev0
 Running setup.py develop for pyspark
Successfully installed py4j-0.10.9 pyspark
```

It looks like Conda is not being used properly, since the previously installed package is removed and reinstalled when the second installation runs.
We should ideally test against the Conda environment as intended.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions will test. I also manually tested in my local.

Closes #29212 from HyukjinKwon/SPARK-32419.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-25 13:09:23 +09:00
Gabor Somogyi b890fdc8df [SPARK-32387][SS] Extract UninterruptibleThread runner logic from KafkaOffsetReader
### What changes were proposed in this pull request?
The `UninterruptibleThread` running functionality is baked into `KafkaOffsetReader` and can be extracted into a separate class. The main intention is to simplify `KafkaOffsetReader` in order to make it easier to solve SPARK-32032. In this PR I've made this extraction without any functionality change.
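
A minimal sketch of the extracted concept, as it would sit inside Spark's Kafka connector module (class and method names below are assumptions, not necessarily the exact ones in this PR): every body is executed on a single dedicated `UninterruptibleThread`, so Kafka consumer calls cannot be interrupted mid-flight.

```scala
import java.util.concurrent.{Executors, ThreadFactory}

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.util.UninterruptibleThread

// Illustrative runner: all bodies run on one dedicated UninterruptibleThread.
class UninterruptibleThreadRunnerSketch(threadName: String) {
  private val execContext = ExecutionContext.fromExecutorService(
    Executors.newSingleThreadExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new UninterruptibleThread(threadName) {
          override def run(): Unit = r.run()
        }
        t.setDaemon(true)
        t
      }
    }))

  /** Run `body` on the dedicated thread unless we are already on such a thread. */
  def runUninterruptibly[T](body: => T): T = {
    if (!Thread.currentThread.isInstanceOf[UninterruptibleThread]) {
      Await.result(Future { body }(execContext), Duration.Inf)
    } else {
      body
    }
  }

  def shutdown(): Unit = execContext.shutdown()
}
```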

### Why are the changes needed?
`UninterruptibleThread` running functionality is baked into `KafkaOffsetReader`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.

Closes #29187 from gaborgsomogyi/SPARK-32387.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 11:41:42 -07:00
Thomas Graves e6ef27be52 [SPARK-32287][TESTS] Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
### What changes were proposed in this pull request?

I wasn't able to reproduce the failure, but as best I can tell the allocation manager timer triggers and calls doRequest. The timeout is 10s, so this change increases it to 30 seconds.

### Why are the changes needed?

test failure

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

unit test

Closes #29225 from tgravescs/SPARK-32287.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 11:12:28 -07:00
Andy Grove 64a01c0a55 [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation
### What changes were proposed in this pull request?

Provide a generic mechanism for plugins to inject rules into the AQE "query prep" stage that happens before query stage creation.

This goes along with https://issues.apache.org/jira/browse/SPARK-32332 where the current AQE implementation doesn't allow for users to properly extend it for columnar processing.
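
A hedged sketch of how a plugin might use the new hook, assuming it is exposed as `injectQueryStagePrepRule` on `SparkSessionExtensions` (the extension class and config value are illustrative):

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

// A no-op rule that runs during AQE query stage preparation; a real plugin
// would inspect or tag the plan here before query stages are created.
case class MyQueryStagePrepRule() extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectQueryStagePrepRule(session => MyQueryStagePrepRule())
  }
}

// Enabled with: --conf spark.sql.extensions=com.example.MyExtensions
```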

### Why are the changes needed?

The issue here is that we create new query stages but do not have access to the parent plan of the new query stage, so certain things cannot be determined because you have to know what the parent did. With this change, you can add tags to the plan to be able to figure out what is going on.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test is included in the PR.

Closes #29224 from andygrove/insert-aqe-rule.

Authored-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 11:03:57 -07:00
Kent Yao d3596c04b0 [SPARK-32406][SQL] Make RESET syntax support single configuration reset
### What changes were proposed in this pull request?

This PR extends the RESET command to support resetting SQL configurations one by one.
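
For illustration, a hedged sketch of the extended syntax (the configuration key below is only an example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("SET spark.sql.shuffle.partitions=10")  // override a runtime config
spark.sql("RESET spark.sql.shuffle.partitions")   // new: restore only this key to its default
spark.sql("RESET")                                // existing behavior: restore all runtime configs
```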

### Why are the changes needed?

Currently, the RESET command only supports restoring all of the runtime configurations to their defaults. In most cases, users do not want this, but just want to restore one or a small group of settings.
The SET command can work as a workaround for this, but you have to keep the defaults in mind or in temporary variables, which turns out to be inconvenient.

Hive supports this:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample

`reset <key>`: Resets the value of a particular configuration variable (key) to the default value. Note: if you misspell the variable name, Beeline will not show an error.

PostgreSQL supports this too

https://www.postgresql.org/docs/9.1/sql-reset.html

### Does this PR introduce _any_ user-facing change?

Yes, RESET can now restore a single configuration to its default.

### How was this patch tested?

Added new unit tests.

Closes #29202 from yaooqinn/SPARK-32406.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 09:13:26 -07:00
HyukjinKwon fa184c3308 [SPARK-32408][BUILD] Enable crossPaths back to prevent side effects
### What changes were proposed in this pull request?

This PR proposes to enable `crossPaths` back for now to match the build as it was.
Based on my observation, it still non-deterministically does not run JUnit tests, and this PR basically reverts the partial fix from https://github.com/apache/spark/pull/29057.

See also https://github.com/apache/spark/pull/29205 for the full context.

### Why are the changes needed?

To prevent the side effects related to `crossPaths`, such as SPARK_PREPEND_CLASSES or the PySpark tests that run conditionally if the test classes are present.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```bash
build/sbt -Phadoop-2.7 -Phive -Phive-2.3 -Phive-thriftserver -DskipTests clean test:package
./python/run-tests --python-executable=python3 --testname="pyspark.sql.tests.test_dataframe QueryExecutionListenerTests"
```

Closes #29218 from HyukjinKwon/SPARK-32408-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 08:52:30 -07:00
Max Gekk 8bc799f920 [SPARK-32375][SQL] Basic functionality of table catalog v2 for JDBC
### What changes were proposed in this pull request?
This PR implements basic functionality of the `TableCatalog` interface, so that end users can use JDBC as a catalog.
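
A hedged configuration sketch (the catalog name, JDBC options, and table name below are illustrative assumptions; the implementation class is assumed to be the one added by this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.h2",
    "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
  .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
  .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
  .getOrCreate()

// Tables in the JDBC source are then addressed through the catalog name.
spark.sql("SELECT * FROM h2.test_schema.people").show()
```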

### Why are the changes needed?
To have at least one built-in implementation of the Catalog Plugin API available to end users. JDBC is a perfect fit for this.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By new test suite `JDBCTableCatalogSuite`.

Closes #29168 from MaxGekk/jdbc-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-24 14:12:43 +00:00
Gengliang Wang 8896f4af87 Revert "[SPARK-32253][INFRA] Show errors only for the sbt tests of github actions"
### What changes were proposed in this pull request?

This reverts commit 026b0b926d.

### Why are the changes needed?

As HyukjinKwon pointed out in https://github.com/apache/spark/pull/29133#issuecomment-663339240, there is no JUnit test report after https://github.com/apache/spark/pull/29133. Let's revert https://github.com/apache/spark/pull/29133 for now and find a better solution to improve the log output later.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build

Closes #29219 from gengliangwang/revertErrorOnly.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-07-24 18:14:19 +08:00
Liang-Chi Hsieh 84efa04c57 [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase
### What changes were proposed in this pull request?

Currently, the by-name resolution logic of `unionByName` is implemented in the API code. This patch moves the logic to the analysis phase.
See https://github.com/apache/spark/pull/28996#discussion_r453460284.
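
For context, a small usage example; the behavior is unchanged by this refactoring, only where the by-name resolution happens moves:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")

// Columns are matched by name, not by position; the output schema follows df1.
df1.unionByName(df2).show()
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+
```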

### Why are the changes needed?

Logically, we should do resolution in the analysis phase. This refactoring cleans up the API method and makes resolution consistent.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29107 from viirya/move-union-by-name.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-24 04:33:18 +00:00
Max Gekk 19e3ed765a [SPARK-32415][SQL][TESTS] Enable tests for JSON option: allowNonNumericNumbers
### What changes were proposed in this pull request?
Enable two tests from `JsonParsingOptionsSuite` (see the sketch after this list for what the option controls):
- `allowNonNumericNumbers off`
- `allowNonNumericNumbers on`

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the enabled tests.

Closes #29207 from MaxGekk/allowNonNumericNumbers-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-24 09:55:36 +09:00
Max Gekk 658e87471c [SPARK-30648][SQL][FOLLOWUP] Refactoring of JsonFilters: move config checking out
### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`.
2. Moving the config check out of `JsonFilters` makes it consistent with `OrderedFilters` (see https://github.com/apache/spark/pull/29145).
3. `JsonFilters` can be used by other datasources in the future and will not depend on the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-24 09:54:11 +09:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though the diff looks large, 99% of it consists of changes to the new locations of the scalatest classes.
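
Most of the diff is mechanical, e.g. the relocation of the base suite classes in scalatest 3.x (an illustrative example, not a specific file from this PR):

```scala
// before (scalatest 3.0.x):
//   import org.scalatest.FunSuite
//   class ExampleSuite extends FunSuite { ... }

// after (scalatest 3.2.x): the suite traits moved into style-specific packages.
import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite {
  test("simple arithmetic") {
    assert(1 + 1 === 2)
  }
}
```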

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Venkata krishnan Sowrirajan e7fb67cd88 [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's blacklisting feature
### What changes were proposed in this pull request?
With this change, when dynamic allocation is enabled, instead of aborting immediately when there is an unschedulable task set due to blacklisting, a `SparkListenerUnschedulableTaskSetAdded` event is sent, which is handled by `ExecutorAllocationManager` to request the additional executors needed to schedule the unschedulable blacklisted tasks. Once the event is sent, we start the abort timer, similar to [SPARK-22148][SPARK-15815], to abort when no new executors are launched, either because the maximum number of executors has been reached or the cluster manager is out of capacity.

### Why are the changes needed?
This is an improvement. When dynamic allocation is enabled, this change requests more executors to schedule the unschedulable tasks instead of aborting the stage without even retrying up to spark.task.maxFailures times (in some cases not retrying at all). That is a potential issue with respect to Spark's fault tolerance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests both in ExecutorAllocationManagerSuite and TaskSchedulerImplSuite

Closes #28287 from venkata91/SPARK-31418.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-23 12:33:22 -05:00