Commit graph

31197 commits

Author SHA1 Message Date
Lukas Rytz 1a62e6a2c1 [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the POM.

### What changes were proposed in this pull request?

This PR suggests working around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.

I also included an upgrade of scala-parallel-collections to version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.
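
Until a fixed release is available, an affected Scala 2.13 project can declare the dependency explicitly (an illustrative sbt line, not part of this PR; the coordinates are the standard scala-parallel-collections ones):

```scala
// build.sbt -- declare the dependency directly as a workaround (sketch)
libraryDependencies +=
  "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.3"
```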

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:50 -05:00
Max Gekk bd62ad9982 [SPARK-36736][SQL] Support ILIKE (ALL | ANY | SOME) - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL | ANY | SOME)` expression - `ILIKE`. In this way, Spark's users can match strings against multiple patterns in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_example(subject varchar(20));
spark-sql> insert into ilike_example values
         > ('jane doe'),
         > ('Jane Doe'),
         > ('JANE DOE'),
         > ('John Doe'),
         > ('John Smith');
spark-sql> select *
         > from ilike_example
         > where subject ilike any ('jane%', '%SMITH')
         > order by subject;
JANE DOE
Jane Doe
John Smith
jane doe
```

The syntax of `ILIKE` is similar to `LIKE`:
```
str NOT? ILIKE (ANY | SOME | ALL) (pattern+)
```
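
A sketch of the `ALL` variant over the same table (written for spark-shell; the expected matches are noted in a comment and assumed, not taken from the PR):

```scala
// ALL requires the subject to match every pattern, case-insensitively.
spark.sql("""
  SELECT * FROM ilike_example
  WHERE subject ILIKE ALL ('jane%', '%DOE')
  ORDER BY subject
""").show()
// Expected to keep JANE DOE, Jane Doe and jane doe, but not John Doe/John Smith.
```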

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test to test parsing of `ILIKE`:
```
$ build/sbt "test:testOnly *.ExpressionParserSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql"
```

Closes #33966 from MaxGekk/ilike-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 22:51:49 +08:00
Kousuke Saruta e858cd568a [SPARK-36724][SQL] Support timestamp_ntz as a type of time column for SessionWindow
### What changes were proposed in this pull request?

This PR proposes to support `timestamp_ntz` as a type of time column for `SessionWindow`, like `TimeWindow` does.
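
A minimal sketch of what this enables (illustrative values; `session_window` is the existing function in `org.apache.spark.sql.functions`, and the `TIMESTAMP_NTZ` literal syntax is assumed available):

```scala
import org.apache.spark.sql.functions.{col, session_window}

// With this change, the session window's time column may be TIMESTAMP_NTZ,
// just as it may for TimeWindow.
val events = spark.sql(
  "SELECT timestamp_ntz'2021-09-13 10:00:00' AS time, 'userA' AS id")
events
  .groupBy(session_window(col("time"), "5 minutes"), col("id"))
  .count()
  .show(truncate = false)
```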

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33965 from sarutak/session-window-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-13 21:47:43 +08:00
Yuto Akutsu 3747cfdb40 [SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API
### What changes were proposed in this pull request?

Fixed wrong documentation on Cot API

### Why are the changes needed?

[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.
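
A quick way to check the corrected formula (the printed values are assumed to agree):

```scala
// COT(1) should equal 1 / tan(1), per the corrected documentation.
spark.sql("SELECT COT(1)").show()
println(1 / java.lang.Math.tan(1)) // ~0.6420926159343306
```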

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #33978 from yutoacts/SPARK-36738.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-13 21:51:29 +09:00
ulysses-you 4a6b2b9fc8 [SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle
### What changes were proposed in this pull request?

- move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase.
- run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase.
- add `SkewJoinAwareCost` to support estimate skewed join cost
- add new config to decide if force optimize skewed join
- in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with lower cost.

### Why are the changes needed?

In general, a skewed join has more impact on performance than one more shuffle. It makes sense to force skewed-join optimization even if it introduces an extra shuffle.

A common case:
```
HashAggregate
  SortMergeJoin
    Sort
      Exchange
    Sort
      Exchange
```
and after this PR, the plan looks like:
```
HashAggregate
  Exchange
    SortMergeJoin (isSkew=true)
      Sort
        Exchange
      Sort
        Exchange
```

Note that the newly introduced shuffle can also be optimized by AQE.
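
A sketch of enabling the new behavior (the first two settings are existing AQE flags; the third is the flag this PR adds, with the name assumed from the AQE docs):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// Force skew-join optimization even when it introduces an extra shuffle.
spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "true")
```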

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

* Add new test
* pass existing test `SPARK-30524: Do not optimize skew join if introduce additional shuffle`
* pass existing test `SPARK-33551: Do not use custom shuffle reader for repartition`

Closes #32816 from ulysses-you/support-extra-shuffle.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 17:21:27 +08:00
Kousuke Saruta e1e19619b7 [SPARK-36729][BUILD] Upgrade Netty from 4.1.63 to 4.1.68
### What changes were proposed in this pull request?

This PR upgrades Netty from `4.1.63` to `4.1.68`.

All the changes from `4.1.64` to `4.1.68` are as follows.

* 4.1.64 and 4.1.65
  * https://netty.io/news/2021/05/19/4-1-65-Final.html
* 4.1.66
  * https://netty.io/news/2021/07/16/4-1-66-Final.html
* 4.1.67
  * https://netty.io/news/2021/08/16/4-1-67-Final.html
* 4.1.68
  * https://netty.io/news/2021/09/09/4-1-68-Final.html

### Why are the changes needed?

Recently Netty `4.1.68` was released, which includes official M1 Mac support.
* Add support for mac m1
  * https://github.com/netty/netty/pull/11666

`4.1.65` also includes a critical bug fix that might affect Spark.
* JNI classloader deadlock with latest JDK version
  * https://github.com/netty/netty/issues/11209

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33970 from sarutak/upgrade-netty-4.1.68.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-12 10:07:27 -07:00
yangjie01 0e1157df06 [SPARK-36636][CORE][TEST] LocalSparkCluster change to use tmp workdir in test to avoid directory name collision
### What changes were proposed in this pull request?
As described in SPARK-36636, if test cases with the config `local-cluster[n, c, m]` are run back-to-back within 1 second, workdir name collisions occur, because the appid uses the format `app-yyyyMMddHHmmss-0000` and the workdir name is currently derived from it in tests. The related logs are as follows:

```
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Since the default value of `spark.deploy.maxExecutorRetries` is 10, the test failure occurs when 5 consecutive cases with `local-cluster[3, 1, 1024]` complete within 1 second:

1. case 1: use worker directories: `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
2. case 2: retry 3 times then use worker directories: `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
3. case 3: retry 6 times then use worker directories: `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
4. case 4: retry 9 times then use worker directories: `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
5. case 5: retry more than **10** times then **failed**

To avoid this issue, this PR changes tests with the config `local-cluster[n, c, m]` to use a temporary workdir.
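
A minimal sketch of the idea (not the exact patch):

```scala
import java.nio.file.Files

// Give each local-cluster test run a unique temporary work dir, so two runs
// started within the same second cannot collide on the same
// app-yyyyMMddHHmmss-0000 work subdirectories.
val workDir = Files.createTempDirectory("spark-test-worker-").toFile
workDir.deleteOnExit()
```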

### Why are the changes needed?
Avoid UT failures caused by continuous workdir name collision.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Manual test: `build/mvn clean install -Pscala-2.13 -pl core -am` or `build/mvn clean install -pl core -am`; the problem is easier to reproduce with Scala 2.13.

**Before**

The failed tests produce error logs like the following, with some randomness in which tests fail:
```
- SPARK-33084: Add jar support Ivy URI -- test exclude param when transitive=true *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$138(SparkContextSuite.scala:1109)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test different version *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$142(SparkContextSuite.scala:1118)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test invalid param *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$146(SparkContextSuite.scala:1129)
  at org.apache.spark.SparkFunSuite.withLogAppender(SparkFunSuite.scala:235)
  at org.apache.spark.SparkContextSuite.$anonfun$new$145(SparkContextSuite.scala:1127)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  ...
- SPARK-33084: Add jar support Ivy URI -- test multiple transitive params *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$149(SparkContextSuite.scala:1140)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test param key case sensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test transitive value case insensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$159(SparkContextSuite.scala:1166)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

```

**After**

```
Run completed in 26 minutes, 38 seconds.
Total number of tests run: 2863
Suites: completed 276, aborted 0
Tests: succeeded 2863, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
```

Closes #33963 from LuciferYang/SPARK-36636.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-12 09:57:06 -05:00
attilapiros ba81b92402 [SPARK-36719][CORE] Supporting Netty Logging at the network layer
### What changes were proposed in this pull request?

Supporting Netty level logging at the network layer.

To configure Netty-level logging, a log handler must be added to the channel pipeline.
In this PR I have introduced a new class, `NettyLogger`, which constructs a log handler depending on the log level (a sketch of this wiring follows the list):
- in case of `log4j.logger.org.apache.spark.network.util.NettyLogger=DEBUG`: a custom log handler is created which does not dump message contents. This keeps the log a bit more compact; moreover, when network-level encryption is switched on, this level might be sufficient.
- in case of `log4j.logger.org.apache.spark.network.util.NettyLogger=TRACE`: Netty's own log handler is used, which dumps the message contents.
- otherwise (when the logger is at neither TRACE nor DEBUG) the pipeline contains no log handler (there is no runtime penalty for the default setting, but a long-running app/service must be restarted with the new log level for it to take effect).
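
A conceptual sketch of the level-dependent wiring (the PR's DEBUG variant is a custom handler that omits message contents; here both branches use Netty's stock `LoggingHandler` purely for illustration):

```scala
import io.netty.channel.ChannelPipeline
import io.netty.handler.logging.{LogLevel, LoggingHandler}
import org.slf4j.LoggerFactory

def maybeAddLogHandler(pipeline: ChannelPipeline): Unit = {
  val log = LoggerFactory.getLogger("org.apache.spark.network.util.NettyLogger")
  if (log.isTraceEnabled) {
    pipeline.addLast("loggingHandler", new LoggingHandler(LogLevel.TRACE))
  } else if (log.isDebugEnabled) {
    pipeline.addLast("loggingHandler", new LoggingHandler(LogLevel.DEBUG))
  } // otherwise: no handler, so the default configuration pays no runtime cost
}
```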

### Why are the changes needed?

This level of logging proved to be sufficient while debugging an external-shuffle-related problem.
Compared with tcpdump, these log lines can be more easily correlated with Spark-internal calls.
Moreover, the log layout can be configured to include thread names, so that in the case of a timeout the busy thread can be identified.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually.

#### DEBUG level

```
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ tail -1 ./conf/log4j.properties
log4j.logger.org.apache.spark.network.util.NettyLogger=DEBUG
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master local\[8\]  ./examples/target/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar README.md 2> >(grep NettyLogger) 1> /dev/null
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf] REGISTERED
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf] CONNECT: /172.30.64.219:61014
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] ACTIVE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] REGISTERED
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] ACTIVE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] WRITE 66B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] FLUSH
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] READ 66B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] WRITE: MessageWithHeader [headerLength: 74, bodyLength: 1552705]
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] FLUSH
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 74B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 2048B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 32768B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 10561B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 ! R:/172.30.64.219:61015] INACTIVE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 ! R:/172.30.64.219:61014] INACTIVE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 ! R:/172.30.64.219:61014] UNREGISTERED
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 ! R:/172.30.64.219:61015] UNREGISTERED
```

#### TRACE level

```
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ tail -1 ./conf/log4j.properties
log4j.logger.org.apache.spark.network.util.NettyLogger=TRACE
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master local\[8\]  ./examples/target/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar README.md  1> /dev/null 2>&1
...
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786] REGISTERED
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786] CONNECT: /172.30.64.219:61044
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] ACTIVE
21/09/10 15:29:14 INFO TransportClientFactory: Successfully created connection to /172.30.64.219:61044 after 37 ms (0 ms spent in bootstraps)
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] REGISTERED
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] ACTIVE
21/09/10 15:29:14 INFO Utils: Fetching spark://172.30.64.219:61044/jars/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar to /private/var/folders/t_/fr_vqcyx23vftk81ftz1k5hw0000gn/T/spark-91e059f5-1e29-4727-8602-f81206bbe48b/userFiles-50b48490-8950-4c46-b3d3-61a2c85412a3/fetchFileTemp8803030587223485061.tmp
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] WRITE: 66B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 42 06 00 00 00 35 2f 6a 61 |.......B....5/ja|
|00000010| 72 73 2f 6f 72 69 67 69 6e 61 6c 2d 73 70 61 72 |rs/original-spar|
|00000020| 6b 2d 65 78 61 6d 70 6c 65 73 5f 32 2e 31 32 2d |k-examples_2.12-|
|00000030| 33 2e 33 2e 30 2d 53 4e 41 50 53 48 4f 54 2e 6a |3.3.0-SNAPSHOT.j|
|00000040| 61 72                                           |ar              |
+--------+-------------------------------------------------+----------------+
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] FLUSH
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] READ: 66B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 42 06 00 00 00 35 2f 6a 61 |.......B....5/ja|
|00000010| 72 73 2f 6f 72 69 67 69 6e 61 6c 2d 73 70 61 72 |rs/original-spar|
|00000020| 6b 2d 65 78 61 6d 70 6c 65 73 5f 32 2e 31 32 2d |k-examples_2.12-|
|00000030| 33 2e 33 2e 30 2d 53 4e 41 50 53 48 4f 54 2e 6a |3.3.0-SNAPSHOT.j|
|00000040| 61 72                                           |ar              |
+--------+-------------------------------------------------+----------------+
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] WRITE: MessageWithHeader [headerLength: 74, bodyLength: 1552705]
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] FLUSH
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] READ: 74B
...
```

Closes #33962 from attilapiros/SPARK-36719.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 16:14:02 -07:00
dgd_contributor ebca01f03e [SPARK-35822][UI] Spark UI-Executor tab is empty in IE11
### What changes were proposed in this pull request?
Refactor some functions in utils.js to fix the empty UI-Executor tab in yarn mode in IE11.

### Why are the changes needed?
The Spark UI Executor tab is empty in IE11, so this PR fixes that:
![Executortab_IE](https://user-images.githubusercontent.com/84778052/132786964-b17b6d12-457f-4ba3-894f-3f2e1c285b1e.PNG)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit test cases.

Closes #33937 from dgd-contributor/SPARK-35822-v2.

Authored-by: dgd_contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:58:31 -07:00
Kousuke Saruta c36d70836d [SPARK-36725][SQL][TESTS] Ensure HiveThriftServer2Suites to stop Thrift JDBC server on exit
### What changes were proposed in this pull request?

This PR aims to ensure that HiveThriftServer2Suites (e.g. `thriftserver.UISeleniumSuite`) stop the Thrift JDBC server on exit using a shutdown hook.

### Why are the changes needed?

Normally, HiveThriftServer2Suites stop the Thrift JDBC server via the `afterAll` method.
But if they are killed by a signal (e.g. Ctrl-C), the Thrift JDBC server remains running:
```
$ jps
2792969 SparkSubmit
```
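
A minimal sketch of the approach (`stopThriftServer` is a hypothetical stand-in for the suite's actual teardown logic):

```scala
// Register a JVM shutdown hook so the server is stopped even when the suite
// is killed by a signal such as SIGINT (Ctrl-C).
def installStopHook(stopThriftServer: () => Unit): Unit = {
  sys.addShutdownHook {
    stopThriftServer()
  }
}
```
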
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Killed `thriftserver.UISeleniumSuite` with Ctrl-C and confirmed via `jps` that no Thrift JDBC server remained.

Closes #33967 from sarutak/stop-thrift-on-exit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:54:35 -07:00
dgd-contributor 9af0132516 [SPARK-36685][ML][MLLIB] Fix wrong assert messages
### What changes were proposed in this pull request?
Fix wrong assert statements, a mistake made when coding.

### Why are the changes needed?
The assert statements were wrong.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
 Existing tests

Closes #33953 from dgd-contributor/SPARK-36685.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 14:39:42 -07:00
Sean Owen e5283f5ed5 [SPARK-36704][CORE] Expand exception handling to more Java 9 cases where reflection is limited at runtime, when reflecting to manage DirectByteBuffer settings
### What changes were proposed in this pull request?

Improve exception handling in the Platform initialization, where it attempts to assess whether reflection can be used to modify DirectByteBuffer settings. This can apparently fail in more cases on Java 9+ than are currently handled, whereas Spark can continue without reflection if needed.

More detailed comments on the change inline.

### Why are the changes needed?

This exception seems to be possible and fails startup:

```
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module 71e9ddb4
        at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:357)
        at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
        at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
        at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
        at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)
```
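
A sketch of the defensive pattern in Scala (the real code lives in the Java `Platform` class; catching broadly is the point of the change):

```scala
// Attempt the reflective setup, but treat any failure -- including the
// InaccessibleObjectException above -- as "unavailable" and continue.
val directBufferCtor: Option[java.lang.reflect.Constructor[_]] =
  try {
    val ctor = Class.forName("java.nio.DirectByteBuffer")
      .getDeclaredConstructor(classOf[Long], classOf[Int])
    ctor.setAccessible(true) // may throw on Java 9+ without --add-opens
    Some(ctor)
  } catch {
    case _: Throwable => None // Spark can proceed without this reflection
  }
```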

### Does this PR introduce _any_ user-facing change?

Should strictly allow Spark to continue in more cases.

### How was this patch tested?

Existing tests.

Closes #33947 from srowen/SPARK-36704.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-11 13:38:10 -05:00
Huaxin Gao 1f679ed8e9 [SPARK-36556][SQL] Add DSV2 filters
Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating catalyst `Expression`s to V1 filters, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-11 10:12:21 -07:00
yi.wu 7103a165d5 Revert "[SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight"
This reverts commit b9e53f8937.

### What changes were proposed in this pull request?

Revert https://github.com/apache/spark/pull/32114

### Why are the changes needed?

It breaks the expected `BlockManager` re-registration (e.g., after heartbeat loss of an active executor) due to the deferred removal of `BlockManager`; see the check:
9cefde8db3/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (L551)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass existing tests.

Closes #33942 from Ngone51/SPARK-36700.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-09 20:21:14 -07:00
dgd-contributor 711577e238 [SPARK-36687][SQL][CORE] Rename error classes with _ERROR suffix
### What changes were proposed in this pull request?
Remove the redundant `_ERROR` suffix from error classes in error-classes.json.

### Why are the changes needed?
Clean up error classes  to reduce clutter

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33944 from dgd-contributor/SPARK-36687.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-10 10:00:28 +09:00
Shruti Gumma 28e0a0e21e [SPARK-36334][K8S][FOLLOWUP] Allow equal resource version to update snapshot
### What changes were proposed in this pull request?

This PR aims to allow snapshot updates when the resource version is equal to the previous version.

### Why are the changes needed?

This prevents a possible timing issue where the driver has not yet registered executors when the last pod update events arrive.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33949 from dongjoon-hyun/SPARK-36334-2.

Authored-by: Shruti Gumma <shruti_gumma@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-09 16:57:04 -07:00
Liang-Chi Hsieh 6bcf330191 [SPARK-36669][SQL] Add Lz4 wrappers for Hadoop Lz4 codec
### What changes were proposed in this pull request?

This patch proposes to add a few LZ4 wrapper classes for Parquet Lz4 compression output that uses Hadoop Lz4 codec.

### Why are the changes needed?

Currently we use Hadoop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec, but it isn't excluded from relocation in these libraries. So when using lz4 as the Parquet codec, we hit the exception below even if we include lz4 as a dependency.

```
[info]   Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory
[info]   at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66)
[info]   at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
```

Until the issue is fixed in a new Hadoop release, we can add a few wrapper classes for the Lz4 codec, as sketched below.
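
A conceptual sketch of the wrapper idea (the real wrappers are Java classes mirroring the full codec surface; this is illustrative only):

```scala
// Publish a class under the relocated name that the shaded Hadoop Lz4Codec
// looks up, forwarding to the genuine lz4-java implementation.
package org.apache.hadoop.shaded.net.jpountz.lz4

class LZ4Factory private (val delegate: _root_.net.jpountz.lz4.LZ4Factory)

object LZ4Factory {
  // _root_ keeps the enclosing `...shaded.net` package from shadowing `net`.
  def fastestInstance(): LZ4Factory =
    new LZ4Factory(_root_.net.jpountz.lz4.LZ4Factory.fastestInstance())
}
```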

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified test.

Closes #33940 from viirya/lz4-wrappers.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-09 09:31:00 -07:00
itholic 0227793f9e [SPARK-36689][PYTHON] Cleanup the deprecated APIs and raise proper warning message
### What changes were proposed in this pull request?

This PR proposes to clean up the deprecated APIs in `missing/*.py`, and to raise proper warning messages for the deprecated APIs, as pandas does.

Also remove the check for pandas < 1.0, since we now only follow the behavior of the latest pandas.

### Why are the changes needed?

We should follow the deprecation of APIs of latest pandas.

### Does this PR introduce _any_ user-facing change?

Now some APIs raise a proper message pointing to alternatives for deprecated functions, as pandas does.

### How was this patch tested?

Ran `dev/lint-python` and manually check the pandas API documents one by one.

Closes #33931 from itholic/SPARK-36689.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 19:50:29 +09:00
Liang-Chi Hsieh 647ffe655f [SPARK-34479][SQL][DOC][FOLLOWUP] Add zstandard to avro supported codecs
### What changes were proposed in this pull request?

Adding `zstandard` to avro supported codecs.

### Why are the changes needed?

To improve the document.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Doc only.

Closes #33943 from viirya/minor-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-08 23:21:23 -07:00
Liang-Chi Hsieh c95d3fe6c9 [SPARK-36670][SQL][TEST][FOLLOWUP] Add AvroCodecSuite
### What changes were proposed in this pull request?

This patch proposes to also add `AvroCodecSuite` as a follow-up to SPARK-36670.

### Why are the changes needed?

Improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added test.

Closes #33939 from viirya/SPARK-36670-avro.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-08 22:17:35 -07:00
Max Gekk b74a1ba69f [SPARK-36674][SQL] Support ILIKE - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `like` expression - `ilike`. In this way, Spark's users can match strings to a single pattern in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_ex(subject varchar(20));
spark-sql> insert into ilike_ex values
         > ('John  Dddoe'),
         > ('Joe   Doe'),
         > ('John_down'),
         > ('Joe down'),
         > (null);
spark-sql> select *
         >     from ilike_ex
         >     where subject ilike '%j%h%do%'
         >     order by 1;
John  Dddoe
John_down
```

The syntax of `ilike` is similar to `like`:
```
str ILIKE pattern[ ESCAPE escape]
```

#### Implementation details
`ilike` is implemented as a runtime replaceable expression to `Like(Lower(left), Lower(right), escapeChar)`. Such replacement is acceptable because `ilike`/`like` recognise only `_` and `%` as special characters but not special character classes.

**Note:** The PR aims to support `ilike` in SQL only. Others APIs can be updated separately on demand.
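
Given that replacement, `ilike` should agree with `like` over lower-cased operands; a quick illustrative check (not the optimizer code itself, and the output is assumed):

```scala
spark.sql(
  "SELECT 'John_down' ILIKE '%J%H%DO%' AS via_ilike, " +
  "lower('John_down') LIKE lower('%J%H%DO%') AS via_lower_like").show()
// Both columns are expected to be true.
```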

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [ClickHouse](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#ilike)
    - [Vertica](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm)
    - [Impala](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_operators.html#ilike)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test:
```
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
$ build/sbt "test:testOnly *SQLKeywordSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```

Closes #33919 from MaxGekk/ilike-single-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:55:20 +08:00
Andrew Liu 9b633f2075 [SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe
### What changes were proposed in this pull request?

Fix `SimplifyConditionalsInPredicate` to be null-safe.

Reproducible:

```
import org.apache.spark.sql.types.{StructField, BooleanType, StructType}
import org.apache.spark.sql.Row

val schema = List(
  StructField("b", BooleanType, true)
)
val data = Seq(
  Row(true),
  Row(false),
  Row(null)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// cartesian product of true / false / null
val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal"))
df2.createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// actual:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// +-----+--------+
spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// expected:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// | null|    true|
// +-----+--------+
```

### Why are the changes needed?

This fixes a regression that leads to incorrect results: the rule rewrote `IF(cond, FALSE, falseVal)` into `NOT cond AND falseVal`, which wrongly drops rows where `cond` is NULL (the IF evaluates to `falseVal` there, while `NOT NULL AND TRUE` evaluates to NULL).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate.

Authored-by: Andrew Liu <andrewlliu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:32:40 +08:00
Angerszhuuuu 67421d80b8 [SPARK-36692][CORE] Improve Error statement when requesting thread dump while executor already stopped
### What changes were proposed in this pull request?
Currently, when a user requests a thread dump for an executor that has already stopped, the error log shown below might confuse users.

![image](https://user-images.githubusercontent.com/46485123/132471501-db96894d-abe9-4d62-9943-06c578382ef2.png)

### Why are the changes needed?
Improve error message

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #33935 from AngersZhuuuu/SPARK-36692.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 10:04:24 +09:00
Xinrong Meng 33bb7b39e9 [SPARK-36697][PYTHON] Fix dropping all columns of a DataFrame
### What changes were proposed in this pull request?
Fix dropping all columns of a DataFrame

### Why are the changes needed?
When dropping all columns of a pandas-on-Spark DataFrame, a ValueError is raised, whereas pandas returns an empty DataFrame preserving the index.
We should follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)

```
To
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Empty DataFrame
Columns: []
Index: [0, 1]
```

### How was this patch tested?
Unit tests.

Closes #33938 from xinrong-databricks/frame_drop_col.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:59:42 +09:00
Hyukjin Kwon 34f80ef313 [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of `spark.sql.timestampType` configuration

### Why are the changes needed?

To complete `TimestampNTZ` support.

In more details:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). This naive `datetime` interpretation is up to the program to decide how to interpret, e.g.) whether a local time vs UTC time as an example. Although some Python built-in APIs assume they are local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

    > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
    - that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
    - if you want to handle them as if a local timezone, this interpretation is matched to `TimestamType` (local time)

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
    - `Timestamp(..., timezone=sparkLocalTimezone)` ->  `TimestamType`
    - `Timestamp(..., timezone=null)` ->  `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time).

    One example is that pandas allows to convert these naive `datetime` as if they are in UTC by default:

    ```python
    >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
    0    0
    ```

    whereas in Spark:

    ```python
    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +------+
    |    _1|
    +------+
    |-32400|
    +------+

    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +---+
    | _1|
    +---+
    |  0|
    +---+
    ```

    In contrast, some APIs like `pandas.Timestamp.fromtimestamp` assume they are local times:

    ```python
    >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
    Timestamp('1970-01-01 09:00:00')
    ```

    For native Python, users can decide how to interpret naive `datetime`, so it's fine. The problem is that the pandas-API-on-Spark case would require two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.

    As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, now pandas API on Spark can handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds support for Py4J handling of the `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:57:38 +09:00
yangjie01 375ca94678 [SPARK-36690][SS] Clean up deprecated api usage after upgrade commons-pool2 to 2.11.1
### What changes were proposed in this pull request?
SPARK-36583 upgraded `Apache commons-pool2` from 2.6.2 to 2.11.1, and there is some deprecated API usage related to it that needs to be cleaned up.

The list of changes is as follows; a usage sketch follows the list:

- `BaseObjectPoolConfig.setMinEvictableIdleTimeMillis` -> `BaseObjectPoolConfig.setMinEvictableIdleTime`
- `BaseObjectPoolConfig.setSoftMinEvictableIdleTimeMillis` -> `BaseObjectPoolConfig.setSoftMinEvictableIdleTime`
- `BaseObjectPoolConfig.setTimeBetweenEvictionRunsMillis` -> `BaseObjectPoolConfig.setTimeBetweenEvictionRuns`
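
A minimal usage sketch of the Duration-based replacements, assuming commons-pool2 2.11.x on the classpath (the pool config type and durations are illustrative):

```scala
import java.time.Duration
import org.apache.commons.pool2.impl.GenericObjectPoolConfig

val conf = new GenericObjectPoolConfig[String]()
// Duration-based setters replace the deprecated *Millis variants:
conf.setMinEvictableIdleTime(Duration.ofMinutes(5))      // was setMinEvictableIdleTimeMillis(300000L)
conf.setSoftMinEvictableIdleTime(Duration.ofMinutes(1))  // was setSoftMinEvictableIdleTimeMillis(60000L)
conf.setTimeBetweenEvictionRuns(Duration.ofSeconds(30))  // was setTimeBetweenEvictionRunsMillis(30000L)
```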

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA or Jenkins Tests.

Closes #33933 from LuciferYang/SPARK-36690.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-08 16:44:43 +09:00
Huaxin Gao 23794fb303 [SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference
### What changes were proposed in this pull request?
Currently, we use `FieldReference` as the aggregate column type; it should be `NamedReference` instead.

### Why are the changes needed?
`FieldReference` is a private class; `NamedReference` should be used instead.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #33927 from huaxingao/agg_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 14:05:44 +08:00
Yuto Akutsu 5d5c942c70 [SPARK-36688][R] Add cot as an R function
### What changes were proposed in this pull request?

Add cotangent as an R function.

### Why are the changes needed?

My previous PR (https://github.com/apache/spark/pull/33906) missed R support.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call the cotangent function as an R function.

### How was this patch tested?

unit tests.

Closes #33925 from yutoacts/SPARK-36660.

Lead-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-08 14:21:01 +09:00
yangjie01 acd9c92fa8 [SPARK-36684][SQL][TESTS] Add Jackson test dependencies to sql/core module at hadoop-2.7 profile
### What changes were proposed in this pull request?
SPARK-26346 upgraded Parquet-related modules from 1.10.1 to 1.11.1, and `parquet-jackson 1.11.1` uses `com.fasterxml.jackson` instead of `org.codehaus.jackson`.

So, there are warning logs related to

```
17:12:17.605 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
...
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
...
```

when testing the `sql/core` module with the `hadoop-2.7` profile.

This PR adds test dependencies related to `org.codehaus.jackson` to the `sql/core` module when the `hadoop-2.7` profile is activated.

### Why are the changes needed?
Clean up test warning logs that shouldn't exist.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Manual test `mvn clean test -pl sql/core -am -DwildcardSuites=none -Phadoop-2.7`

**Before**

No test failed, but there were warning logs as follows:

```
[INFO] Running test.org.apache.spark.sql.JavaBeanDeserializationSuite
22:42:45.211 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22:42:46.827 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650)
	at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:62)
	at org.apache.spark.sql.internal.SharedState$.liftedTree1$1(SharedState.scala:181)
	at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:180)
	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:54)
	at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:135)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:135)
	at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:134)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:335)
	at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
	at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
	at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:109)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:109)
	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:194)
	at org.apache.spark.sql.types.DataType.sameType(DataType.scala:97)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1(TypeCoercion.scala:291)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:291)
	at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
	at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
	at scala.collection.immutable.List.forall(List.scala:89)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType(TypeCoercion.scala:291)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1074)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1069)
	at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1080)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1079)
	at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1084)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1084)
	at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.$anonfun$dataType$4(objects.scala:815)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.dataType(objects.scala:815)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.$anonfun$dataType$9(complexTypeCreator.scala:416)
	at scala.collection.immutable.List.map(List.scala:290)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType$lzycompute(complexTypeCreator.scala:410)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:398)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:309)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:319)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:248)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75)
	at org.apache.spark.sql.Encoders$.bean(Encoders.scala:154)
	at org.apache.spark.sql.Encoders.bean(Encoders.scala)
	at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testBeanWithArrayFieldDeserialization(JavaBeanDeserializationSuite.java:75)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:364)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:237)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:158)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.<clinit>(WebHdfsFileSystem.java:129)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 81 more
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	... 88 more
```

**After**

There are no more warning logs like the above.

Closes #33926 from LuciferYang/SPARK-36684.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-07 21:40:41 -07:00
Pablo Langa feba05f181 [SPARK-35803][SQL] Support DataSource V2 CreateTempViewUsing
### What changes were proposed in this pull request?

Currently, only DataSource V1 is supported in the CreateTempViewUsing command. This PR refactors DataFrameReader to reuse the code for creating a DataFrame from a DataSource V2.
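
For illustration, a hedged sketch of the command as it could be used with a V2 source (the source class name and options below are hypothetical, assuming an active SparkSession `spark`):

```scala
// Hypothetical V2 data source class and path, for illustration only.
spark.sql("""
  CREATE TEMPORARY VIEW v2_view
  USING com.example.datasource.V2Source
  OPTIONS (path '/tmp/example-data')
""")
spark.sql("SELECT * FROM v2_view").show()
```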

### Why are the changes needed?

Improve the support of DataSource V2 in this command.

### Does this PR introduce _any_ user-facing change?

It does not change the current behavior; it only adds new functionality.

### How was this patch tested?

Unit testing

Closes #33922 from planga82/feature/spark35803_crateview_datasourceV2.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 12:16:29 +08:00
Liang Zhang cb30683b65 [SPARK-36642][SQL] Add df.withMetadata: a syntax sugar to update the metadata of a dataframe
### What changes were proposed in this pull request?

To make it easy to use/modify the semantic annotation, we want to have a shorter API to update the metadata in a dataframe. Currently we have `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))` to update the metadata without changing the column name, and this is too verbose. We want to have a syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality.
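
A minimal sketch of the proposed API in Scala (the column name and metadata key below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().master("local[1]").appName("withMetadata").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("col1", "col2")
val metadata = new MetadataBuilder().putString("semantic_type", "categorical").build()

// Attach metadata without renaming the column.
val annotated = df.withMetadata("col1", metadata)
assert(annotated.schema("col1").metadata.getString("semantic_type") == "categorical")
```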

### Why are the changes needed?

A bit of background on how frequently the metadata is updated: we are working on inferring semantic data types, using them in AutoML, and storing the semantic annotation in the metadata. So in many cases, we will suggest that the user update the metadata to correct a wrong inference or add an annotation for a weak inference.

### Does this PR introduce _any_ user-facing change?

Yes.
A syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality as `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`.

### How was this patch tested?

A unit test in DataFrameSuite.scala.

Closes #33853 from liangz1/withMetadata.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-09-08 09:35:18 +08:00
itholic 71dbd03fbe [SPARK-36531][SPARK-36515][PYTHON] Improve test coverage for data_type_ops/* and groupby
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark data types & GroupBy code base, which is written in `data_type_ops/*.py` and `groupby.py` separately.

This PR did the following to improve coverage:
- Add unittest for untested code
- Fix unittest which is not tested properly
- Remove unused code

**NOTE**: This PR does not only include test-only updates; for example, it also includes fixing `astype` for binary ops.

Given the following pandas-on-Spark Series:
```python
>>> psser
0    [49]
1    [50]
2    [51]
dtype: object
```

before:
```python
>>> psser.astype(bool)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type mismatch: cannot cast binary to boolean;
...
```

after:
```python
>>> psser.astype(bool)
0    True
1    True
2    True
dtype: bool
```

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33850 from itholic/SPARK-36531.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-08 10:22:52 +09:00
Venkata Sai Akhil Gudesa 2ed6e7bc5d [SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections
### What changes were proposed in this pull request?

This PR filters out `ExtractValue`s that contain any aggregation function in the `NestedColumnAliasing` rule, to prevent cases where aggregations are pushed down into projections.

### Why are the changes needed?

To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.

Consider the following schema:
```
root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`

Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimized plan before this PR is:

```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 18:15:48 -07:00
Liang-Chi Hsieh 5a0ae694d0 [SPARK-36670][SQL][TEST] Add FileSourceCodecSuite
### What changes were proposed in this pull request?

This patch mainly proposes to add some e2e test cases in Spark for the codecs used by the main datasources.

### Why are the changes needed?

We found there are no e2e test cases available for main datasources like Parquet and ORC. This makes it harder for developers to identify possible bugs early. We should add such tests in Spark.
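
A hedged sketch of the kind of end-to-end round-trip such a suite checks (the codecs and path are illustrative, assuming an active SparkSession `spark`):

```scala
// Write and read back with each codec; a broken codec path would surface here.
Seq("snappy", "gzip", "zstd").foreach { codec =>
  val path = s"/tmp/codec-e2e-$codec"
  spark.range(100).write.mode("overwrite").option("compression", codec).parquet(path)
  assert(spark.read.parquet(path).count() == 100)
}
```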

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 16:53:11 -07:00
Andy Grove f78d8394dc [SPARK-36666][SQL] Fix regression in AQEShuffleReadExec
Fix regression in AQEShuffleReadExec when used in conjunction with Spark plugins with custom partitioning.

Signed-off-by: Andy Grove <andygrove73@gmail.com>

### What changes were proposed in this pull request?

Return `UnknownPartitioning` rather than throw an exception in `AQEShuffleReadExec`.

### Why are the changes needed?

The [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) replaces `AQEShuffleReadExec` with a custom operator that runs on the GPU. Due to changes in [SPARK-36315](dd80457ffb), Spark now throws an exception if the shuffle exchange does not have recognized partitioning, and this happens before the postStageOptimizer rules, so there is no opportunity to replace this operator now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I am still in the process of testing this change. I will update the PR in the next few days with status.

Closes #33910 from andygrove/SPARK-36666.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-07 13:49:45 -07:00
Liang-Chi Hsieh 6745d77818 [SPARK-36682][CORE][TEST] Add Hadoop sequence file test for different Hadoop codecs
### What changes were proposed in this pull request?

This patch proposes to add e2e tests for using Hadoop codecs to write sequence files.
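
A hedged sketch of what such a test exercises (the codec and path are illustrative, assuming an active SparkContext `sc`):

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Write a compressed sequence file, then read it back.
val path = "/tmp/seq-gzip-example"
sc.parallelize(1 to 100)
  .map(i => (i, i.toString))
  .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
assert(sc.sequenceFile[Int, String](path).count() == 100)
```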

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #33924 from viirya/hadoop-seq-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 13:19:58 -07:00
Kousuke Saruta a5fe5d368c [SPARK-36153][SQL][DOCS][FOLLOWUP] Fix the description about the possible values of spark.sql.catalogImplementation property
### What changes were proposed in this pull request?

This PR fixes the description about the possible values of `spark.sql.catalogImplementation` property.
It was added in SPARK-36153 (#33362), but the possible values are `hive` or `in-memory` rather than `true` or `false`.
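
For reference, a hedged illustration of the two valid values (the config is static, so it is set at launch time, assuming an active SparkSession `spark`):

```scala
// e.g. spark-shell --conf spark.sql.catalogImplementation=in-memory
spark.conf.get("spark.sql.catalogImplementation")  // "hive" or "in-memory"
```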

### Why are the changes needed?

To fix the wrong description.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I just confirmed `in-memory` and `hive` are the valid values with `spark-shell`.

Closes #33923 from sarutak/fix-doc-about-catalogImplementation.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-07 11:39:45 +09:00
Jungtaek Lim 093c2080fe [SPARK-36667][SS][TEST] Close resources properly in StateStoreSuite/RocksDBStateStoreSuite
### What changes were proposed in this pull request?

This PR proposes to ensure StateStoreProvider instances are properly closed for each test in StateStoreSuite/RocksDBStateStoreSuite.
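
A hedged sketch of the cleanup pattern (the names and structure are illustrative, not the exact suite code):

```scala
// Track every closeable resource a test creates, then close them all at the end.
val resources = scala.collection.mutable.ArrayBuffer.empty[AutoCloseable]
try {
  // ... create and exercise provider/store instances, registering each in `resources` ...
} finally {
  resources.foreach(_.close())
}
```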

### Why are the changes needed?

While this doesn't break the tests, it is bad practice and may cause nasty problems in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs

Closes #33916 from HeartSaVioR/SPARK-36667.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-06 17:40:03 -07:00
Kousuke Saruta 0ab0cb108d [SPARK-36675][SQL] Support ScriptTransformation for timestamp_ntz
### What changes were proposed in this pull request?

This PR aims to support `ScriptTransformation` for `timestamp_ntz`.
In the current master, it doesn't work.
```
spark.sql("SELECT transform(col1) USING 'cat' AS (col1 timestamp_ntz) FROM VALUES timestamp_ntz'2021-09-06 20:19:13' t").show(false)
21/09/06 22:03:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: SparkScriptTransformation without serde does not support TimestampNTZType$ as output data type
	at org.apache.spark.sql.errors.QueryExecutionErrors$.outputDataTypeUnsupportedByNodeWithoutSerdeError(QueryExecutionErrors.scala:1740)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.$anonfun$outputFieldWriters$1(BaseScriptTransformationExec.scala:245)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(BaseScriptTransformationExec.scala:194)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$(BaseScriptTransformationExec.scala:194)
	at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$lzycompute(SparkScriptTransformationExec.scala:38)
	at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(SparkScriptTransformationExec.scala:38)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.$anonfun$processRowWithoutSerde$1(BaseScriptTransformationExec.scala:121)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:162)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:113)
```

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33920 from sarutak/script-transformation-timestamp-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-06 20:58:07 +02:00
Hyukjin Kwon c6f3a13087 [SPARK-36626][PYTHON][FOLLOW-UP] Use datetime.tzinfo instead of datetime.tzname()
### What changes were proposed in this pull request?

This PR is a small followup of https://github.com/apache/spark/pull/33876, which proposes to use `datetime.tzinfo` instead of `datetime.tzname` to see whether timezone information is provided.

This way is consistent with other places such as:

9c5bcac61e/python/pyspark/sql/types.py (L182)

9c5bcac61e/python/pyspark/sql/types.py (L1662)

### Why are the changes needed?

In some cases, `datetime.tzname` can raise an exception (https://docs.python.org/3/library/datetime.html#datetime.datetime.tzname):

> ... raises an exception if the latter doesn’t return None or a string object,

I was able to reproduce this in Jenkins by setting `spark.sql.timestampType` to `TIMESTAMP_NTZ` by default:

```
======================================================================
ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_serde.py", line 92, in test_time_with_timezone
...
  File "/usr/lib/pypy3/lib-python/3/datetime.py", line 979, in tzname
    raise NotImplementedError("tzinfo subclass must override tzname()")
NotImplementedError: tzinfo subclass must override tzname()
```

### Does this PR introduce _any_ user-facing change?

No to end users because it has not been released.
This is rather a safeguard to prevent potential breakage.

### How was this patch tested?

Manually tested.

Closes #33918 from HyukjinKwon/SPARK-36626-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-06 17:16:52 +02:00
Yuto Akutsu db95960f4b [SPARK-36660][SQL] Add cot as Scala and Python functions
### What changes were proposed in this pull request?

Add cotangent support via DataFrame operations (e.g. `df.select(cot($"col"))`).
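
A minimal usage sketch (the input values are illustrative, assuming an active SparkSession `spark`):

```scala
import org.apache.spark.sql.functions.{col, cot}

// cot(x) = cos(x) / sin(x)
spark.range(1, 4)
  .select(cot(col("id").cast("double")).alias("cot_id"))
  .show()
```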

### Why are the changes needed?

Cotangent has been supported by Spark SQL since 2.3.0, but it cannot be called via DataFrame operations.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call the cotangent function via DataFrame operations.

### How was this patch tested?

unit tests.

Closes #33906 from yutoacts/SPARK-36660.

Lead-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-06 13:38:18 +09:00
yangjie01 bdb73bbc27 [SPARK-36613][SQL][SS] Use EnumSet as the implementation of Table.capabilities method return value
### What changes were proposed in this pull request?
The `Table.capabilities` method returns a `java.util.Set` of the `TableCapability` enumeration type, which is currently implemented using `java.util.HashSet`. Such a Set can be replaced with `java.util.EnumSet`, because `EnumSet` implementations can be much more efficient compared to other sets.
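
A minimal sketch of the replacement (the capability choices are illustrative):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.TableCapability

// EnumSet is backed by a bit vector, so creation and contains() are cheap.
val capabilities: util.Set[TableCapability] =
  util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE)
assert(capabilities.contains(TableCapability.BATCH_READ))
```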

### Why are the changes needed?
Use more appropriate data structures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Add a new benchmark to compare `create` and `contains` operation between `EnumSet` and `HashSet`

Closes #33867 from LuciferYang/SPARK-36613.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:23:05 -05:00
yangjie01 35848385ae [SPARK-36602][CORE][SQL] Clean up redundant asInstanceOf casts
### What changes were proposed in this pull request?
This PR removes redundant `asInstanceOf` casts in Spark code.
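
An illustrative (hypothetical) example of the pattern being removed; the cast adds nothing when the static type already matches:

```scala
val xs: Seq[String] = Seq("a", "b")
val before = xs.asInstanceOf[Seq[String]].mkString(",") // redundant cast
val after  = xs.mkString(",")                           // equivalent
assert(before == after)
```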

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.

Closes #33852 from LuciferYang/cleanup-asInstanceof.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:22:28 -05:00
Senthil Kumar 6bd491ecb8 [SPARK-36643][SQL] Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
### What changes were proposed in this pull request?

This PR adds additional information to the ERROR log emitted when a SparkConf entry is modified while spark.sql.legacy.setCommandRejectsSparkCoreConfs is set.

### Why are the changes needed?

Right now, spark.sql.legacy.setCommandRejectsSparkCoreConfs is set to true by default in Spark 3.x in order to prevent changing Spark core confs. But the current error message leaves it unclear whether Spark confs can be modified in Spark 3.x at all.
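
A hedged illustration of the rejection being clarified (the error text is approximate, assuming an active SparkSession `spark`):

```scala
// With spark.sql.legacy.setCommandRejectsSparkCoreConfs=true (the default in 3.x),
// modifying a core conf via SET fails:
spark.sql("SET spark.executor.memory=2g")
// org.apache.spark.sql.AnalysisException:
//   Cannot modify the value of a Spark config: spark.executor.memory ...
```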

### Does this PR introduce _any_ user-facing change?

Yes. A trivial change in the error messages is included.

### How was this patch tested?

New test added: SPARK-36643: Show migration guide when attempting SparkConf

Closes #33894 from senthh/1st_Sept_2021.

Lead-authored-by: Senthil Kumar <senthh@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-03 23:49:45 -07:00
Xinrong Meng 019203caec [SPARK-36655][PYTHON] Add versionadded for API added in Spark 3.3.0
### What changes were proposed in this pull request?

Add `versionadded` for API added in Spark 3.3.0: DataFrame.combine_first.

### Why are the changes needed?
That documents the version of Spark which added the described API.

### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only the documentation of the affected API shows when it was introduced.

### How was this patch tested?
Manual test.

Closes #33901 from xinrong-databricks/version.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-03 11:22:53 -07:00
dgd-contributor 9b262e722d [SPARK-36401][PYTHON] Implement Series.cov
### What changes were proposed in this pull request?

Implement Series.cov

### Why are the changes needed?

That is supported in pandas. We should support that as well.

### Does this PR introduce _any_ user-facing change?

Yes. Series.cov can be used.

```python
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.016857626527158744
>>> reset_option("compute.ops_on_diff_frames")
```

### How was this patch tested?

Unit tests

Closes #33752 from dgd-contributor/SPARK-36401_Implement_Series.cov.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-03 10:41:27 -07:00
Kousuke Saruta cf3bc65e69 [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
### What changes were proposed in this pull request?

This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.
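
With the fix, a hedged check of the expected behavior (a single-element result rather than an exception, assuming an active SparkSession `spark`):

```scala
// start == stop with a negative step should yield just the start element.
val row = spark.sql(
  "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month) AS s"
).first()
assert(row.getSeq[java.sql.Timestamp](0).length == 1)
```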

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33895 from sarutak/fix-sequence-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-03 23:25:18 +09:00
itholic 61b3223f47 [SPARK-36609][PYTHON] Add errors argument for ps.to_numeric
### What changes were proposed in this pull request?

This PR proposes to support the `errors` argument for `ps.to_numeric`, as pandas does.

Note that we don't support `ignore` when the `arg` is a pandas-on-Spark Series for now.

### Why are the changes needed?

We should match the behavior to pandas' as much as possible.

Also in the [recent blog post](https://medium.com/chuck.connell.3/pandas-on-databricks-via-koalas-a-review-9876b0a92541), the author pointed out we're missing this feature.

It seems to be the kind of feature that is commonly used in data science.

### Does this PR introduce _any_ user-facing change?

Now the `errors` argument is available for `ps.to_numeric`.

### How was this patch tested?

Unittests.

Closes #33882 from itholic/SPARK-36609.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-03 21:33:30 +09:00
Kent Yao 7f1ad7be18 [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### What changes were proposed in this pull request?

Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

### Why are the changes needed?

spark.sql.execution.topKSortFallbackThreshold is currently an internal config hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world cases, if the K is very big, there can be performance issues.

It's better to leave this choice to users.
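
A hedged usage sketch once the config is user-facing (the threshold and query are illustrative, assuming an active SparkSession `spark`):

```scala
// Limits below the threshold can use a Top-K sort; larger limits fall back
// to a full sort followed by a limit.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")
spark.range(1000000L).createOrReplaceTempView("t")
spark.sql("SELECT * FROM t ORDER BY id LIMIT 100").collect()   // Top-K sort
spark.sql("SELECT * FROM t ORDER BY id LIMIT 50000").collect() // full-sort fallback
```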

### Does this PR introduce _any_ user-facing change?

spark.sql.execution.topKSortFallbackThreshold is now user-facing.

### How was this patch tested?

Passing GA.

Closes #33904 from yaooqinn/SPARK-36659.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-09-03 19:11:37 +08:00
Kazuyuki Tanimura d3e3df17aa [SPARK-36644][SQL] Push down boolean column filter
### What changes were proposed in this pull request?
This PR proposes to improve `DataSourceStrategy` to be able to push down boolean column filters. Currently boolean column filters do not get pushed down and may cause unnecessary IO.

### Why are the changes needed?
The following query does not push down the filter in the current implementation
```
SELECT * FROM t WHERE boolean_field
```
although the following query pushes down the filter as expected.
```
SELECT * FROM t WHERE boolean_field = true
```
This is because the Physical Planner (`DataSourceStrategy`) currently only pushes down limited expression patterns like `EqualTo`.
It is fair for Spark SQL users to expect `boolean_field` to perform the same as `boolean_field = true`.
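
A hedged sketch of the translation this enables (simplified; not the exact planner rule):

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// A bare boolean attribute is semantically `attr = true`, so it can be
// rewritten into a source-level EqualTo filter for pushdown.
def translateBareBoolean(attr: String): Filter = EqualTo(attr, true)

assert(translateBareBoolean("boolean_field") == EqualTo("boolean_field", true))
```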

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests
```
build/sbt "core/testOnly *DataSourceStrategySuite   -- -z SPARK-36644"
```

Closes #33898 from kazuyukitanimura/SPARK-36644.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-03 07:39:14 +00:00