Commit graph

29326 commits

gengjiaan 32a523b56f [SPARK-34234][SQL] Remove TreeNodeException that didn't work
### What changes were proposed in this pull request?
`TreeNodeException` makes the error message unclear and doesn't work well.
Because `TreeNodeException` is redundant, we can remove it.

Here is a case that shows the problem:
```
val df = Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4)).toDF("x", "y")
val hashAggDF = df.groupBy("x").agg(c, sum("y"))
```
The above code will use `HashAggregateExec`. In order to ensure that an exception will be thrown when executing `HashAggregateExec`, I added `throw new RuntimeException("calculate error")` into 72b7f8abfb/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L85)

So, if the above code is executed, `RuntimeException("calculate error")` will be thrown.
Before this PR, the error is:
```
execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
ShuffleQueryStage 0
+- Exchange hashpartitioning(x#105, 5), ENSURE_REQUIREMENTS, [id=#168]
   +- HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
      +- Project [_1#100 AS x#105, _2#101 AS y#106]
         +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:163)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:81)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:79)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[x#105], functions=[partial_sum(y#106)], output=[x#105, sum#118L])
+- Project [_1#100 AS x#105, _2#101 AS y#106]
   +- LocalTableScan [_1#100, _2#101]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:118)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:122)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:163)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 91 more
Caused by: java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1(HashAggregateExec.scala:85)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
	... 103 more
```

After this PR, the error is:
```
calculate error
java.lang.RuntimeException: calculate error
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:84)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:121)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:120)
	at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:161)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:80)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:78)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5(AdaptiveSparkPlanExec.scala:207)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$5$adapted(AdaptiveSparkPlanExec.scala:205)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:205)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:289)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3708)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2977)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3699)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3697)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2977)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$3(DataFrameAggregateSuite.scala:665)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54)
	at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38)
	at org.apache.spark.sql.DataFrameAggregateSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244)
	at org.apache.spark.sql.DataFrameAggregateSuite.withSQLConf(DataFrameAggregateSuite.scala:37)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2(DataFrameAggregateSuite.scala:659)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$assertNoExceptions$2$adapted(DataFrameAggregateSuite.scala:655)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.sql.DataFrameAggregateSuite.assertNoExceptions(DataFrameAggregateSuite.scala:655)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126(DataFrameAggregateSuite.scala:695)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$126$adapted(DataFrameAggregateSuite.scala:695)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.DataFrameAggregateSuite.$anonfun$new$125(DataFrameAggregateSuite.scala:695)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
	at org.scalatest.Suite.run(Suite.scala:1112)
	at org.scalatest.Suite.run$(Suite.scala:1094)
	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
	at org.scalatest.tools.Runner$.run(Runner.scala:798)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
```

### Why are the changes needed?
`TreeNodeException` doesn't work well and only obscures the real error.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31337 from beliefer/SPARK-34234.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 06:25:33 +00:00
Max Gekk c082c537de [SPARK-34404][SQL] Add new Avro datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In the PR, I propose a new option, `datetimeRebaseMode`, for the Avro datasource. The option controls how ancient dates and timestamp column values are loaded from Avro files.

The option supports the same values as the SQL config `spark.sql.legacy.avro.datetimeRebaseModeInRead`, namely:
- `"LEGACY"`: Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`: dates/timestamps are read as is from Avro files.
- `"EXCEPTION"`: Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.

### Why are the changes needed?
1. The new option allows loading Avro files from at least two sources in different rebasing modes within the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").format("avro").load(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").format("avro").load(folder2)
df1.join(df2, ...)
```
Before the changes, this is impossible because the SQL config `spark.sql.legacy.avro.datetimeRebaseModeInRead` applies to both reads.

2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.avro.datetimeRebaseModeInRead", "legacy")
spark.read.format("avro").load(folder).distinct.rdd.collect()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"
```

Closes #31529 from MaxGekk/avro-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 06:23:10 +00:00
Chao Sun 0986f16c8d [SPARK-34347][SQL] CatalogImpl.uncacheTable should invalidate in cascade for temp views
### What changes were proposed in this pull request?

This PR includes the following changes:
1. In `CatalogImpl.uncacheTable`, invalidate caches in cascade when the target table is a temp view and `spark.sql.legacy.storeAnalyzedPlanForView` is false (the default value).
2. Make `SessionCatalog.lookupTempView` public and return the processed temp view plan (i.e., with the `View` op).

### Why are the changes needed?

Following [SPARK-34052](https://issues.apache.org/jira/browse/SPARK-34052) (#31107), we should invalidate in cascade for `CatalogImpl.uncacheTable` when the table is a temp view, so that the behavior is consistent.

### Does this PR introduce _any_ user-facing change?

Yes, now `SQLContext.uncacheTable` will drop temp view in cascade by default.
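A minimal sketch of the new cascading behaviour (view names are illustrative, spark-shell style):

```scala
// Cache a temp view and a second cache entry that depends on it
spark.range(10).toDF("id").createOrReplaceTempView("base_view")
spark.sql("CACHE TABLE base_view")
spark.sql("CACHE TABLE derived AS SELECT id * 2 AS id2 FROM base_view")

// With spark.sql.legacy.storeAnalyzedPlanForView=false (the default),
// uncaching the temp view now also invalidates the dependent `derived` cache
spark.catalog.uncacheTable("base_view")
```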

### How was this patch tested?

Added a UT

Closes #31462 from sunchao/SPARK-34347.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-09 20:48:58 -08:00
Gabor Somogyi 0a37a95224 [SPARK-31816][SQL][DOCS] Added high level description about JDBC connection providers for users/developers
### What changes were proposed in this pull request?
The JDBC connection provider API and the embedded connection providers have already been added to the code, but there is no in-depth description of the internals. In this PR I've added both user and developer documentation, and additionally added an example custom JDBC connection provider.

### Why are the changes needed?
There is no documentation and no example custom JDBC provider.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
<img width="793" alt="Screenshot 2021-02-02 at 16 35 43" src="https://user-images.githubusercontent.com/18561820/106623428-e48d2880-6574-11eb-8d14-e5c2aa7c37f1.png">

Closes #31384 from gaborgsomogyi/SPARK-31816.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-10 12:28:28 +09:00
MrPowers e6753c9402 [SPARK-33995][SQL] Expose make_interval as a Scala function
### What changes were proposed in this pull request?

This pull request exposes the `make_interval` function, [as suggested here](https://github.com/apache/spark/pull/31000#pullrequestreview-560812433), and as agreed to [here](https://github.com/apache/spark/pull/31000#issuecomment-754856820) and [here](https://github.com/apache/spark/pull/31000#issuecomment-755040234).

This powerful little function allows for idiomatic datetime arithmetic via the Scala API:

```scala
// add two hours
df.withColumn("plus_2_hours", col("first_datetime") + make_interval(hours = lit(2)))

// subtract one week and 30 seconds
col("d") - make_interval(weeks = lit(1), secs = lit(30))
```

The `make_interval` [SQL function](https://github.com/apache/spark/pull/26446) already exists.

Here is [the JIRA ticket](https://issues.apache.org/jira/browse/SPARK-33995) for this PR.

### Why are the changes needed?

The Spark API makes it easy to perform datetime addition / subtraction with months (`add_months`) and days (`date_add`).  Users need to write code like this to perform datetime addition with years, weeks, hours, minutes, or seconds:

```scala
df.withColumn("plus_2_hours", expr("first_datetime + INTERVAL 2 hours"))
```

We don't want to force users to manipulate SQL strings when they're using the Scala API.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds `make_interval` to the `org.apache.spark.sql.functions` API.

This single function will benefit a lot of users.  It's a small increase in the surface of the API for a big gain.

### How was this patch tested?

This was tested via unit tests.

cc: MaxGekk

Closes #31073 from MrPowers/SPARK-33995.

Authored-by: MrPowers <matthewkevinpowers@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:27:41 +00:00
Angerszhuuuu 2f387b41e8 [SPARK-34137][SQL] Update suquery's stats when build LogicalPlan's stats
### What changes were proposed in this pull request?
When explaining SQL with cost, the treeString of a subquery won't show its statistics:

How to reproduce:
```
spark.sql("create table t1 using parquet as select id as a, id as b from range(1000)")
spark.sql("create table t2 using parquet as select id as c, id as d from range(2000)")

spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")

spark.sql(
  """
    |WITH max_store_sales AS
    |  (SELECT max(csales) tpcds_cmax
    |  FROM (SELECT
    |    sum(b) csales
    |  FROM t1 WHERE a < 100 ) x),
    |best_ss_customer AS
    |  (SELECT
    |    c
    |  FROM t2
    |  WHERE d > (SELECT * FROM max_store_sales))
    |
    |SELECT c FROM best_ss_customer
    |""".stripMargin).explain("cost")
```
Before this PR's output:
```
== Optimized Logical Plan ==
Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
+- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   :  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
   :     +- Aggregate [sum(b#4266L) AS csales#4260L]
   :        +- Project [b#4266L]
   :           +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
   :              +- Relation default.t1[a#4265L,b#4266L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
   +- Relation default.t2[c#4263L,d#4264L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
```

After this PR:
```
== Optimized Logical Plan ==
Project [c#4481L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
+- Filter (isnotnull(d#4482L) AND (d#4482L > scalar-subquery#4480 [])), Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
   :  +- Aggregate [max(csales#4478L) AS tpcds_cmax#4479L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   :     +- Aggregate [sum(b#4484L) AS csales#4478L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   :        +- Project [b#4484L], Statistics(sizeInBytes=1616.0 B, rowCount=101)
   :           +- Filter (isnotnull(a#4483L) AND (a#4483L < 100)), Statistics(sizeInBytes=2.4 KiB, rowCount=101)
   :              +- Relation[a#4483L,b#4484L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
   +- Relation[c#4481L,d#4482L] parquet, Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)

```

### Why are the changes needed?
Complete the statistics in the explain treeString.

### Does this PR introduce _any_ user-facing change?
When users run explain in cost mode, they can see the subquery's statistics too.

### How was this patch tested?
Added UT

Closes #31485 from AngersZhuuuu/SPARK-34137.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:21:45 +00:00
Angerszhuuuu 123365e05c [SPARK-34240][SQL] Unify output of SHOW TBLPROPERTIES clause's output attribute's schema and ExprID
### What changes were proposed in this pull request?
Passing around the output attributes should have more benefits like keeping the exprID unchanged to avoid bugs when we apply more operators above the command output DataFrame.

This PR does two things:

1. After this PR, the output of a `SHOW TBLPROPERTIES` clause shows both `key` and `value` columns, whether or not you specify the table property `key`. Before this PR, the output only showed a `value` column when a `key` was specified.
2. Keep the `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Why are the changes needed?
 1. Keep the output schema of `SHOW TBLPROPERTIES` consistent.
 2. Keep the `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Does this PR introduce _any_ user-facing change?
After this PR, the output of a `SHOW TBLPROPERTIES` clause shows `key` and `value` columns whether or not you specify the table property `key`. Before this PR, the output only showed a `value` column when a `key` was specified.

Before this PR:
```
sql > SHOW TBLPROPERTIES table_name('key')
value
value_of_key
```

After this PR
```
sql > SHOW TBLPROPERTIES table_name('key')
key value
key value_of_key
```

### How was this patch tested?
Added UT

Closes #31378 from AngersZhuuuu/SPARK-34240.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:19:52 +00:00
Kousuke Saruta f79305a402 [SPARK-34311][SQL] PostgresDialect can't treat arrays of some types
### What changes were proposed in this pull request?

This PR fixes the issue that `PostgresDialect` can't handle arrays of some types.
Though PostgreSQL supports a wide range of types (https://www.postgresql.org/docs/13/datatype.html), the current `PostgresDialect` can't handle arrays of the following types.

* xml
* tsvector
* tsquery
* macaddr
* macaddr8
* txid_snapshot
* pg_snapshot
* point
* line
* lseg
* box
* path
* polygon
* circle
* pg_lsn
* bit varying
* interval

NOTE: PostgreSQL doesn't implement arrays of serial types, so this PR doesn't cover them.
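A hypothetical read sketch (connection URL, table, and column names are made up) showing how such an array column would surface after this change:

```scala
// Reading a PostgreSQL table whose `addrs` column has type macaddr[],
// one of the array element types PostgresDialect can now handle
val devices = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "public.devices")
  .option("user", "postgres")
  .option("password", "secret")
  .load()
devices.printSchema()  // the macaddr[] column is expected to surface as an array of strings
```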

### Why are the changes needed?

To provide better support for PostgreSQL.

### Does this PR introduce _any_ user-facing change?

Yes. `PostgresDialect` can now handle arrays of the types shown above.

### How was this patch tested?

New test.

Closes #31419 from sarutak/postgres-array-types.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-10 11:29:14 +09:00
Angerszhuuuu 3e12e9d2ee [SPARK-34238][SQL][FOLLOW_UP] SHOW PARTITIONS Keep consistence with other SHOW command
### What changes were proposed in this pull request?
Keep consistency with the other `SHOW` commands, as discussed at https://github.com/apache/spark/pull/31341#issuecomment-774613080

### Why are the changes needed?
Keep consistency.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31516 from AngersZhuuuu/SPARK-34238-follow-up.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 02:28:05 +00:00
Holden Karau 5248ecb5ab [SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes
### What changes were proposed in this pull request?

Allow users to have Spark attempt to decommission excluded executors.
Since excluded executors may be flaky, this also adds the ability for users to specify a time limit after which a decommissioning executor will be killed by Spark.

### Why are the changes needed?

This may help prevent fetch failures from excluded executors, and also handle the situation in which executors

### Does this PR introduce _any_ user-facing change?

Yes, two new configuration flags for the behaviour.

### How was this patch tested?

Extended unit and integration tests.

Closes #31539 from holdenk/re=enable-SPARK-34104-SPARK-34105.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 18:16:09 -08:00
Liang-Chi Hsieh 1fbd576410 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document
### What changes were proposed in this pull request?

This follows up #31160 to update score function in the document.

### Why are the changes needed?

Currently we use `f_classif`, `chi2`, `f_regression`, which sound like sklearn's naming. It is good to have them, but I think it is nice to have formal score function names in line with sklearn's.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, only doc change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-10 09:24:25 +09:00
HyukjinKwon c8628c943c Revert "[SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes"
This reverts commit 50641d2e3d.
2021-02-10 08:00:03 +09:00
Holden Karau 50641d2e3d [SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes
### What changes were proposed in this pull request?

Allow users to have Spark attempt to decommission excluded executors.
Since excluded executors may be flaky, this also adds the ability for users to specify a time limit after which a decommissioning executor will be killed by Spark.

### Why are the changes needed?

This may help prevent fetch failures from excluded executors, and also handle the situation in which executors

### Does this PR introduce _any_ user-facing change?

Yes, two new configuration flags for the behaviour.

### How was this patch tested?

Extended unit and integration tests.

Closes #31249 from holdenk/configure-inaccessibleList-kill-to-use-decommissioning.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 14:21:24 -08:00
Holden Karau 2b51843ca4 [SPARK-34363][CORE] Add an option for limiting storage for migrated shuffle blocks
### What changes were proposed in this pull request?

Allow users to configure a maximum amount of shuffle blocks to be stored and reject remote shuffle blocks when this threshold is exceeded.
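For illustration only, a sketch of how the new threshold might be set alongside the existing decommission configs; the size-limit key name below is an assumption based on this description, not verified against the merged change:

```scala
import org.apache.spark.SparkConf

// Existing decommission configs plus the new (assumed) disk-size cap for migrated blocks
val conf = new SparkConf()
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
  // Assumed key name: receivers reject migrated shuffle blocks once this much disk is used
  .set("spark.storage.decommission.shuffleBlocks.maxDiskSize", "10g")
```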

### Why are the changes needed?

In disk-constrained environments with large amounts of shuffle data, migrations may result in excessive disk pressure on the nodes. On Kubernetes nodes this can result in cascading failures when combined with `emptyDir`.

### Does this PR introduce _any_ user-facing change?

Yes, new configuration parameter.

### How was this patch tested?

New unit tests.

Closes #31493 from holdenk/SPARK-34337-reject-disk-blocks-when-under-disk-pressure.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 10:21:56 -08:00
Holden Karau cf7a13c363 [SPARK-34209][SQL] Delegate table name validation to the session catalog
### What changes were proposed in this pull request?

Delegate table name validation to the session catalog

### Why are the changes needed?

Querying tables with nested namespaces.
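For example (hypothetical namespaces and table name; assumes the configured session catalog supports nested namespaces):

```scala
// With name validation delegated to the session catalog,
// a table under a nested namespace can be referenced directly
spark.sql("SELECT * FROM ns1.ns2.some_table").show()
```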

### Does this PR introduce _any_ user-facing change?

Yes, SQL queries can now reference tables in nested namespaces.

### How was this patch tested?

Unit tests updated.

Closes #31427 from holdenk/SPARK-34209-delegate-table-name-validation-to-the-catalog.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 10:15:16 -08:00
“attilapiros” b2dc38b654 [SPARK-34334][K8S] Correctly identify timed out pending pod requests as excess request
### What changes were proposed in this pull request?

Fix the identification of timed-out pending pod requests as excess requests to delete, for the case where the excess is higher than the number of newly created timed-out requests and there are also some non-timed-out newly created requests.

### Why are the changes needed?

After https://github.com/apache/spark/pull/29981, only timed-out newly created requests and timed-out pending requests are taken as excess requests.

But there is a small bug when the excess is higher than the number of newly created timed-out requests and there are also some non-timed-out newly created requests, because all of the newly created requests are counted as excess requests when items are chosen from the timed-out pending pod requests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test is added: `SPARK-34334: correctly identify timed out pending pod requests as excess`.

Closes #31445 from attilapiros/SPARK-34334.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 10:06:55 -08:00
Weichen Xu 18b30107ad [MINOR][ML][TESTS] Increase tolerance to make NaiveBayesSuite more robust
### What changes were proposed in this pull request?
Increase the rel tol from 0.2 to 0.35.

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT.

Closes #31536 from WeichenXu123/ES-65815.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-09 23:00:13 +09:00
Angerszhuuuu 7ea3a336b9 [SPARK-34355][CORE][SQL][FOLLOWUP] Log commit time in all File Writer
### What changes were proposed in this pull request?
While working on https://issues.apache.org/jira/browse/SPARK-34399, which is based on https://github.com/apache/spark/pull/31471,
I found that `FileBatchWrite` uses `FileFormatWrite.processStates()` too, so we need to log the commit duration in the other writers as well.
In this PR:

1. Extract a commit-job method in `SparkHadoopWriter`.
2. Address the other commit writers.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #31520 from AngersZhuuuu/SPARK-34355-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-09 16:05:39 +09:00
Dongjoon Hyun ea339c38b4 [SPARK-34407][K8S] KubernetesClusterSchedulerBackend.stop should clean up K8s resources
### What changes were proposed in this pull request?

This PR aims to fix `KubernetesClusterSchedulerBackend.stop` to wrap `super.stop` with `Utils.tryLogNonFatalError`, as sketched below.
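A self-contained sketch of the wrap-and-log pattern with a stand-in helper (this is not Spark's `Utils` implementation); the point is that the cleanup steps after a failing `super.stop()` still run instead of being skipped:

```scala
import scala.util.control.NonFatal

// Stand-in for Utils.tryLogNonFatalError: run a block, log and swallow non-fatal errors
def tryLogNonFatalError(block: => Unit): Unit = {
  try block catch {
    case NonFatal(e) => println(s"Ignoring non-fatal error during stop: ${e.getMessage}")
  }
}

def stop(): Unit = {
  tryLogNonFatalError {
    // super.stop() may throw SparkException; without the wrapper the lines below are skipped
    throw new RuntimeException("simulated failure in super.stop()")
  }
  // K8s cleanup (delete pods, configmaps) still runs here
  println("cleaning up K8s resources")
}
```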

### Why are the changes needed?

[CoarseGrainedSchedulerBackend.stop](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L559) may throw `SparkException` and this causes K8s resource (pod and configmap) leakage.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #31533 from dongjoon-hyun/SPARK-34407.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-08 21:47:23 -08:00
wyp a1e75edc39 [SPARK-34405][CORE] Fix mean value of timersLabels in the PrometheusServlet class
### What changes were proposed in this pull request?
The `getMetricsSnapshot` method of the `PrometheusServlet` class reports a wrong value: it should report the mean value, but it reports the max value.

### Why are the changes needed?

The mean value of `timersLabels` in the `PrometheusServlet` class is wrong; see line 105 of the class:

```
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n")
```
it should be
```
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMean}\n")
```

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

![image](https://user-images.githubusercontent.com/5170878/107313576-cc199280-6acd-11eb-9384-b6abf71c0f90.png)

Closes #31532 from 397090770/SPARK-34405.

Authored-by: wyp <wyphao.2007@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-08 21:18:29 -08:00
yikf 37fe8c6d3c [SPARK-34395][SQL] Clean up unused code for code simplifications
### What changes were proposed in this pull request?
Currently, we pass the value `EmptyRow` to the `checkEvaluation` method in `StringExpressionsSuite`, but that is already the default value of the corresponding `checkEvaluation` parameter.

We can drop the argument to simplify the code.

### Why are the changes needed?
Code simplification.

**before**:
```
def testConcat(inputs: String*): Unit = {
  val expected = if (inputs.contains(null)) null else inputs.mkString
  checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected, EmptyRow)
}
```
**after**:
```
def testConcat(inputs: String*): Unit = {
  val expected = if (inputs.contains(null)) null else inputs.mkString
  checkEvaluation(Concat(inputs.map(Literal.create(_, StringType))), expected)
}
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or Github action.

Closes #31510 from yikf/master.

Authored-by: yikf <13468507104@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 20:37:23 -06:00
gengjiaan e65b28cf7d [SPARK-34352][SQL] Improve SQLQueryTestSuite so as could run on windows system
### What changes were proposed in this pull request?
The current implementation of `SQLQueryTestSuite` cannot run on a Windows system, because the code below fails there:
`assume(TestUtils.testCommandAvailable("/bin/bash"))`

For operating systems that do not provide `/bin/bash`, we just skip those tests (a sketch of the skip pattern follows).
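A minimal, self-contained sketch of the skip pattern (the availability check is a hypothetical stand-in for `TestUtils.testCommandAvailable`); ScalaTest's `assume` cancels the test instead of failing it when the command is unavailable:

```scala
import org.scalatest.funsuite.AnyFunSuite

class BashDependentSuite extends AnyFunSuite {
  // Hypothetical stand-in for TestUtils.testCommandAvailable("/bin/bash")
  private def bashAvailable: Boolean =
    try new ProcessBuilder("/bin/bash", "-c", "true").start().waitFor() == 0
    catch { case _: java.io.IOException => false }

  test("query that shells out to bash") {
    assume(bashAvailable, "/bin/bash is not available; cancelling this test")
    // ... the bash-dependent test body would go here ...
  }
}
```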

### Why are the changes needed?
`SQLQueryTestSuite` has a bug on Windows systems.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #31466 from beliefer/SPARK-34352.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-09 10:58:58 +09:00
yangjie01 777d51e7e3 [SPARK-34374][SQL][DSTREAM] Use standard methods to extract keys or values from a Map
### What changes were proposed in this pull request?
Use the standard methods to extract `keys` or `values` from a `Map`; this is semantically consistent and uses `DefaultKeySet` and `DefaultValuesIterable` instead of a manual loop.

**Before**
```
map.map(_._1)
map.map(_._2)
```

**After**
```
map.keys
map.values
```

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31484 from LuciferYang/keys-and-values.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 15:42:55 -06:00
jiake 3b26bc2536 [SPARK-34168][SQL] Support DPP in AQE when the join is Broadcast hash join at the beginning
### What changes were proposed in this pull request?
This PR enables AQE and DPP together when the join is a broadcast hash join at the beginning, so a query can benefit from the performance improvements of both DPP and AQE at the same time. The PR makes use of the result of the build side and then inserts the DPP filter into the probe side.
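For context, a small sketch of the two configs involved (enabling both so a broadcast-hash-join query can benefit from DPP and AQE together); the tables in the query are hypothetical:

```scala
// Enable both adaptive execution and dynamic partition pruning
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// A typical shape that can trigger DPP: a partitioned fact table joined to a small,
// filtered dimension table (hypothetical table names)
spark.sql(
  """SELECT f.*
    |FROM fact_part f JOIN dim d ON f.part_col = d.id
    |WHERE d.flag = true
    |""".stripMargin).explain()
```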

### Why are the changes needed?
Improve performance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new UT.

Closes #31258 from JkSelf/supportDPP1.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 16:42:52 +00:00
Terry Kim c92e408aa1 [SPARK-34388][SQL] Propagate the registered UDF name to ScalaUDF, ScalaUDAF and ScalaAggregator
### What changes were proposed in this pull request?

This PR proposes to propagate the name used for registering UDFs to `ScalaUDF`, `ScalaUDAF` and `ScalaAggregator`.

Note that `PythonUDF` gets the name correctly: 466c045bfa/python/pyspark/sql/udf.py (L358-L359)
, and same for Hive UDFs:
466c045bfa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala (L67)
### Why are the changes needed?

This PR can help in the following scenarios:
1) Better EXPLAIN output
2) By adding `def name: String` to `UserDefinedExpression`, we can match an expression by `UserDefinedExpression` and look up the catalog, a use case needed for #31273.

### Does this PR introduce _any_ user-facing change?

The EXPLAIN output involving udfs will be changed to use the name used for UDF registration.

For example, for the following:
```
sql("CREATE TEMPORARY FUNCTION test_udf AS 'org.apache.spark.examples.sql.Spark33084'")
sql("SELECT test_udf(col1) FROM VALUES (1), (2), (3)").explain(true)
```
The output of the optimized plan will change from:
```
Aggregate [spark33084(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330846906be0f, 1, 1) AS spark33084(col1)#237]
+- LocalRelation [col1#223]
```
to
```
Aggregate [test_udf(cast(col1#223 as bigint), org.apache.spark.examples.sql.Spark330847a62d697, 1, 1, Some(test_udf)) AS test_udf(col1)#237]
+- LocalRelation [col1#223]
```

### How was this patch tested?

Added new tests.

Closes #31500 from imback82/udaf_name.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 16:02:07 +00:00
yliou d1131bc850 [MINOR][SQL][FOLLOW-UP] Add assertion to FixedLengthRowBasedKeyValueBatch
### What changes were proposed in this pull request?
Adds an assert to the `FixedLengthRowBasedKeyValueBatch#appendRow` method that checks the incoming `vlen` and `klen` by comparing them with the lengths stored as member variables, as a follow-up to https://github.com/apache/spark/pull/30788

### Why are the changes needed?
Add an assert statement to catch similar bugs in the future.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran some tests locally, though not easy to test.

Closes #31447 from yliou/SPARK-33726-Assert.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-08 08:46:01 -06:00
Linhong Liu 037bfb2dbc [SPARK-33438][SQL] Eagerly init objects with defined SQL Confs for command set -v
### What changes were proposed in this pull request?
In Spark, `set -v` is defined as "Queries all properties that are defined in the SQLConf of the sparkSession".
But other external modules also define properties and register them to SQLConf. In that case,
those properties can't be displayed by `set -v` until the conf object is initialized (i.e. the object is accessed at least once).

In this PR, I propose to eagerly initialize all the objects registered to SQLConf, so that `set -v` always outputs
the complete set of properties.
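A minimal Scala sketch of why eager initialization matters (names are hypothetical; registration into SQLConf is stubbed with a `println`):

```scala
// An object's body runs only on first access, so confs registered inside it
// are invisible to `set -v` until something touches the object
object ExternalConfs {
  println("registering external confs")          // imagine SQLConf registration happening here
  val SOME_CONF = "spark.example.externalConf"   // hypothetical conf key
}

// At this point nothing has been registered yet; a `set -v` here would miss the conf.
// Eagerly touching the object (what this PR does for conf-holding objects) forces it:
ExternalConfs.SOME_CONF
// prints "registering external confs"
```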

### Why are the changes needed?
Improve the `set -v` command to produce complete and deterministic results.

### Does this PR introduce _any_ user-facing change?
The `set -v` command will dump more configs.

### How was this patch tested?
existing tests

Closes #30363 from linhongliu-db/set-v.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 22:48:28 +09:00
Max Gekk a85490659f [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In the PR, I propose two new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`

Both options control how ancient dates and timestamp column values are loaded from Parquet files. The `datetimeRebaseMode` option affects loading of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, while `int96RebaseMode` affects loading of `INT96` timestamps.

The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead`, namely:
- `"LEGACY"`: Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`: dates/timestamps are read as is from Parquet files.
- `"EXCEPTION"`: Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.

### Why are the changes needed?
1. The new options allow loading Parquet files from at least two sources in different rebasing modes within the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, this is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` applies to both reads.

2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31489 from MaxGekk/parquet-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 13:28:40 +00:00
HyukjinKwon 70ef196d59 [SPARK-34157][BUILD][FOLLOW-UP] Fix Scala 2.13 compilation error via using Array.deep
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/31245:

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:112:53: value deep is not a member of Array[String]
[error]         assert(sql("show tables").schema.fieldNames.deep ==
[error]                                                     ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:115:72: value deep is not a member of Array[String]
[error]         assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error]                                                                        ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:121:55: value deep is not a member of Array[String]
[error]           assert(sql("show tables").schema.fieldNames.deep ==
[error]                                                       ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala:124:74: value deep is not a member of Array[String]
[error]           assert(sql("show table extended like 'tbl'").schema.fieldNames.deep ==
[error]                                                                          ^
```

It broke the Scala 2.13 build. This PR works around the issue by using ScalaTest's `===`, which can compare `Array`s safely (a small sketch follows).
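A minimal sketch of the workaround pattern (test and value names are illustrative):

```scala
// ScalaTest's === compares Array contents structurally, whereas plain == on arrays
// is reference equality and Array no longer has .deep in Scala 2.13
import org.scalatest.funsuite.AnyFunSuite

class ArrayCompareSuite extends AnyFunSuite {
  test("compare schema field names as arrays") {
    val actual   = Array("namespace", "tableName", "isTemporary")
    val expected = Array("namespace", "tableName", "isTemporary")
    assert(actual === expected)   // passes: element-wise comparison
    // assert(actual == expected) // would fail: compares array references
  }
}
```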

### Why are the changes needed?

To fix the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #31526 from HyukjinKwon/SPARK-34157.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 22:25:59 +09:00
HyukjinKwon 556ecd681a [MINOR] Add a note about pip installation test in RC for release vote template
### What changes were proposed in this pull request?

This PR proposes to add a note about pip installation test in RC for release vote template.

### Why are the changes needed?

To promote PySpark users to test PyPi distribution and pip installation.

### Does this PR introduce _any_ user-facing change?

No. It will be used for release vote.

### How was this patch tested?

N/A

Closes #31527 from HyukjinKwon/minor-update-vote-templ.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 22:24:42 +09:00
Gengliang Wang 88ced28141 [SPARK-33354][DOC] Remove an unnecessary quote in doc
### What changes were proposed in this pull request?

Remove an unnecessary quote in the documentation.
Super trivial.

### Why are the changes needed?

Fix a mistake.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just doc

Closes #31523 from gengliangwang/removeQuote.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 21:08:34 +09:00
Angerszhuuuu 70a79e920a [SPARK-34239][SQL][FOLLOW_UP] SHOW COLUMNS Keep consistence with other SHOW command
### What changes were proposed in this pull request?
Keep consistency with the other `SHOW` commands, as discussed at https://github.com/apache/spark/pull/31341#issuecomment-774613080

### Why are the changes needed?
Keep consistency.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31518 from AngersZhuuuu/SPARK-34239-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 11:39:59 +00:00
gengjiaan 2c243c93d9 [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands does not unify the output and does not pass the output properly to the physical command.
For example, `ShowTables` outputs the attribute `namespace`, but `ShowTablesCommand` outputs the attribute `database`.

This PR passes the output attributes from `ShowTables` to `ShowTablesCommand`, and from `ShowTableExtended` to `ShowTablesCommand`.

Take `show tables` and `show table extended like 'tbl'` as examples.
The output before this PR:
`show tables`
|database|tableName|isTemporary|
|--------|---------|-----------|
|default |tbl      |false      |

If the catalog is a v2 session catalog, the output before this PR:
|namespace|tableName|
|---------|---------|
|default  |tbl      |

`show table extended like 'tbl'`
|database|tableName|isTemporary|information         |
|--------|---------|-----------|--------------------|
|default |tbl      |false      |Database: default...|

The output after this PR:
`show tables`
|namespace|tableName|isTemporary|
|---------|---------|-----------|
|default  |tbl      |false      |

`show table extended like 'tbl'`
|namespace|tableName|isTemporary|information         |
|---------|---------|-----------|--------------------|
|default  |tbl      |false      |Database: default...|

### Why are the changes needed?
This PR has the following benefits:
First, it unifies the output schema of SHOW TABLES.
Second, passing the output attributes keeps the expr IDs unchanged, which avoids bugs when we apply more operators above the command output DataFrame.

### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replaces `database` with `namespace`.

### How was this patch tested?
Jenkins test.

Closes #31245 from beliefer/SPARK-34157.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 08:39:58 +00:00
ulysses-you 9270238473 [SPARK-34355][SQL] Add log and time cost for commit job
### What changes were proposed in this pull request?

Add some info logs around the commit job.

### Why are the changes needed?

The commit job is a heavy operation, and we have seen Spark block at this code path many times due to slow RPCs with the namenode or other services.

It's better to record the time the commit job costs.

### Does this PR introduce _any_ user-facing change?

Yes, more info logs.

### How was this patch tested?

Not needed.

Closes #31471 from ulysses-you/add-commit-log.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-08 16:44:59 +09:00
yangjie01 b344e91368 [SPARK-34375][CORE][K8S][TEST] Replaces 'Mockito.initMocks' with 'Mockito.openMocks'
### What changes were proposed in this pull request?
`Mockito.initMocks(Object)` is a deprecated API; `Mockito.openMocks(Object).close()` should be used instead.
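
A hedged, illustrative sketch of the migration in a Scala test (the suite and field names are hypothetical, not taken from this PR):

```scala
import org.mockito.{Mock, MockitoAnnotations}
import org.scalatest.BeforeAndAfterEach
import org.scalatest.funsuite.AnyFunSuite

class ExampleSuite extends AnyFunSuite with BeforeAndAfterEach {
  @Mock private var service: Runnable = _

  override def beforeEach(): Unit = {
    // Before: MockitoAnnotations.initMocks(this)  // deprecated
    // After: open the annotated mocks and close the returned handle, as this PR prescribes.
    MockitoAnnotations.openMocks(this).close()
  }

  test("annotated mock is initialized") {
    assert(service != null)
  }
}
```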

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.

Closes #31487 from LuciferYang/mockito-api.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 15:13:00 +09:00
Dongjoon Hyun 329f945534 [SPARK-34391][BUILD] Upgrade commons-io to 2.8.0
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-io` from 2.5 to 2.8.0 for Apache Spark 3.2.0.

### Why are the changes needed?

`2.5` was released on 2016-04-22. This will bring the latest bug fixes.
- [2020-09-05: 2.8.0](https://commons.apache.org/proper/commons-io/changes-report.html#a2.8.0)
- [2020-05-24: 2.7](https://commons.apache.org/proper/commons-io/changes-report.html#a2.7)
- [2017-10-15: 2.6](https://commons.apache.org/proper/commons-io/changes-report.html#a2.6)

### Does this PR introduce _any_ user-facing change?

Yes, but this is a compatible dependency change.

### How was this patch tested?

Pass the CIs.

Closes #31503 from dongjoon-hyun/SPARK-34391.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-07 21:53:42 -08:00
Dongjoon Hyun dcaf62afea [SPARK-34346][CORE][TESTS][FOLLOWUP] Fix UT by removing core-site.xml
### What changes were proposed in this pull request?

This is a follow-up for SPARK-34346, which caused flakiness due to the addition of the `core-site.xml` test resource file. This PR aims to remove the test resource `core/src/test/resources/core-site.xml` from the `core` module.

### Why are the changes needed?

Due to the test resource `core-site.xml`, the YARN UT becomes flaky in GitHub Actions and Jenkins.
```
$ build/sbt "yarn/testOnly *.YarnClusterSuite -- -z SPARK-16414" -Pyarn
...
[info] YarnClusterSuite:
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED *** (20 seconds, 209 milliseconds)
[info]   FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:210)
```

For further isolation, we could use `SPARK_TEST_HADOOP_CONF_DIR` as the `yarn` module's `yarn/Client` does, but that seems like overkill for the `core` module.
```
// SPARK-23630: during testing, Spark scripts filter out hadoop conf dirs so that user's
// environments do not interfere with tests. This allows a special env variable during
// tests so that custom conf dirs can be used by unit tests.
val confDirs = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR") ++
  (if (Utils.isTesting) Seq("SPARK_TEST_HADOOP_CONF_DIR") else Nil)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31515 from dongjoon-hyun/SPARK-34346-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 11:32:23 +09:00
Kevin 30ef3d6e1c [SPARK-34158] Incorrect url of the only developer Matei in pom.xml
### What changes were proposed in this pull request?

Update the incorrect URL of the only listed developer, Matei, in pom.xml.

### Why are the changes needed?

The current link was broken

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Changed the link to https://cs.stanford.edu/people/matei/.

Closes #31512 from pingsutw/SPARK-34158.

Authored-by: Kevin <pingsutw@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 09:15:05 +09:00
raphaelauv 34a1a65b39 [SPARK-34398][DOCS] Fix PySpark migration link
### What changes were proposed in this pull request?

Fix a broken link in docs/pyspark-migration-guide.md.

### Why are the changes needed?
broken link

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually built the docs and checked the link.

Closes #31514 from raphaelauv/patch-2.

Authored-by: raphaelauv <raphaelauv@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 09:12:15 +09:00
Dongjoon Hyun eb5558ed35 [SPARK-34390][CORE] Enable Zstandard buffer pool by default
### What changes were proposed in this pull request?

This PR aims to enable ZStandard JNI BufferPool by default in Apache Spark 3.2.0.
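
For context, a hedged sketch of how the behavior could be toggled per application; the config key `spark.io.compression.zstd.bufferPool.enabled` is assumed here, so verify it against the Spark configuration docs before relying on it.

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch (assumed config key): opt back out of the buffer pool if needed.
val spark = SparkSession.builder()
  .appName("zstd-buffer-pool-demo")
  .master("local[*]")
  .config("spark.io.compression.codec", "zstd")
  .config("spark.io.compression.zstd.bufferPool.enabled", "false")
  .getOrCreate()
```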

### Why are the changes needed?

**1. SPEED UP**
SPARK-34387 shows the speed-up on both Java8/Java11 by adding [ZStandardBenchmark](https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/io/ZStandardBenchmark.scala).

**2. MEMORY USAGE**
The following are the memory usage graphs while running [ZStandardBenchmark](https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/io/ZStandardBenchmark.scala) on Java 11 with an increased N (100000) value in order to visualize the effect easily. In the charts, the first half is the memory consumption without the buffer pool, while the second half is the consumption with the buffer pool. The difference is noticeable.
```scala
-  val N = 10000
+  val N = 100000
```

- Compression
![Screenshot from 2021-02-06 18-41-17](https://user-images.githubusercontent.com/9700541/107134909-0c4cfb00-68ab-11eb-9273-82cbecdebfba.png)

- Decompression
![Screenshot from 2021-02-06 18-43-05](https://user-images.githubusercontent.com/9700541/107134927-2edf1400-68ab-11eb-97c4-5cd101e91bb0.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing UTs.

Closes #31502 from dongjoon-hyun/SPARK-34390.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 23:55:17 -08:00
Yuming Wang 6e05e99143 [SPARK-34342][SQL] Format DateLiteral and TimestampLiteral toString
### What changes were proposed in this pull request?

This PR formats the `toString` output of `DateLiteral` and `TimestampLiteral`. For example:
```sql
SELECT * FROM date_dim WHERE d_date BETWEEN (cast('2000-03-11' AS DATE) - INTERVAL 30 days) AND (cast('2000-03-11' AS DATE) + INTERVAL 30 days)
```
Before this pr:
```
Condition : (((isnotnull(d_date#18) AND (d_date#18 >= 10997)) AND (d_date#18 <= 11057))
```
After this pr:
```
Condition : (((isnotnull(d_date#14) AND (d_date#14 >= 2000-02-10)) AND (d_date#14 <= 2000-04-10))
```
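
As an aside (not part of this PR), the raw integers in the unformatted plan are days since the Unix epoch, which is how `DateType` is stored internally; a quick sanity check:

```scala
import java.time.LocalDate

// DateType values are stored as days since 1970-01-01.
val lower = LocalDate.ofEpochDay(10997)  // 2000-02-10
val upper = LocalDate.ofEpochDay(11057)  // 2000-04-10
```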

### Why are the changes needed?

Make the plan more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31455 from wangyum/SPARK-34342.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 19:49:38 -08:00
“attilapiros” cc508d17c7 [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"
### What changes were proposed in this pull request?

With https://github.com/apache/spark/pull/31133, Avro schema evolution was introduced for partitioned Hive tables where the schema is given by `avro.schema.literal`.
Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`.

### Why are the changes needed?

Without this PR the problem described in https://github.com/apache/spark/pull/31133 can be reproduced with tables that use `avro.schema.url`, because in that case the property value given at the partition level is always used for `avro.schema.url`.

So, for example, when a new column (with a default value) is added to the table, one of the following problems happens:
-  when the new field is added after the last one, the cell values will be null instead of the default value
-  when the schema is extended somewhere before the last field, the values will be listed in the wrong column positions

A similar error happens when one of the fields is removed from the schema.

For details please check the attached unit tests where both cases are checked.
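
A hypothetical illustration of the scenario (the table name and schema path are made up; this is not code from the PR): point the table-level `avro.schema.url` at an evolved schema and then read partitions that were written with the old schema.

```scala
// Hedged sketch, assuming a Hive-enabled SparkSession named `spark`.
spark.sql(
  """ALTER TABLE avro_tbl
    |SET TBLPROPERTIES ('avro.schema.url' = '/tmp/schemas/avro_tbl_v2.avsc')""".stripMargin)

// With this fix, old partitions are read with the evolved table-level schema,
// so a newly added column shows its default value instead of shifted data.
spark.sql("SELECT * FROM avro_tbl").show()
```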

### Does this PR introduce _any_ user-facing change?

Fixes the potential value error.

### How was this patch tested?

The existing unit tests for schema evolution are generalized and reused.
New tests:
- `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)`
- `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)`

Closes #31501 from attilapiros/SPARK-34370.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 17:25:39 -08:00
Max Gekk 6845d26057 [SPARK-34385][SQL] Unwrap SparkUpgradeException in v2 Parquet datasource
### What changes were proposed in this pull request?
Unwrap `SparkUpgradeException` from `ParquetDecodingException` in the v2 `FilePartitionReader` in the same way as the v1 implementation does: 3a299aa648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (L180-L183)
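
A hedged sketch of the unwrapping pattern the v1 reader uses (illustrative only; the helper name is made up and this is not the exact v2 change):

```scala
import org.apache.parquet.io.ParquetDecodingException
import org.apache.spark.SparkUpgradeException

// Re-throw the more informative SparkUpgradeException instead of the Parquet wrapper.
def rethrowUnwrapped[T](body: => T): T =
  try {
    body
  } catch {
    case e: ParquetDecodingException if e.getCause.isInstanceOf[SparkUpgradeException] =>
      throw e.getCause
  }
```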

### Why are the changes needed?
1. To be compatible with v1 implementation of the Parquet datasource.
2. To improve UX with Spark SQL by making `SparkUpgradeException` more visible.

### Does this PR introduce _any_ user-facing change?
Yes, it can: `SparkUpgradeException` is now surfaced directly instead of being wrapped in `ParquetDecodingException`.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31497 from MaxGekk/parquet-spark-upgrade-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 16:49:15 -08:00
Dongjoon Hyun 466c045bfa [SPARK-34387][CORE][TESTS] Add ZStandardBenchmark
### What changes were proposed in this pull request?

This PR aims to add `ZStandardBenchmark` as a baseline.

### Why are the changes needed?

This will prevent any regression when we upgrade the Zstandard library in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

Closes #31498 from dongjoon-hyun/SPARK-ZSTD-BENCH.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-06 15:34:40 -08:00
Ruifeng Zheng 178dc50b7a [SPARK-34356][ML] OVR transform fix potential column conflict
### What changes were proposed in this pull request?
1. Clear `predictionCol` & `probabilityCol` and use a temporary raw-prediction column, to avoid potential column conflicts;
2. Use an array instead of a map, to keep in line with the Python side;
3. Simplify `transform`.

### Why are the changes needed?
If the input dataset has a column named `predictionCol`, `probabilityCol`, or `rawPredictionCol`, `transform` will fail; a hypothetical repro is sketched below.
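
A hypothetical repro sketch (the data and column values are made up; this is not the PR's test): an input DataFrame that already carries a `rawPrediction` column used to clash with the temporary columns produced by the binary submodels during `transform`.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.linalg.Vectors

// Assumes a SparkSession named `spark`.
val df = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0), 0.1),
  (1.0, Vectors.dense(1.0, 0.0), 0.9)
)).toDF("label", "features", "rawPrediction")  // pre-existing conflicting column

val ovr = new OneVsRest().setClassifier(new LogisticRegression())
val model = ovr.fit(df.drop("rawPrediction"))
model.transform(df).show()  // previously could fail because of the name clash
```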

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added testsuite

Closes #31472 from zhengruifeng/ovr_submodel_skip_pred_prob.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 17:03:19 -06:00
tanel.kiis@gmail.com c73f70bb0d [SPARK-34141][SQL] Remove side effect from ExtractGenerator
### What changes were proposed in this pull request?

Rewrote one `ExtractGenerator` case such that it does not rely on a side effect of the `flatMap` function.

### Why are the changes needed?

With the DataFrame API it is possible to have a lazy sequence as the `output` of a `LogicalPlan`. When exploding a column of such a DataFrame using the `withColumn("newName", explode(col("name")))` method, `ExtractGenerator` does not extract the generator and `CheckAnalysis` throws an exception.

### Does this PR introduce _any_ user-facing change?

Bug fix.
Before this, the workaround was to put `.select("*")` before the explode, as sketched below.
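
An illustrative sketch of the two call patterns (simplified; the real trigger involves a plan whose `output` is a lazily evaluated sequence, which is not reproduced here):

```scala
import org.apache.spark.sql.functions.{col, explode}
// Assumes a SparkSession named `spark`.
import spark.implicits._

val df = Seq((1, Seq("a", "b"))).toDF("id", "names")

// The reported failure mode: exploding directly via withColumn.
val exploded = df.withColumn("name", explode(col("names")))

// The pre-fix workaround mentioned above: force a projection first.
val workaround = df.select("*").withColumn("name", explode(col("names")))
```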

### How was this patch tested?

UT

Closes #31213 from tanelk/SPARK-34141_extract_generator.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 13:27:07 -06:00
Ruifeng Zheng a9969faca7 [SPARK-34291][ML] LSH hashDistance optimization
### What changes were proposed in this pull request?
`hashDistance` optimization: if the two hash vectors in a pair are the same, return 0.0 directly (see the sketch below).
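
A conceptual sketch of the short-circuit (simplified; `elementWiseDistance` is a stand-in for the model's existing element-wise comparison, not the actual MinHash code):

```scala
import org.apache.spark.ml.linalg.Vector

// Stand-in for the existing element-wise hash comparison.
def elementWiseDistance(x: Array[Vector], y: Array[Vector]): Double =
  x.zip(y).map { case (u, v) =>
    u.toArray.zip(v.toArray).count { case (a, b) => a != b }.toDouble / u.size
  }.min

def hashDistance(x: Array[Vector], y: Array[Vector]): Double =
  if (x.sameElements(y)) 0.0  // identical hash arrays: skip the comparison entirely
  else elementWiseDistance(x, y)
```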

### Why are the changes needed?
It should be faster than the existing implementation because of the short-circuit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #31394 from zhengruifeng/min_hash_distance_opt.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 13:25:49 -06:00
Xinrong Meng 747ad1809b [PYTHON][MINOR] Fix docstring of DataFrame.join
### What changes were proposed in this pull request?
Fix docstring of PySpark `DataFrame.join`.

### Why are the changes needed?
For a better view of PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No (only documentation changes).

### How was this patch tested?
Manual test.

From
![image](https://user-images.githubusercontent.com/47337188/106977730-c14ab080-670f-11eb-8df8-5aea90902104.png)

To
![image](https://user-images.githubusercontent.com/47337188/106977834-ed663180-670f-11eb-9c5e-d09be26e0ca8.png)

Closes #31463 from xinrong-databricks/fixDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 09:08:49 -06:00
“attilapiros” e614f34c7a [SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal"
### What changes were proposed in this pull request?

Before this PR, for a partitioned Avro Hive table, when the SerDe is configured to read the partition data,
the table-level properties were overwritten by the partition-level properties.

This PR changes this ordering by giving table-level properties higher precedence, so that when a new evolved schema
is set for the table, this new schema is used to read the partition data instead of the original schema that was used for writing the data.

This new behavior is consistent with Apache Hive.
See the example used in the unit test `SPARK-26836: support Avro schema evolution`; in Hive this results in:

```
0: jdbc:hive2://<IP>:10000> select * from t;
INFO  : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds
INFO  : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t
INFO  : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds
INFO  : OK
+---------------+-------------+-------------+
|    t.col1     |   t.col2    |    t.ds     |
+---------------+-------------+-------------+
| col1_default  | col2_value  | 1981-01-07  |
| col1_value    | col2_value  | 1983-04-27  |
+---------------+-------------+-------------+
2 rows selected (0.159 seconds)
```

### Why are the changes needed?

Without this change the old schema would be used. This can cause a correctness issue when the new schema introduces
a new field with a default value (following the rules of schema evolution) before an existing field.
In this case the rows coming from the partitions where the old schema was used will **contain values in the wrong column positions**.

For example check the attached unit test `SPARK-26836: support Avro schema evolution`

Without this fix the result of the select on the table would be:

```
+----------+----------+----------+
|      col1|      col2|        ds|
+----------+----------+----------+
|col2_value|      null|1981-01-07|
|col1_value|col2_value|1983-04-27|
+----------+----------+----------+

```

With this fix:

```
+------------+----------+----------+
|        col1|      col2|        ds|
+------------+----------+----------+
|col1_default|col2_value|1981-01-07|
|  col1_value|col2_value|1983-04-27|
+------------+----------+----------+
```

### Does this PR introduce _any_ user-facing change?

It just fixes the value errors.
When a new column is introduced, even at the last position, the given default will be used instead of `null`.

### How was this patch tested?

This was tested with the unit tests included in the PR,
and manually on Apache Spark / Hive.

Closes #31133 from attilapiros/SPARK-26836.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-05 10:56:25 -08:00
Kousuke Saruta 1f4135c4bd [SPARK-34350][SQL][TESTS] replace withTimeZone defined in OracleIntegrationSuite with DateTimeTestUtils.withDefaultTimeZone
### What changes were proposed in this pull request?

This PR replaces `withTimeZone`, defined and used in `OracleIntegrationSuite`, with `DateTimeTestUtils.withDefaultTimeZone`, which is already available as a shared utility method.
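
A hedged usage sketch of the utility (assuming the `ZoneId`-based signature of `DateTimeTestUtils.withDefaultTimeZone`; the block body is illustrative):

```scala
import java.time.ZoneId
import org.apache.spark.sql.catalyst.util.DateTimeTestUtils

DateTimeTestUtils.withDefaultTimeZone(ZoneId.of("America/Los_Angeles")) {
  // Run timezone-sensitive assertions here; the previous default zone is restored afterwards.
}
```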

### Why are the changes needed?

Both methods are semantically the same, so it is better to use the shared utility one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`OracleIntegrationSuite` passes.

Closes #31465 from sarutak/oracle-timezone-util.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-05 23:26:22 +09:00