ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gengliang Wang	b0249851f6	Preparing development version 3.2.1-SNAPSHOT	2021-09-18 11:30:12 +00:00
Gengliang Wang	96044e9735	Preparing Spark release v3.2.0-rc3	2021-09-18 11:30:06 +00:00
Liang-Chi Hsieh	275ad6bd0b	[SPARK-36673][SQL][FOLLOWUP] Remove duplicate test in DataFrameSetOperationsSuite ### What changes were proposed in this pull request? As a followup of #34025 to remove duplicate test. ### Why are the changes needed? To remove duplicate test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #34032 from viirya/remove. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `f9644cc253`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-17 11:52:26 -07:00
Angerszhuuuu	61d7f1da1b	[SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test ### What changes were proposed in this pull request? Add comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN and add Unit test for this ### Why are the changes needed? Add Unit test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34008 from AngersZhuuuu/SPARK-36740. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `69e006dd53`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:42:21 +08:00
Liang-Chi Hsieh	895218996a	[SPARK-36673][SQL] Fix incorrect schema of nested types of union ### What changes were proposed in this pull request? This patch proposes to fix incorrect schema of `union`. ### Why are the changes needed? The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition. So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug of incorrect schema. ### How was this patch tested? Added unit test. Closes #34025 from viirya/SPARK-36673. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `cdd7ae937d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:37:40 +08:00
Hyukjin Kwon	7d7c9915bb	[SPARK-36788][SQL] Change log level of AQE for non-supported plans from warning to debug ### What changes were proposed in this pull request? This PR suppresses the warnings for plans where AQE is not supported. Currently we show the warnings such as: ``` org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] ``` for every plan that AQE is not supported. ### Why are the changes needed? It's too noisy now. Below is the example of `SortSuite` run: ``` 14:51:40.675 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324881] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=true, sortOrder=List('a DESC NULLS FIRST) (785 milliseconds) 14:51:41.416 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324884 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324884] . 14:51:41.467 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324884 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324884] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS FIRST) (796 milliseconds) 14:51:42.210 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324887 ASC NULLS LAST], true +- Scan ExistingRDD[a#324887] . 14:51:42.259 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324887 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324887] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS LAST) (797 milliseconds) 14:51:43.009 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324890 DESC NULLS LAST], true +- Scan ExistingRDD[a#324890] . 14:51:43.061 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324890 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324890] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS LAST) (848 milliseconds) 14:51:43.857 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324893 DESC NULLS FIRST], true +- Scan ExistingRDD[a#324893] . 14:51:43.903 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324893 DESC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324893] . [info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS FIRST) (827 milliseconds) 14:51:44.682 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324896 ASC NULLS FIRST], true +- Scan ExistingRDD[a#324896] . 14:51:44.748 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324896 ASC NULLS FIRST], true, 23 +- Scan ExistingRDD[a#324896] . [info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS FIRST) (565 milliseconds) 14:51:45.248 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324899 ASC NULLS LAST], true +- Scan ExistingRDD[a#324899] . 14:51:45.312 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324899 ASC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324899] . [info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS LAST) (591 milliseconds) 14:51:45.841 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324902 DESC NULLS LAST], true +- Scan ExistingRDD[a#324902] . 14:51:45.905 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324902 DESC NULLS LAST], true, 23 +- Scan ExistingRDD[a#324902] . ``` ### Does this PR introduce _any_ user-facing change? Yes, it will show less warnings to users. Note that AQE is enabled by default from Spark 3.2, see SPARK-33679 ### How was this patch tested? Manually tested via unittests. Closes #34026 from HyukjinKwon/minor-log-level. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `917d7dad4d`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 12:01:53 +09:00
Josh Rosen	3502fda783	[SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite ### What changes were proposed in this pull request? This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`. The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`. In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary. ### Why are the changes needed? Previously, `SparkSubmitSuite` tests would fail with messages like: ``` [info] - launch simple application with spark-submit * FAILED * (1 second, 832 milliseconds) [info] Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command. After this change, those tests will fail with detailed error messages that include the text of failed command plus timestamped logs captured from the failed proces: ``` [info] - launch simple application with spark-submit * FAILED * (2 seconds, 800 milliseconds) [info] spark-submit returned with exit code 101. [info] Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar' [info] [info] 2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] [info] 2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually ran the affected test suites. Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit `3ae6e6775b`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2021-09-16 14:31:22 -07:00
Dongjoon Hyun	63b8417794	[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes. ### Why are the changes needed? Apache ORC 1.6.11 has the following fixes. - https://issues.apache.org/jira/projects/ORC/versions/12350499 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33971 from dongjoon-hyun/SPARK-36732. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c217797297`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-15 23:36:36 -07:00
Chao Sun	a7dc8242ea	[SPARK-36726] Upgrade Parquet to 1.12.1 ### What changes were proposed in this pull request? Upgrade Apache Parquet to 1.12.1 ### Why are the changes needed? Parquet 1.12.1 contains the following bug fixes: - PARQUET-2064: Make Range public accessible in RowRanges - PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` - PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding - PARQUET-1633: Fix integer overflow - PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile - PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats - PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase - PARQUET-2078: Failed to read parquet file after writing with the same In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests + a new test for the issue in SPARK-36696 Closes #33969 from sunchao/upgrade-parquet-12.1. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit `a927b0836b`) Signed-off-by: DB Tsai <d_tsai@apple.com>	2021-09-15 19:17:49 +00:00
Lukas Rytz	2e7583799e	[SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile) As [reported on `devspark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom. ### What changes were proposed in this pull request? This PR suggests to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the the `change-scala-version.sh` script. I included an upgrade to scala-parallel-collections version 1.0.3, the changes compared to 0.2.0 are minor. - removed OSGi metadata - renamed some internal inner classes - added `Automatic-Module-Name` ### Why are the changes needed? According to the posts, this solves issues for developers that write unit tests for their applications. Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Locally Closes #33948 from lrytz/parCollDep. Authored-by: Lukas Rytz <lukas.rytz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit `1a62e6a2c1`) Signed-off-by: Sean Owen <srowen@gmail.com>	2021-09-13 11:06:58 -05:00
Liang-Chi Hsieh	b52fbeee2d	[SPARK-36669][SQL] Add Lz4 wrappers for Hadoop Lz4 codec ### What changes were proposed in this pull request? This patch proposes to add a few LZ4 wrapper classes for Parquet Lz4 compression output that uses Hadoop Lz4 codec. ### Why are the changes needed? Currently we use Hadop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec. But it isn't excluded from relocation in these libraries. So to use lz4 as Parquet codec, we will hit the exception even we include lz4 as dependency. ``` [info] Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory [info] at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66) [info] at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119) [info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152) [info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168) ``` Before the issue is fixed at Hadoop new release, we can add a few wrapper classes for Lz4 codec. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified test. Closes #33940 from viirya/lz4-wrappers. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `6bcf330191`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-09 09:31:11 -07:00
Andrew Liu	6cb23c163c	[SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe ### What changes were proposed in this pull request? fix SimplifyConditionalsInPredicate to be null-safe Reproducible: ``` import org.apache.spark.sql.types.{StructField, BooleanType, StructType} import org.apache.spark.sql.Row val schema = List( StructField("b", BooleanType, true) ) val data = Seq( Row(true), Row(false), Row(null) ) val df = spark.createDataFrame( spark.sparkContext.parallelize(data), StructType(schema) ) // cartesian product of true / false / null val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal")) df2.createOrReplaceTempView("df2") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // actual: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // +-----+--------+ spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // expected: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // \| null\| true\| // +-----+--------+ ``` ### Why are the changes needed? is a regression that leads to incorrect results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate. Authored-by: Andrew Liu <andrewlliu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `9b633f2075`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-09 11:32:59 +08:00
Huaxin Gao	7e8860751c	[SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference ### What changes were proposed in this pull request? Currently, we have `FieldReference` for aggregate column type, should be `NamedReference` instead ### Why are the changes needed? `FieldReference` is a private class, should use `NamedReference` instead ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #33927 from huaxingao/agg_followup. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `23794fb303`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-08 14:06:26 +08:00
yangjie01	c4332c7bf0	[SPARK-36684][SQL][TESTS] Add Jackson test dependencies to `sql/core` module at `hadoop-2.7` profile ### What changes were proposed in this pull request? SPARK-26346 upgrade Parquet related modules from 1.10.1 to 1.11.1 and `parquet-jackson 1.11.1` use `com.fasterxml.jackson` instead of `org.codehaus.jackson`. So, there are warning logs related to ``` 17:12:17.605 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated ... Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ... ``` when test `sql/core` modules with `hadoop-2.7` profile. This pr adds test dependencies related to `org.codehaus.jackson` in `sql/core` module when `hadoop-2.7` profile is activated. ### Why are the changes needed? Clean up test warning logs that shouldn't exist. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA or Jenkins Tests. - Manual test `mvn clean test -pl sql/core -am -DwildcardSuites=none -Phadoop-2.7` Before No test failed, but warning logs as follows: ``` [INFO] Running test.org.apache.spark.sql.JavaBeanDeserializationSuite 22:42:45.211 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22:42:46.827 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650) at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:62) at org.apache.spark.sql.internal.SharedState$.liftedTree1$1(SharedState.scala:181) at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:180) at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:54) at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:135) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:135) at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:134) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:335) at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:109) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:109) at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:194) at org.apache.spark.sql.types.DataType.sameType(DataType.scala:97) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1(TypeCoercion.scala:291) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:291) at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85) at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82) at scala.collection.immutable.List.forall(List.scala:89) at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType(TypeCoercion.scala:291) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1074) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1069) at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1080) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1079) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1084) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1084) at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:37) at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.$anonfun$dataType$4(objects.scala:815) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.dataType(objects.scala:815) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.$anonfun$dataType$9(complexTypeCreator.scala:416) at scala.collection.immutable.List.map(List.scala:290) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType$lzycompute(complexTypeCreator.scala:410) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:409) at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:398) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:309) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:319) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:248) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:154) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testBeanWithArrayFieldDeserialization(JavaBeanDeserializationSuite.java:75) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:364) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:237) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:158) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162) at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548) Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.<clinit>(WebHdfsFileSystem.java:129) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) ... 81 more Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ... 88 more ``` After There are no more warning logs like above Closes #33926 from LuciferYang/SPARK-36684. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit `acd9c92fa8`) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-09-07 21:40:53 -07:00
Liang-Chi Hsieh	e39948fada	[SPARK-36670][SQL][TEST] Add FileSourceCodecSuite ### What changes were proposed in this pull request? This patch mainly proposes to add some e2e test cases in Spark for codec used by main datasources. ### Why are the changes needed? We found there is no e2e test cases available for main datasources like Parquet, Orc. It makes developers harder to identify possible bugs early. We should add such tests in Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests. Closes #33912 from viirya/SPARK-36670. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `5a0ae694d0`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-07 16:53:25 -07:00
Andy Grove	533f655690	[SPARK-36666][SQL] Fix regression in AQEShuffleReadExec Fix regression in AQEShuffleReadExec when used in conjunction with Spark plugins with custom partitioning. Signed-off-by: Andy Grove <andygrove73gmail.com> ### What changes were proposed in this pull request? Return `UnknownPartitioning` rather than throw an exception in `AQEShuffleReadExec`. ### Why are the changes needed? The [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) replaces `AQEShuffleReadExec` with a custom operator that runs on the GPU. Due to changes in [SPARK-36315](`dd80457ffb`), Spark now throws an exception if the shuffle exchange does not have recognized partitioning, and this happens before the postStageOptimizer rules so there is no opportunity to replace this operator now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I am still in the process of testing this change. I will update the PR in the next few days with status. Closes #33910 from andygrove/SPARK-36666. Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit `f78d8394dc`) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-09-07 13:50:00 -07:00
Jungtaek Lim	e16c886b87	[SPARK-36667][SS][TEST] Close resources properly in StateStoreSuite/RocksDBStateStoreSuite ### What changes were proposed in this pull request? This PR proposes to ensure StateStoreProvider instances are properly closed for each test in StateStoreSuite/RocksDBStateStoreSuite. ### Why are the changes needed? While this doesn't break the test, this is a bad practice and may possibly make nasty problems in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs Closes #33916 from HeartSaVioR/SPARK-36667. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `093c2080fe`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-06 17:40:13 -07:00
Angerszhuuuu	8b4cc90c44	[SPARK-36637][SQL] Provide proper error message when use undefined window frame ### What changes were proposed in this pull request? Two case of using undefined window frame as below should provide proper error message 1. For case using undefined window frame with window function ``` SELECT nth_value(employee_name, 2) OVER w second_highest_salary FROM basic_pays; ``` origin error message is ``` Window function nth_value(employee_name#x, 2, false) requires an OVER clause. ``` It's confused that in use use a window frame `w` but it's not defined. Now the error message is ``` Window specification w is not defined in the WINDOW clause. ``` 2. For case using undefined window frame with aggregation function ``` SELECT SUM(salary) OVER w sum_salary FROM basic_pays; ``` origin error message is ``` Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34] +- SubqueryAlias spark_catalog.default.basic_pays +- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []] ``` In this case, when convert GlobalAggregate, should skip UnresolvedWindowExpression Now the error message is ``` Window specification w is not defined in the WINDOW clause. ``` ### Why are the changes needed? Provide proper error message ### Does this PR introduce _any_ user-facing change? Yes, error messages are improved as described in desc ### How was this patch tested? Added UT Closes #33892 from AngersZhuuuu/SPARK-36637. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `568ad6aa44`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-02 22:32:47 +08:00
Gengliang Wang	1bad04d028	Preparing development version 3.2.1-SNAPSHOT	2021-08-31 17:04:14 +00:00
Gengliang Wang	03f5d23e96	Preparing Spark release v3.2.0-rc2	2021-08-31 17:04:08 +00:00
Jungtaek Lim	9a71c4ca84	[SPARK-36619][SS] Fix bugs around prefix-scan for HDFS backed state store and RocksDB state store ### What changes were proposed in this pull request? This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store. > HDFS backed state store We did "shallow-copy" on copying prefix map, which leads the values of prefix map (mutable Set) to be "same instances" across multiple versions. This PR fixes it via creating a new mutable Set and copying elements. > RocksDB state store Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort(). While `RocksDBStateStore.abort()` method will be called for streaming session window (since it has two physical plans for read and write), other stateful operators which only have read-write physical plan will call either commit or abort, and don't close the iterators on committing. These unclosed iterators can be "reused" and produce incorrect outputs. This PR ensures that resetting prefix-scan iterators is done on loading RocksDB, which was only done in rollback. ### Why are the changes needed? Please refer the above section on explanation of bugs and treatments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified UT which failed without this PR and passes with this PR. Closes #33870 from HeartSaVioR/SPARK-36619. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `60a72c938a`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-01 00:51:58 +08:00
gengjiaan	d42536a6ee	[SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source ### What changes were proposed in this pull request? Spark SQL includes a data source that can read data from other databases using JDBC. Spark also supports the case-insensitive option `pushDownPredicate`. According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, If set `pushDownPredicate` to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. But I find it still be pushed down to JDBC data source. ### Why are the changes needed? Fix bug `pushDownPredicate`=false failed to prevent push down filters to JDBC data source. ### Does this PR introduce _any_ user-facing change? 'No'. The output of query will not change. ### How was this patch tested? Jenkins test. Closes #33822 from beliefer/SPARK-36574. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `fcc91cfec4`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-30 19:09:45 +08:00
Gengliang Wang	3719d87668	[SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide ### What changes were proposed in this pull request? The `try_add` function allows the following inputs: - number, number - date, number - date, interval - timestamp, interval - interval, interval And, the `try_divide` function allows the following inputs: - number, number - interval, number However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too. ### Why are the changes needed? Improve documentation and tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Also build docs for preview: ![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png) Closes #33861 from gengliangwang/enhanceTryDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8a52ad9f82`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-29 10:30:13 +09:00
Jungtaek Lim	118a53d87f	[SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc ### What changes were proposed in this pull request? This PR proposes to document `window` & `session_window` function in SQL API doc page. Screenshot of functions: > window ![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png) > session_window ![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png) ### Why are the changes needed? Description is missing in both `window` / `session_window` functions for SQL API page. ### Does this PR introduce _any_ user-facing change? Yes, the description of `window` / `session_window` functions will be available in SQL API page. ### How was this patch tested? Only doc changes. Closes #33846 from HeartSaVioR/SPARK-36595. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit `bc32144a91`) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-08-27 12:39:21 +09:00
Gengliang Wang	eca81cc0ae	[SPARK-36457][DOCS][3.2] Review and fix issues in Scala/Java API docs ### What changes were proposed in this pull request? Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues: - Add missing `Since` annotation for new APIs - Remove the leaking class/object in API doc ### Why are the changes needed? Improve API docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #33845 from gengliangwang/SPARK-36457-3.2. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-27 10:51:27 +08:00
Cheng Su	c21303f02c	[SPARK-36594][SQL][3.2] ORC vectorized reader should properly check maximal number of fields ### What changes were proposed in this pull request? This is the patch on branch-3.2 for https://github.com/apache/spark/pull/33842. See the description in the other PR. ### Why are the changes needed? Avoid OOM/performance regression when reading ORC table with nested column types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `OrcSourceSuite.scala`. Closes #33843 from c21/branch-3.2. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-26 14:55:21 +08:00
Hyukjin Kwon	26ae9e93da	[SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization ### What changes were proposed in this pull request? This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning. ```python import pyspark.pandas as ps ps.set_option('compute.default_index_type', 'distributed-sequence') ps.range(10).id.value_counts().to_frame().spark.explain() ``` Before: ```bash == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Sort [count#51L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70] +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L]) +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67] +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L]) +- Project [id#37L] +- Filter atleastnnonnulls(1, id#37L) +- Scan ExistingRDD[__index_level_0__#36L,id#37L] # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed) ``` After: ```bash == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Sort [count#275L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174] +- HashAggregate(keys=[id#258L], functions=[count(1)]) +- HashAggregate(keys=[id#258L], functions=[partial_count(1)]) +- Filter atleastnnonnulls(1, id#258L) +- Range (0, 10, step=1, splits=16) # ^^^ Removed the Spark job execution for `zipWithIndex` ``` ### Why are the changes needed? To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`. Closes #33807 from HyukjinKwon/SPARK-36559. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `93cec49212`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-25 10:03:00 +09:00
Gengliang Wang	a313082d67	[SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package. ### Why are the changes needed? There are two packages for `org.apache.spark.sql.connector` SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst. ### Does this PR introduce _any_ user-facing change? No, the trait is not released yet. ### How was this patch tested? Existing UT. Closes #33826 from gengliangwang/moveLocalScan. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `5b4c216478`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-24 13:24:06 -07:00
Huaxin Gao	e48de7884d	[SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package ### What changes were proposed in this pull request? Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there. ### Why are the changes needed? Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #33815 from huaxingao/agg_package. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `cd2342691d`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-08-23 15:31:35 -07:00
Gengliang Wang	69be513c5e	Preparing development version 3.2.1-SNAPSHOT	2021-08-20 12:40:47 +00:00
Gengliang Wang	6bb3523d8e	Preparing Spark release v3.2.0-rc1	2021-08-20 12:40:40 +00:00
Gengliang Wang	fafdc1482b	Revert "Preparing Spark release v3.2.0-rc1" This reverts commit `8e58fafb05`.	2021-08-20 20:07:02 +08:00
Gengliang Wang	c829ed53ff	Revert "Preparing development version 3.2.1-SNAPSHOT" This reverts commit `4f1d21571d`.	2021-08-20 20:07:01 +08:00
Gengliang Wang	4f1d21571d	Preparing development version 3.2.1-SNAPSHOT	2021-08-19 14:08:32 +00:00
Gengliang Wang	8e58fafb05	Preparing Spark release v3.2.0-rc1	2021-08-19 14:08:26 +00:00
Yuming Wang	5b971650bd	[SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning ### What changes were proposed in this pull request? Remove `OptimizeSubqueries` from batch of `PartitionPruning` to make DPP support more cases. For example: ```sql SELECT date_id, product_id FROM fact_sk f JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s ON f.store_id = s.new_store_id ``` Before this PR: ``` == Physical Plan == (2) Project [date_id#3998, product_id#3999] +- (2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- (2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274] +- (1) Project [(store_id#4002 + 3) AS new_store_id#3997] +- (1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) +- (1) ColumnarToRow +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> ``` After this PR: ``` == Physical Plan == (2) Project [date_id#3998, product_id#3999] +- (2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- (2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> : +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262] : +- (1) Project [(store_id#4002 + 3) AS new_store_id#3997] : +- (1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) : +- (1) ColumnarToRow : +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> +- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262] ``` This is because `OptimizeSubqueries` will infer more filters, so we cannot reuse broadcasts. The following is the plan if disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`: ``` == Physical Plan == (2) Project [date_id#3998, product_id#3999] +- (2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- (2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> : +- Subquery subquery#4009, [id=#284] : +- (2) HashAggregate(keys=[new_store_id#3997#4008], functions=[]) : +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280] : +- (1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[]) : +- (1) Project [(store_id#4002 + 3) AS new_store_id#3997] : +- (1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) : +- (1) ColumnarToRow : +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305] +- (1) Project [(store_id#4002 + 3) AS new_store_id#3997] +- (1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) +- *(1) ColumnarToRow +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> ``` ### Why are the changes needed? Improve DPP to support more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and benchmark test: SQL \| Before this PR(Seconds) \| After this PR(Seconds) -- \| -- \| -- TPC-DS q58 \| 40 \| 20 TPC-DS q83 \| 18 \| 14 Closes #33664 from wangyum/SPARK-36444. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit `2310b99e14`) Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-08-19 16:45:22 +08:00
Shixiong Zhu	54cca7f82e	[SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries ### What changes were proposed in this pull request? RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version. A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decide to rollback. But in the old Spark version, the old RocksDB version cannot read the new format. In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can rollback their Spark version if needed. We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint. ### Why are the changes needed? Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change. ### Does this PR introduce _any_ user-facing change? No. RocksDB state store is a new feature in Spark 3.2, which has not yet released. ### How was this patch tested? The new unit tests. Closes #33749 from zsxwing/SPARK-36519. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `ea4919801a`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-08-19 00:23:52 -07:00
gengjiaan	3d69d0d003	[SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp ### What changes were proposed in this pull request? The implement of https://github.com/apache/spark/pull/33665 make `make_timestamp` could accepts integer type as the seconds parameter. This PR let `make_timestamp` accepts `decimal(16,6)` type as the seconds parameter and cast integer to `decimal(16,6)` is safe, so we can simplify the code. ### Why are the changes needed? Simplify `make_timestamp`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? New tests. Closes #33775 from beliefer/SPARK-36428-followup. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `707eefa3c7`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-18 22:57:27 +08:00
Gengliang Wang	70635b4b26	Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases" ### What changes were proposed in this pull request? Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129) ### Why are the changes needed? It turns out that many users are using the group by alias feature. Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable. Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too. As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode. ### Does this PR introduce _any_ user-facing change? No, the feature is not released yet. ### How was this patch tested? Unit tests Closes #33758 from gengliangwang/revertGroupByAlias. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `8bfb4f1e72`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-17 20:24:09 +08:00
Max Gekk	07c6976f79	[SPARK-36524][SQL] Common class for ANSI interval types ### What changes were proposed in this pull request? Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType` ### Why are the changes needed? To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance: ```scala case _: YearMonthIntervalType \| _: DayTimeIntervalType => false ``` by ```scala case _: AnsiIntervalType => false ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By existing test suites. Closes #33753 from MaxGekk/ansi-interval-type-trait. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `82a31508af`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-17 12:28:07 +03:00
Gengliang Wang	41e5144b53	[SPARK-36521][SQL] Disallow comparison between Interval and String ### What changes were proposed in this pull request? Disallow comparison between Interval and String in the default type coercion rules. ### Why are the changes needed? If a binary comparison contains interval type and string type, we can't decide which interval type the string should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc. ### Does this PR introduce _any_ user-facing change? No, the new interval type is not released yet. ### How was this patch tested? Existing UT Closes #33750 from gengliangwang/disallowCom. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `26d6b952dc`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-16 22:41:25 +03:00
Liang-Chi Hsieh	3aa933b162	[SPARK-36465][SS] Dynamic gap duration in session window ### What changes were proposed in this pull request? This patch supports dynamic gap duration in session window. ### Why are the changes needed? The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows. ### Does this PR introduce _any_ user-facing change? Yes, users can specify dynamic gap duration. ### How was this patch tested? Modified existing tests and new test. Closes #33691 from viirya/dynamic-session-window-gap. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit `8b8d91cf64`) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-08-16 11:06:16 +09:00
Huaxin Gao	ede1d1e9a7	[SPARK-34952][SQL][FOLLOWUP] Normalize pushed down aggregate col name and group by col name ### What changes were proposed in this pull request? Normalize pushed down aggregate col names and group by col names ... ### Why are the changes needed? to handle case sensitive col names ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modify existing test Closes #33739 from huaxingao/normalize. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `3f8ec0dae4`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-13 22:31:30 -07:00
Gengliang Wang	c898a940e2	[SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal ### What changes were proposed in this pull request? If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc. The related binary operation for Interval contains - Add - Subtract - Comparisions Note that `Interval Multiple/Divide StringLiteral` is valid as them is not binary operators(the left and right are not of the same type). This PR also add tests for them. ### Why are the changes needed? Avoid ambiguously implicit casting string literals to interval types. ### Does this PR introduce _any_ user-facing change? No, the ANSI type coercion is not released yet. ### How was this patch tested? New tests. Closes #33737 from gengliangwang/disallowStringAndInterval. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `ecdea91602`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-14 10:50:43 +08:00
Gengliang Wang	eaf92bea99	[SPARK-36499][SQL][TESTS] Test Interval multiply / divide null ### What changes were proposed in this pull request? Test the following valid operations: ``` year-month interval * null null * year-month interval year-month interval / null ``` and invalid operations: ``` null / interval int / interval ``` ### Why are the changes needed? Improve test coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #33729 from gengliangwang/addTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `eb6be7f1ee`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-13 11:06:09 +03:00
gengjiaan	eb840578f7	[SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type ### What changes were proposed in this pull request? With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly casted to DECIMAL(8,6) under ANSI mode. ``` org.apache.spark.sql.AnalysisException cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7 ``` We should update the function `make_timestamp` to allow integer type 'seconds' parameter. ### Why are the changes needed? Make `make_timestamp` could accepts integer as 'seconds' parameter. ### Does this PR introduce _any_ user-facing change? 'Yes'. `make_timestamp` could accepts integer as 'seconds' parameter. ### How was this patch tested? New tests. Closes #33665 from beliefer/SPARK-36428. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `7d82336734`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-13 13:13:15 +08:00
Maryann Xue	ca91292cf0	[SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances. The major changes include: 1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined. 2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference. 3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined. 4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse. A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces. This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results. No. Added UTs. Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs. Closes #33671 from maryannxue/spark-36447. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `29b1e394c6`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-13 11:48:46 +08:00
Gengliang Wang	f7017694e8	[SPARK-36497][SQL] Support Interval add/subtract NULL ### What changes were proposed in this pull request? Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is a unexpected behavior and the result should not be of null type. This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well. To change it, I propose to cast the null as the same data type of the other element in the add/subtract: ``` null + interval => cast(null as interval) + interval null - interval => cast(null as interval) - interval interval + null=> interval + cast(null as interval) interval - null => interval - cast(null as interval) ``` ### Why are the changes needed? Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval` ### Does this PR introduce _any_ user-facing change? No, the new interval type is not released yet. ### How was this patch tested? Existing UT Closes #33727 from gengliangwang/intervalTypeCoercion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `d4466d55ca`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-13 11:10:45 +08:00
Gengliang Wang	3785738838	[SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions ### What changes were proposed in this pull request? 1. Promote more string literal in subtractions. In the ANSI type coercion rule, we already promoted ``` string - timestamp => cast(string as timestamp) - timestamp ``` This PR is to promote the following string literals: ``` string - date => cast(string as date) - date date - string => date - cast(date as string) timestamp - string => timestamp ``` It is very straightforward to cast the string literal as the data type of the other side in the subtraction. 2. Merge the string promotion logic from the rule `StringLiteralCoercion`: ``` date_sub(date, string) => date_sub(date, cast(string as int)) date_add(date, string) => date_add(date, cast(string as int)) ``` ### Why are the changes needed? 1. Promote the string literal in the subtraction as the data type of the other side. This is straightforward and consistent with PostgreSQL 2. Certerize all the string literal promotion in the ANSI type coercion rule ### Does this PR introduce _any_ user-facing change? No, the new ANSI type coercion rules are not released yet. ### How was this patch tested? Existing UT Closes #33724 from gengliangwang/datetimeTypeCoercion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `48e333af54`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-13 01:02:54 +08:00
Wenchen Fan	89cc547afd	[SPARK-35881][SQL][FOLLOWUP] Remove the AQE post stage creation extension ### What changes were proposed in this pull request? This is a followup of #33140 It turns out that we may be able to complete the AQE and columnar execution integration without the AQE post stage creation extension. The rule `ApplyColumnarRulesAndInsertTransitions` can add to-columnar transition if the shuffle/broadcast supports columnar. ### Why are the changes needed? remove APIs that are not needed. ### Does this PR introduce _any_ user-facing change? No, the APIs are not released yet. ### How was this patch tested? existing and manual tests Closes #33701 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `124d011ee7`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-12 21:35:44 +08:00

1 2 3 4 5 ...

8395 commits