ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
PengLei	c2881c5ee2	[SPARK-36107][SQL] Refactor first set of 20 query execution errors to use error classes ### What changes were proposed in this pull request? Refactor some exceptions in QueryExecutionErrors to use error classes. as follows: ``` columnChangeUnsupportedError logicalHintOperatorNotRemovedDuringAnalysisError cannotEvaluateExpressionError cannotGenerateCodeForExpressionError cannotTerminateGeneratorError castingCauseOverflowError cannotChangeDecimalPrecisionError invalidInputSyntaxForNumericError cannotCastFromNullTypeError cannotCastError cannotParseDecimalError simpleStringWithNodeIdUnsupportedError evaluateUnevaluableAggregateUnsupportedError dataTypeUnsupportedError dataTypeUnsupportedError failedExecuteUserDefinedFunctionError divideByZeroError invalidArrayIndexError mapKeyNotExistError rowFromCSVParserNotExpectedError ``` ### Why are the changes needed? [SPARK-36107](https://issues.apache.org/jira/browse/SPARK-36107) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Testcase Closes #33538 from Peng-Lei/SPARK-36017. Lead-authored-by: PengLei <peng.8lei@gmail.com> Co-authored-by: Lei Peng <peng.8lei@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-20 10:34:19 +09:00
Angerszhuuuu	69e006dd53	[SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test ### What changes were proposed in this pull request? Add comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN and add Unit test for this ### Why are the changes needed? Add Unit test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34008 from AngersZhuuuu/SPARK-36740. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:42:08 +08:00
Liang-Chi Hsieh	cdd7ae937d	[SPARK-36673][SQL] Fix incorrect schema of nested types of union ### What changes were proposed in this pull request? This patch proposes to fix incorrect schema of `union`. ### Why are the changes needed? The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition. So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug of incorrect schema. ### How was this patch tested? Added unit test. Closes #34025 from viirya/SPARK-36673. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:37:19 +08:00
Wenchen Fan	651904a2ef	[SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions ### What changes were proposed in this pull request? The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key idea behind this rule is that the evaluation of project is relatively expensive, and that expression evaluation is cheap and that the expression duplication caused by this rule is not a problem. This last assumption is, unfortunately, not always true: - A user can invoke some expensive UDF, this now gets invoked more often than originally intended. - A projection is very cheap in whole stage code generation. The duplication caused by `CollapseProject` does more harm than good here. This PR addresses this problem, by only collapsing projects when it does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or when its evaluation does not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category). ### Why are the changes needed? We have seen multiple complains about `CollapseProject` in the past, due to it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 . ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new UT and existing test Closes #33958 from cloud-fan/collapse. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 21:34:21 +08:00
Angerszhuuuu	e356f6aa11	[SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan ### What changes were proposed in this pull request? For query ``` select array_distinct(array(cast('nan' as double), cast('nan' as double))) ``` This returns [NaN, NaN], but it should return [NaN]. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr fix this based on https://github.com/apache/spark/pull/33955 ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayDistinct won't show duplicated `NaN` value ### How was this patch tested? Added UT Closes #33993 from AngersZhuuuu/SPARK-36741. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 20:48:17 +08:00
Wenchen Fan	4145498826	[SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-17 16:49:54 +09:00
Wenchen Fan	dfd5237c0c	[SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project ### What changes were proposed in this pull request? `ScanOperation` collects adjacent Projects and Filters. The caller side always assume that the collected Filters should run before collected Projects, which means `ScanOperation` effectively pushes Filter through Project. Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule. ### Why are the changes needed? Fix a bug that violates the semantic of nondeterministic expressions. ### Does this PR introduce _any_ user-facing change? Most likely no change, but in some cases, this is a correctness bug fix which changes the query result. ### How was this patch tested? existing tests Closes #34023 from cloud-fan/scan. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-17 10:51:15 +08:00
Josh Rosen	3ae6e6775b	[SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite ### What changes were proposed in this pull request? This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`. The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`. In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary. ### Why are the changes needed? Previously, `SparkSubmitSuite` tests would fail with messages like: ``` [info] - launch simple application with spark-submit * FAILED * (1 second, 832 milliseconds) [info] Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command. After this change, those tests will fail with detailed error messages that include the text of failed command plus timestamped logs captured from the failed proces: ``` [info] - launch simple application with spark-submit * FAILED * (2 seconds, 800 milliseconds) [info] spark-submit returned with exit code 101. [info] Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar' [info] [info] 2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. [info] 2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] [info] 2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually ran the affected test suites. Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2021-09-16 14:28:47 -07:00
Huaxin Gao	fb11c466ae	[SPARK-36587][SQL][FOLLOWUP] Remove unused CreateNamespaceStatement ### What changes were proposed in this pull request? remove `CreateNamespaceStatement` ### Why are the changes needed? remove unused code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #34015 from huaxingao/rm_create_ns_stmt. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-16 19:56:45 +08:00
Yannis Sismanis	afd406e4d0	[SPARK-36745][SQL] ExtractEquiJoinKeys should return the original predicates on join keys ### What changes were proposed in this pull request? This PR updates `ExtractEquiJoinKeys` to return an extra field for the join condition with join keys. ### Why are the changes needed? Sometimes we need to restore the original join condition. Before this PR, we need to build `EqualTo` expressions with the join keys, which is not always the original join condition. E.g. `EqualNullSafe(a, b)` will become `EqualTo(Coalesce(a, lit), Coalesce(b, lit))`. After this PR, we can simply use the new returned field. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #33985 from YannisSismanis/SPARK-36475-fix. Authored-by: Yannis Sismanis <yannis.sismanis@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-16 13:16:16 +08:00
Angerszhuuuu	b665782f0d	[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? For query ``` select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double))) ``` This returns [false], but it should return [true]. This issue is caused by `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`. ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? arrays_overlap won't handle equal `NaN` value ### How was this patch tested? Added UT Closes #34006 from AngersZhuuuu/SPARK-36755. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:31:46 +08:00
Angerszhuuuu	638085953f	[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exiting UT Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-15 22:04:09 +08:00
Kousuke Saruta	e43b9e8520	[SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields ### What changes were proposed in this pull request? This PR fixes a perf issue in `SchemaPruning` when a struct has many fields (e.g. >10K fields). The root cause is `SchemaPruning.sortLeftFieldsByRight` does N * M order searching. ``` val filteredRightFieldNames = rightStruct.fieldNames .filter(name => leftStruct.fieldNames.exists(resolver(_, name))) ``` To fix this issue, this PR proposes to use `HashMap` to expect a constant order searching. This PR also adds `case _ if left == right => left` to the method as a short-circuit code. ### Why are the changes needed? To fix a perf issue. ### Does this PR introduce _any_ user-facing change? No. The logic should be identical. ### How was this patch tested? I confirmed that the following micro benchmark finishes within a few seconds. ``` import org.apache.spark.sql.catalyst.expressions.SchemaPruning import org.apache.spark.sql.types._ var struct1 = new StructType() (1 to 50000).foreach { i => struct1 = struct1.add(new StructField(i + "", IntegerType)) } var struct2 = new StructType() (50001 to 100000).foreach { i => struct2 = struct2.add(new StructField(i + "", IntegerType)) } SchemaPruning.sortLeftFieldsByRight(struct1, struct2) SchemaPruning.sortLeftFieldsByRight(struct2, struct2) ``` The correctness should be checked by existing tests. Closes #33981 from sarutak/improve-schemapruning-performance. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-15 10:33:58 +09:00
Angerszhuuuu	f71f37755d	[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan ### What changes were proposed in this pull request? For query ``` select array_union(array(cast('nan' as double), cast('nan' as double)), array()) ``` This returns [NaN, NaN], but it should return [NaN]. This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too. In this pr we add a wrap for OpenHashSet that can handle `null`, `Double.NaN`, `Float.NaN` together ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? ArrayUnion won't show duplicated `NaN` value ### How was this patch tested? Added UT Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-14 18:25:47 +08:00
Fu Chen	52c5ff20ca	[SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf ### What changes were proposed in this pull request? Fix InferFiltersFromGenerate bug, InferFiltersFromGenerate should not infer filter for generate when the children contain an expression which is instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`. Before this pr, the following case will throw an exception. ```scala spark.udf.register("vec", (i: Int) => (0 until i).toArray) sql("select explode(vec(8)) as c1").show ``` ``` Once strategy's idempotence is broken for batch Infer Filters GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12] +- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3] +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation java.lang.RuntimeException: Once strategy's idempotence is broken for batch Infer Filters GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12] +- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3] +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) ! +- OneRowRelation at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200) at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179) at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130) at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166) at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163) at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214) at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259) at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731) at org.apache.spark.sql.Dataset.head(Dataset.scala:2755) at org.apache.spark.sql.Dataset.take(Dataset.scala:2962) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at org.apache.spark.sql.Dataset.show(Dataset.scala:807) ``` ### Does this PR introduce _any_ user-facing change? No, only bug fix. ### How was this patch tested? Unit test. Closes #33956 from cfmcgrady/SPARK-36715. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-14 09:26:11 +09:00
Lukas Rytz	1a62e6a2c1	[SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile) As [reported on `devspark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom. ### What changes were proposed in this pull request? This PR suggests to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the the `change-scala-version.sh` script. I included an upgrade to scala-parallel-collections version 1.0.3, the changes compared to 0.2.0 are minor. - removed OSGi metadata - renamed some internal inner classes - added `Automatic-Module-Name` ### Why are the changes needed? According to the posts, this solves issues for developers that write unit tests for their applications. Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Locally Closes #33948 from lrytz/parCollDep. Authored-by: Lukas Rytz <lukas.rytz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-09-13 11:06:50 -05:00
Max Gekk	bd62ad9982	[SPARK-36736][SQL] Support ILIKE (ALL \| ANY \| SOME) - case insensitive LIKE ### What changes were proposed in this pull request? In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL \| ANY \| SOME)` expression - `ILIKE`. In this way, Spark's users can match strings to single pattern in the case-insensitive manner. For example: ```sql spark-sql> create table ilike_example(subject varchar(20)); spark-sql> insert into ilike_example values > ('jane doe'), > ('Jane Doe'), > ('JANE DOE'), > ('John Doe'), > ('John Smith'); spark-sql> select * > from ilike_example > where subject ilike any ('jane%', '%SMITH') > order by subject; JANE DOE Jane Doe John Smith jane doe ``` The syntax of `ILIKE` is similar to `LIKE`: ``` str NOT? ILIKE (ANY \| SOME \| ALL) (pattern+) ``` ### Why are the changes needed? 1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses. 2. To make migration from other popular DMBSs to Spark SQL easier. DBMSs below support `ilike` in SQL: - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike) - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html) - [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html) ### Does this PR introduce _any_ user-facing change? No, it doesn't. The PR extends existing APIs. ### How was this patch tested? 1. By running of expression examples via: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite" ``` 2. Added new test to test parsing of `ILIKE`: ``` $ build/sbt "test:testOnly *.ExpressionParserSuite" ``` 3. Via existing test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql" ``` Closes #33966 from MaxGekk/ilike-any. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-13 22:51:49 +08:00
Kousuke Saruta	e858cd568a	[SPARK-36724][SQL] Support timestamp_ntz as a type of time column for SessionWindow ### What changes were proposed in this pull request? This PR proposes to support `timestamp_ntz` as a type of time column for `SessionWIndow` like `TimeWindow` does. ### Why are the changes needed? For better usability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33965 from sarutak/session-window-ntz. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-09-13 21:47:43 +08:00
Yuto Akutsu	3747cfdb40	[SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API ### What changes were proposed in this pull request? Fixed wrong documentation on Cot API ### Why are the changes needed? [Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual check. Closes #33978 from yutoacts/SPARK-36738. Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-13 21:51:29 +09:00
ulysses-you	4a6b2b9fc8	[SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle ### What changes were proposed in this pull request? - move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase. - run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase. - add `SkewJoinAwareCost` to support estimate skewed join cost - add new config to decide if force optimize skewed join - in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with lower cost. ### Why are the changes needed? In general, skewed join has more impact on performance than once more shuffle. It makes sense to force optimize skewed join even if introduce extra shuffle. A common case: ``` HashAggregate SortMergJoin Sort Exchange Sort Exchange ``` and after this PR, the plan looks like: ``` HashAggregate Exchange SortMergJoin (isSkew=true) Sort Exchange Sort Exchange ``` Note that, the new introduced shuffle also can be optimized by AQE. ### Does this PR introduce _any_ user-facing change? Yes, a new config. ### How was this patch tested? * Add new test * pass exists test `SPARK-30524: Do not optimize skew join if introduce additional shuffle` * pass exists test `SPARK-33551: Do not use custom shuffle reader for repartition` Closes #32816 from ulysses-you/support-extra-shuffle. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-13 17:21:27 +08:00
Huaxin Gao	1f679ed8e9	[SPARK-36556][SQL] Add DSV2 filters Co-Authored-By: DB Tsai d_tsaiapple.com Co-Authored-By: Huaxin Gao huaxin_gaoapple.com ### What changes were proposed in this pull request? Add DSV2 Filters and use these in V2 codepath. ### Why are the changes needed? The motivation of adding DSV2 filters: 1. The values in V1 filters are Scala types. When translating catalyst `Expression` to V1 filers, we have to call `convertToScala` to convert from Catalyst types used internally in rows to standard Scala types, and later convert Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversion from Catalyst types to Scala types and Scala types back to Catalyst types are avoided. 2. Improve nested column filter support. 3. Make the filters work better with the rest of the DSV2 APIs. ### Does this PR introduce _any_ user-facing change? Yes. The new V2 filters ### How was this patch tested? new test Closes #33803 from huaxingao/filter. Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com> Co-authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-11 10:12:21 -07:00
dgd-contributor	711577e238	[SPARK-36687][SQL][CORE] Rename error classes with _ERROR suffix ### What changes were proposed in this pull request? redundant _ERROR suffix in error-classes.json ### Why are the changes needed? Clean up error classes to reduce clutter ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #33944 from dgd-contributor/SPARK-36687. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-10 10:00:28 +09:00
Max Gekk	b74a1ba69f	[SPARK-36674][SQL] Support ILIKE - case insensitive LIKE ### What changes were proposed in this pull request? In the PR, I propose to support a case-insensitive variant of the `like` expression - `ilike`. In this way, Spark's users can match strings to single pattern in the case-insensitive manner. For example: ```sql spark-sql> create table ilike_ex(subject varchar(20)); spark-sql> insert into ilike_ex values > ('John Dddoe'), > ('Joe Doe'), > ('John_down'), > ('Joe down'), > (null); spark-sql> select * > from ilike_ex > where subject ilike '%j%h%do%' > order by 1; John Dddoe John_down ``` The syntax of `ilike` is similar to `like`: ``` str ILIKE pattern[ ESCAPE escape] ``` #### Implementation details `ilike` is implemented as a runtime replaceable expression to `Like(Lower(left), Lower(right), escapeChar)`. Such replacement is acceptable because `ilike`/`like` recognise only `_` and `%` as special characters but not special character classes. Note: The PR aims to support `ilike` in SQL only. Others APIs can be updated separately on demand. ### Why are the changes needed? 1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses. 2. To make migration from other popular DMBSs to Spark SQL easier. DBMSs below support `ilike` in SQL: - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike) - [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html) - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html) - [ClickHouse](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#ilike) - [Vertica](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm) - [Impala](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_operators.html#ilike) ### Does this PR introduce _any_ user-facing change? No, it doesn't. The PR extends existing APIs. ### How was this patch tested? 1. By running of expression examples via: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite" ``` 2. Added new test: ``` $ build/sbt "test:testOnly .RegexpExpressionsSuite" ``` 3. Via existing test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql" $ build/sbt "test:testOnly SQLKeywordSuite" $ build/sbt "sql/testOnly *ExpressionsSchemaSuite" ``` Closes #33919 from MaxGekk/ilike-single-pattern. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-09 11:55:20 +08:00
Andrew Liu	9b633f2075	[SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe ### What changes were proposed in this pull request? fix SimplifyConditionalsInPredicate to be null-safe Reproducible: ``` import org.apache.spark.sql.types.{StructField, BooleanType, StructType} import org.apache.spark.sql.Row val schema = List( StructField("b", BooleanType, true) ) val data = Seq( Row(true), Row(false), Row(null) ) val df = spark.createDataFrame( spark.sparkContext.parallelize(data), StructType(schema) ) // cartesian product of true / false / null val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal")) df2.createOrReplaceTempView("df2") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // actual: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // +-----+--------+ spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate") spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show() // expected: // +-----+--------+ // \| cond\|falseVal\| // +-----+--------+ // \|false\| true\| // \| null\| true\| // +-----+--------+ ``` ### Why are the changes needed? is a regression that leads to incorrect results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate. Authored-by: Andrew Liu <andrewlliu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-09 11:32:40 +08:00
Hyukjin Kwon	34f80ef313	[SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark ### What changes were proposed in this pull request? This PR adds: - the support of `TimestampNTZType` in pandas API on Spark. - the support of Py4J handling of `spark.sql.timestampType` configuration ### Why are the changes needed? To complete `TimestampNTZ` support. In more details: - ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). This naive `datetime` interpretation is up to the program to decide how to interpret, e.g.) whether a local time vs UTC time as an example. Although some Python built-in APIs assume they are local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp): > Because naive datetime objects are treated by many datetime methods as local times ... semantically it is legitimate to assume: - that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone). - if you want to handle them as if a local timezone, this interpretation is matched to `TimestamType` (local time) - ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278): - `Timestamp(..., timezone=sparkLocalTimezone)` -> `TimestamType` - `Timestamp(..., timezone=null)` -> `TimestampNTZType` - (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time). One example is that pandas allows to convert these naive `datetime` as if they are in UTC by default: ```python >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int") 0 0 ``` whereas in Spark: ```python >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show() +------+ \| _1\| +------+ \|-32400\| +------+ >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show() +---+ \| _1\| +---+ \| 0\| +---+ ``` In contrast, some APIs like `pandas.fromtimestamp` assume they are local times: ```python >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0]) Timestamp('1970-01-01 09:00:00') ``` For native Python, users can decide how to interpret native `datetime` so it's fine. The problem is that pandas API on Spark case would require to have two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work. As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work. ### Does this PR introduce _any_ user-facing change? Yes, now pandas API on Spark can handle `TimestampNTZType`. ```python import datetime spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark() ``` ``` dt 0 2021-08-31 19:58:55.024410 ``` This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration: ```python >>> lit(datetime.datetime.now()) Column<'TIMESTAMP '2021-09-03 19:34:03.949998''> ``` ```python >>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ") >>> lit(datetime.datetime.now()) Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''> ``` ### How was this patch tested? Unittests were added. Closes #33877 from HyukjinKwon/SPARK-36625. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-09 09:57:38 +09:00
Huaxin Gao	23794fb303	[SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference ### What changes were proposed in this pull request? Currently, we have `FieldReference` for aggregate column type, should be `NamedReference` instead ### Why are the changes needed? `FieldReference` is a private class, should use `NamedReference` instead ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #33927 from huaxingao/agg_followup. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-08 14:05:44 +08:00
Venkata Sai Akhil Gudesa	2ed6e7bc5d	[SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections ### What changes were proposed in this pull request? This PR filters out `ExtractValues`s that contains any aggregation function in the `NestedColumnAliasing` rule to prevent cases where aggregations are pushed down into projections. ### Why are the changes needed? To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception. Consider the following schema: ``` root \|-- a: struct (nullable = true) \| \|-- c: struct (nullable = true) \| \| \|-- e: string (nullable = true) \| \|-- d: integer (nullable = true) \|-- b: string (nullable = true) ``` and the query: `SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b` Executing the query before this PR will result in the error: ``` java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true]) at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99) ... ``` The optimised plan before this PR is: ``` 'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3] +- 'Project [max(a#0).c.e AS _extract_e#5, b#1] +- Relation default.test_aggregates[a#0,b#1] parquet ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier. The produced optimized plan is checked for equivalency with a plan of the form: ``` Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456] +- LocalRelation <empty>, [a#451, b#452] ``` Closes #33921 from vicennial/spark-36677. Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-07 18:15:48 -07:00
yangjie01	bdb73bbc27	[SPARK-36613][SQL][SS] Use EnumSet as the implementation of Table.capabilities method return value ### What changes were proposed in this pull request? The `Table.capabilities` method return a `java.util.Set` of `TableCapability` enumeration type, which is implemented using `java.util.HashSet` now. Such Set can be replaced `with java.util.EnumSet` because `EnumSet` implementations can be much more efficient compared to other sets. ### Why are the changes needed? Use more appropriate data structures. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA or Jenkins Tests. - Add a new benchmark to compare `create` and `contains` operation between `EnumSet` and `HashSet` Closes #33867 from LuciferYang/SPARK-36613. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-09-05 08:23:05 -05:00
yangjie01	35848385ae	[SPARK-36602][COER][SQL] Clean up redundant asInstanceOf casts ### What changes were proposed in this pull request? The change of this pr is remove redundant asInstanceOf casts in Spark code. ### Why are the changes needed? Code simplification ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA or Jenkins Tests. Closes #33852 from LuciferYang/cleanup-asInstanceof. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-09-05 08:22:28 -05:00
Senthil Kumar	6bd491ecb8	[SPARK-36643][SQL] Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set ### What changes were proposed in this pull request? This PR adds additional information to ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set ### Why are the changes needed? Right now, by default sql.legacy.setCommandRejectsSparkCoreConfs is set as true in Spark 3.* versions int order to avoid changing Spark Confs. But from the error message we get confused if we can not modify/change Spark conf in Spark 3.* or not. ### Does this PR introduce any user-facing change? Yes. Trivial change in the error messages is included ### How was this patch tested? New Test added - SPARK-36643: Show migration guide when attempting SparkConf Closes #33894 from senthh/1st_Sept_2021. Lead-authored-by: Senthil Kumar <senthh@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-03 23:49:45 -07:00
Kousuke Saruta	cf3bc65e69	[SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0 ### What changes were proposed in this pull request? This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`. This is an example. ``` SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month); 21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)] java.lang.ArrayIndexOutOfBoundsException: 1 ``` Actually, this example succeeded before SPARK-31980 (#28819) was merged. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #33895 from sarutak/fix-sequence-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-09-03 23:25:18 +09:00
Kent Yao	7f1ad7be18	[SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config ### What changes were proposed in this pull request? Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config ### Why are the changes needed? spark.sql.execution.topKSortFallbackThreshold now is an internal config hidden from users Integer.MAX_VALUE - 15 as its default. In many real-world cases, if the K is very big, there would be performance issues. It's better to leave this choice to users ### Does this PR introduce _any_ user-facing change? spark.sql.execution.topKSortFallbackThreshold is now user-facing ### How was this patch tested? passing GA Closes #33904 from yaooqinn/SPARK-36659. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>	2021-09-03 19:11:37 +08:00
Cheng Su	9054a6ac00	[SPARK-36652][SQL] AQE dynamic join selection should not apply to non-equi join ### What changes were proposed in this pull request? Currently `DynamicJoinSelection` has two features: 1.demote broadcast hash join, and 2.promote shuffled hash join. Both are achieved by adding join hint in query plan, and only works for equi join. However [the rule is matching with `Join` operator now](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala#L71), so it would add hint for non-equi join by mistake (See added test query in `JoinHintSuite.scala` for an example). This PR is to fix `DynamicJoinSelection` to only apply to equi-join, and improve `checkHintNonEquiJoin` to check we should not add `PREFER_SHUFFLE_HASH` for non-equi join. ### Why are the changes needed? Improve the logic of codebase to be better. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `JoinHintSuite.scala`. Closes #33899 from c21/aqe-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-02 20:04:49 -07:00
Huaxin Gao	38b6fbd9b8	[SPARK-36351][SQL] Refactor filter push down in file source v2 ### What changes were proposed in this pull request? Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to file source, and then pass all the filters (partition filters + data filters) as post scan filters to v2 Scan, and later in `PruneFileSourcePartitions`, we separate partition filters and data filters, set them in the format of `Expression` to file source. Changes in this PR: When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about partition column , we want to separate partition filter and data filter there. The benefit of doing this: - we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain. - we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter. - By separating the filters early at `V2ScanRelationPushDown`, we only needs to check data filters to find out which one needs to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to file source, right now we are checking all the filters (both partition filters and data filters) - Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters. In order to do this, we will have the following changes - add `pushFilters` in file source v2. In this method: - push both Expression partition filter and Expression data filter to file source. Have to use Expression filters because we need these for partition pruning. - data filters are used for filter push down. If file source needs to push down data filters, it translates the data filters from `Expression` to `Sources.Filer`, and then decides which filters to push down. - partition filters are used for partition pruning. - file source v2 no need to implement `SupportsPushdownFilters` any more, because when we separating the two types of filters, we have already set them on file data sources. It's redundant to use `SupportsPushdownFilters` to set the filters again on file data sources. ### Why are the changes needed? see section one ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #33650 from huaxingao/partition_filter. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-09-02 19:11:43 -07:00
Angerszhuuuu	568ad6aa44	[SPARK-36637][SQL] Provide proper error message when use undefined window frame ### What changes were proposed in this pull request? Two case of using undefined window frame as below should provide proper error message 1. For case using undefined window frame with window function ``` SELECT nth_value(employee_name, 2) OVER w second_highest_salary FROM basic_pays; ``` origin error message is ``` Window function nth_value(employee_name#x, 2, false) requires an OVER clause. ``` It's confused that in use use a window frame `w` but it's not defined. Now the error message is ``` Window specification w is not defined in the WINDOW clause. ``` 2. For case using undefined window frame with aggregation function ``` SELECT SUM(salary) OVER w sum_salary FROM basic_pays; ``` origin error message is ``` Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34] +- SubqueryAlias spark_catalog.default.basic_pays +- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []] ``` In this case, when convert GlobalAggregate, should skip UnresolvedWindowExpression Now the error message is ``` Window specification w is not defined in the WINDOW clause. ``` ### Why are the changes needed? Provide proper error message ### Does this PR introduce _any_ user-facing change? Yes, error messages are improved as described in desc ### How was this patch tested? Added UT Closes #33892 from AngersZhuuuu/SPARK-36637. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-02 22:32:31 +08:00
Kazuyuki Tanimura	799a0116a8	[SPARK-36607][SQL] Support BooleanType in UnwrapCastInBinaryComparison ### What changes were proposed in this pull request? This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer that is currently supports `NumericType` only. The main idea is to treat `BooleanType` as 1 bit integer so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`. This work is an extension of SPARK-24994 and SPARK-32858 ### Why are the changes needed? Current implementation of Spark without this PR cannot properly optimize the filter for the following case ``` SELECT * FROM t WHERE boolean_field = 2 ``` The above query creates a filter of `cast(boolean_field, int) = 2`. The casting prevents from pushing down the filter. In contrast, this PR creates a `false` filter and returns early as there cannot be such a matching rows anyway (empty results.) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passed existing tests ``` build/sbt "catalyst/test" build/sbt "sql/test" ``` Added unit tests ``` build/sbt "catalyst/testOnly UnwrapCastInBinaryComparisonSuite -- -z SPARK-36607" build/sbt "sql/testOnly UnwrapCastInComparisonEndToEndSuite -- -z SPARK-36607" ``` Closes #33865 from kazuyukitanimura/SPARK-36607. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-09-01 14:27:30 +08:00
Bo Zhang	e33cdfb317	[SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches ### What changes were proposed in this pull request? This change creates a new type of Trigger: Trigger.AvailableNow for streaming queries. It is like Trigger.Once, which process all available data then stop the query, but with better scalability since data can be processed in multiple batches instead of one. To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called. This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`. For other sources that does not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow, by overriding their `latestOffset` method to always return the latest offset at the beginning of the query. ### Why are the changes needed? Currently streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or Spark driver will run out of memory. ### Does this PR introduce _any_ user-facing change? Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change. ### How was this patch tested? Added unit tests. Closes #33763 from bozhang2820/new-trigger. Authored-by: Bo Zhang <bo.zhang@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-09-01 15:02:21 +09:00
Hyukjin Kwon	4ed2dab5ee	[SPARK-36608][SQL] Support TimestampNTZ in Arrow ### What changes were proposed in this pull request? This PR proposes to add the support of `TimestampNTZType` in Arrow APIs. Now, Arrow can write `TimestampNTZType` as Timestamp with `null` timezone in Arrow. ### Why are the changes needed? To complete the support of `TimestampNTZType` in Apache Spark. ### Does this PR introduce _any_ user-facing change? Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType` ### How was this patch tested? Unittests were added. Closes #33875 from HyukjinKwon/SPARK-36608-arrow. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-01 10:23:42 +09:00
Gengliang Wang	8a52ad9f82	[SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide ### What changes were proposed in this pull request? The `try_add` function allows the following inputs: - number, number - date, number - date, interval - timestamp, interval - interval, interval And, the `try_divide` function allows the following inputs: - number, number - interval, number However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too. ### Why are the changes needed? Improve documentation and tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Also build docs for preview: ![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png) Closes #33861 from gengliangwang/enhanceTryDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-29 10:30:04 +09:00
Gengliang Wang	e650d06ba9	[SPARK-36597][DOCS] Fix issues in SQL function docs ### What changes were proposed in this pull request? * the functions make_dt_interval and make_ym_interval should make it clear that some of the fields are optional * remove the `\|` symbol from the doc of `bit_get` https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#bit_get * Address one missing comment in https://github.com/apache/spark/pull/33824#discussion_r695405699 ### Why are the changes needed? Improve the documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build doc and preview: ![image](https://user-images.githubusercontent.com/1097932/130996918-8c1fff88-ef5a-434b-8445-df7140bad3ba.png) ![image](https://user-images.githubusercontent.com/1097932/130996954-0ced28e7-fb90-4fcc-857e-6ccc31dc3c09.png) ![image](https://user-images.githubusercontent.com/1097932/130955106-5ae32dfc-6e89-4e28-bb8a-6c1b5213051c.png) ![image](https://user-images.githubusercontent.com/1097932/130922351-2f0f262d-5624-4d08-ba83-dfa3ed0b646b.png) Closes #33847 from gengliangwang/auditSQLDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-27 13:29:34 +08:00
Jungtaek Lim	bc32144a91	[SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc ### What changes were proposed in this pull request? This PR proposes to document `window` & `session_window` function in SQL API doc page. Screenshot of functions: > window ![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png) > session_window ![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png) ### Why are the changes needed? Description is missing in both `window` / `session_window` functions for SQL API page. ### Does this PR introduce _any_ user-facing change? Yes, the description of `window` / `session_window` functions will be available in SQL API page. ### How was this patch tested? Only doc changes. Closes #33846 from HeartSaVioR/SPARK-36595. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-08-27 12:39:09 +09:00
Wenchen Fan	72d6d64835	[SPARK-36587][SQL] Migrate CreateNamespaceStatement to v2 command framework ### What changes were proposed in this pull request? This PR migrates CreateNamespaceStatement to the v2 command framework. Two new logical plans `UnresolvedObjectName` and `ResolvedObjectName` are introduced to support these CreateXXXStatements. ### Why are the changes needed? Avoid duplicated code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33835 from cloud-fan/ddl. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-26 20:36:04 +08:00
Gengliang Wang	1a42aa5bd4	[SPARK-36457][DOCS] Review and fix issues in Scala/Java API docs ### What changes were proposed in this pull request? Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues: - Add missing `Since` annotation for new APIs - Remove the leaking class/object in API doc ### Why are the changes needed? Improve API docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #33824 from gengliangwang/auditDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-26 12:59:18 +08:00
Pablo Langa	14622fcec8	[SPARK-36488][SQL] Improve error message with quotedRegexColumnNames ### What changes were proposed in this pull request? When `spark.sql.parser.quotedRegexColumnNames=true` and a pattern is used in a place where is not allowed the message is a little bit confusing ``` scala> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)") org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'divide' ``` This PR attempts to improve the error message ``` scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)") org.apache.spark.sql.AnalysisException: Invalid usage of regular expression in expression 'divide' ``` ### Why are the changes needed? To clarify the error message with this option active ### Does this PR introduce _any_ user-facing change? Yes, change the error message ### How was this patch tested? Unit testing and manual testing Closes #33802 from planga82/feature/spark36488_improve_error_message. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-26 11:33:40 +08:00
Max Gekk	159ff9fd14	[SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone ### What changes were proposed in this pull request? In the PR, I propose to use the session time zone ( see the SQL config `spark.sql.session.timeZone`) instead of JVM default time zone while converting of special timestamp_ntz strings such as "today", "tomorrow" and so on. ### Why are the changes needed? Current implementation is based on the system time zone, and it controverses to other functions/classes that use the session time zone. For example, Spark doesn't respects user's settings: ```sql $ export TZ="Europe/Amsterdam" $ ./bin/spark-sql -S spark-sql> select timestamp_ntz'now'; 2021-08-25 18:12:36.233 spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> select timestamp_ntz'now'; 2021-08-25 18:14:40.547 ``` ### Does this PR introduce _any_ user-facing change? Yes. For the example above, after the changes: ```sql spark-sql> select timestamp_ntz'now'; 2021-08-25 18:47:46.832 spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> select timestamp_ntz'now'; 2021-08-25 09:48:05.211 ``` ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *DateTimeUtilsSuite" ``` Closes #33838 from MaxGekk/fix-ts_ntz-special-values. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-26 10:09:18 +08:00
Gengliang Wang	18143fb426	[SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry ### What changes were proposed in this pull request? Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation. the since versions of both functions are `1.0.0` since they are based on the expression `RLike`: - https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp - https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like This PR is to: * Support setting `since` version in FunctionRegistry * Correct the `since` version of `regexp` and `regexp_like` ### Why are the changes needed? Correct the SQL doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run ``` sh sql/create-docs.sh ``` and check the SQL doc manually Closes #33834 from gengliangwang/allowSQLFunVersion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-25 22:32:20 +08:00
Max Gekk	df0ec56723	[SPARK-36567][SQL] Support foldable special datetime strings by `CAST` ### What changes were proposed in this pull request? In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable. ### Why are the changes needed? 1. To avoid a breaking change. 2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance, at Spark 3.1: ```sql select ts_col > 'now'; ``` but the query fails at the moment, and users have to use typed timestamp literal: ```sql select ts_col > timestamp'now'; ``` ### Does this PR introduce _any_ user-facing change? No. Previous release 3.1 has supported the feature already till it was removed by https://github.com/apache/spark/pull/32714. ### How was this patch tested? 1. Manually tested via the sql command line: ```sql spark-sql> select cast('today' as date); 2021-08-24 spark-sql> select timestamp('today'); 2021-08-24 00:00:00 spark-sql> select timestamp'tomorrow' > 'today'; true ``` 2. By running new test suite: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite" ``` Closes #33816 from MaxGekk/foldable-datetime-special-values. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-25 14:08:59 +08:00
Hyukjin Kwon	93cec49212	[SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization ### What changes were proposed in this pull request? This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning. ```python import pyspark.pandas as ps ps.set_option('compute.default_index_type', 'distributed-sequence') ps.range(10).id.value_counts().to_frame().spark.explain() ``` Before: ```bash == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Sort [count#51L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70] +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L]) +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67] +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L]) +- Project [id#37L] +- Filter atleastnnonnulls(1, id#37L) +- Scan ExistingRDD[__index_level_0__#36L,id#37L] # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed) ``` After: ```bash == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Sort [count#275L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174] +- HashAggregate(keys=[id#258L], functions=[count(1)]) +- HashAggregate(keys=[id#258L], functions=[partial_count(1)]) +- Filter atleastnnonnulls(1, id#258L) +- Range (0, 10, step=1, splits=16) # ^^^ Removed the Spark job execution for `zipWithIndex` ``` ### Why are the changes needed? To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`. Closes #33807 from HyukjinKwon/SPARK-36559. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-25 10:02:53 +09:00
PengLei	3e32ea17db	[SPARK-36336][SQL] Add new exception of base exception used in QueryExecutionErrors ### What changes were proposed in this pull request? When we refactor the query execution errors to use error classes in QueryExecutionErrors, we need define some exception that mix SparkThrowable into a base Exception type. according the example [SparkArithmeticException](`f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75)`) Add SparkXXXException as follows: - `SparkClassNotFoundException` - `SparkConcurrentModificationException` - `SparkDateTimeException` - `SparkFileAlreadyExistsException` - `SparkFileNotFoundException` - `SparkNoSuchMethodException` - `SparkIndexOutOfBoundsException` - `SparkIOException` - `SparkSecurityException` - `SparkSQLException` - `SparkSQLFeatureNotSupportedException` Refactor some exceptions in QueryExecutionErrors to use error classes and new exception for testing new exception Some added by [PR](https://github.com/apache/spark/pull/33538) as follows: - `SparkUnsupportedOperationException` - `SparkIllegalStateException` - `SparkNumberFormatException` - `SparkIllegalArgumentException` - `SparkArrayIndexOutOfBoundsException` - `SparkNoSuchElementException` ### Why are the changes needed? [SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existed ut test Closes #33573 from Peng-Lei/SPARK-36336. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-25 09:39:28 +09:00
Gengliang Wang	5b4c216478	[SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package. ### Why are the changes needed? There are two packages for `org.apache.spark.sql.connector` SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst. ### Does this PR introduce _any_ user-facing change? No, the trait is not released yet. ### How was this patch tested? Existing UT. Closes #33826 from gengliangwang/moveLocalScan. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-24 13:23:50 -07:00

1 2 3 4 5 ...

5722 commits