Commit graph

5733 commits

Author SHA1 Message Date
Max Gekk c77c0d41e1 [SPARK-36825][SQL] Read/write dataframes with ANSI intervals from/to parquet files
### What changes were proposed in this pull request?
Allow saving and loading of ANSI intervals - `YearMonthIntervalType` and `DayTimeIntervalType` to/from the Parquet datasource. After the changes, Spark saves ANSI intervals as primitive physical Parquet types:
- year-month intervals as `INT32`
- day-time intervals as `INT64`

without any modifications. To load the values back as intervals, Spark puts the info about the interval types into the extra key `org.apache.spark.sql.parquet.row.metadata`:
```
$ java -jar parquet-tools-1.12.0.jar meta ./part-...-c000.snappy.parquet

creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
extra:       org.apache.spark.version = 3.3.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[...,{"name":"i","type":"interval year to month","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
...
i:           REQUIRED INT32 R:0 D:0
```

**Note:** This PR focuses on support for ANSI intervals in the Parquet datasource, written or read as a column in a `Dataset`.

### Why are the changes needed?
To improve user experience with Spark SQL. At the moment, users can make ANSI intervals "inside" Spark or parallelize Java collections of `Period`/`Duration` objects but cannot save the intervals to any built-in datasources. After the changes, users can save datasets/dataframes with year-month/day-time intervals to load them back later by Apache Spark.

For example:
```scala
scala> sql("select date'today' - date'2021-01-01' as diff").write.parquet("/Users/maximgekk/tmp/parquet_interval")

scala> val readback = spark.read.parquet("/Users/maximgekk/tmp/parquet_interval")
readback: org.apache.spark.sql.DataFrame = [diff: interval day]

scala> readback.printSchema
root
 |-- diff: interval day (nullable = true)

scala> readback.show
+------------------+
|              diff|
+------------------+
|INTERVAL '264' DAY|
+------------------+
```

### Does this PR introduce _any_ user-facing change?
In some sense, yes. Before the changes, users got an error while saving ANSI intervals as dataframe columns to parquet files; after the changes, the operation completes successfully.

### How was this patch tested?
1. By running the existing test suites:
```
$ build/sbt "test:testOnly *ParquetFileFormatV2Suite"
$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```
2. Added new tests:
```
$ build/sbt "sql/test:testOnly *ParquetIOSuite"
$ build/sbt "sql/test:testOnly *ParquetSchemaSuite"
```

Closes #34057 from MaxGekk/ansi-interval-save-parquet.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-24 09:55:11 +03:00
Kousuke Saruta 0b65daad2a [SPARK-36760][DOCS][FOLLOWUP] Fix wrong JavaDoc style
### What changes were proposed in this pull request?

This PR fixes a JavaDoc style error introduced in SPARK-36760 (#34073).
Due to this error, the build on GA fails and the following error message appears.
```
[error]  * internal -> external data conversion.
```

### Why are the changes needed?

To recover GA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GA itself.

Closes #34078 from sarutak/fix-doc-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-24 02:18:53 +09:00
Huaxin Gao 6e815da8ea [SPARK-36760][SQL][FOLLOWUP] Add interface SupportsPushDownV2Filters
### What changes were proposed in this pull request?
Update the Javadoc.

### Why are the changes needed?
To highlight the difference between the new interface `SupportsPushDownV2Filters` and the old one, `SupportsPushDownFilters`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test not needed

Closes #34073 from huaxingao/followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-23 23:48:02 +08:00
Hyukjin Kwon 0076eba8d0 [MINOR][SQL][DOCS] Correct the 'options' description on UnresolvedRelation
### What changes were proposed in this pull request?

This PR fixes the 'options' description on `UnresolvedRelation`. The comment was added in https://github.com/apache/spark/pull/29535 but is no longer valid because V1 also uses these `options` (and merges them with the table properties) per https://github.com/apache/spark/pull/29712.

This PR can be applied from `master` through `branch-3.1`.

### Why are the changes needed?

To make `UnresolvedRelation.options`'s description clearer.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Scala linter by `dev/linter-scala`.

Closes #34075 from HyukjinKwon/minor-comment-unresolved-releation.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
2021-09-22 23:00:15 -07:00
allisonwang-db 4a8dc5f7a3 [SPARK-36747][SQL] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
### What changes were proposed in this pull request?
This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions.

For example
```sql
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t)
```

```
== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions.
+- Project [sum(c2)#10L]
   +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))  <--- Aggregate expression in join condition
      :- LocalRelation [c2#3]
      +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
         +- LocalRelation [c1#2, c2#3]

java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false])
```
Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions.
079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)

### Why are the changes needed?
To fix an existing optimizer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #33990 from allisonwang-db/spark-36747-collapse-agg.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-23 12:50:27 +08:00
Angerszhuuuu a7cbe69986 [SPARK-36753][SQL] ArrayExcept handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `ArrayExcept` now handles equal `NaN` values.

### How was this patch tested?
Added UT

Closes #33994 from AngersZhuuuu/SPARK-36753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-22 23:51:41 +08:00
Huaxin Gao 1d7d972021 [SPARK-36760][SQL] Add interface SupportsPushDownV2Filters
Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This is the 2nd PR for V2 Filter support. This PR does the following:

- Add interface SupportsPushDownV2Filters

Future work:
- refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them
- For V2 file sources: implement v2 filter -> parquet/orc filter. CSV and JSON don't have real filters, but the current code also needs to change to have v2 filter -> `JacksonParser`/`UnivocityParser`
- For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter
- We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths

The reasons we reached the above conclusion:
- The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters.
- We provide `SupportsPushDownV2Filters` in this PR so V2 data sources (e.g. Iceberg) can implement it and use V2 Filters
- There is a lot of work to implement v2 filters in the V2 file sources, for the following reasons:

possible approaches for implementing V2Filter:
1. keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter
    We don't need v1->v2 and v2->v1
    problem with this approach: there are lots of code duplication

2.  We will implement v2 filter -> parquet/orc filter
     file source v1: v1 filter -> v2 filter -> parquet/orc filter
     We will need V1 -> V2
     This is the approach I am using in https://github.com/apache/spark/pull/33973
     In that PR, I have
     v2 orc: v2 filter -> orc filter
     V1 orc: v1 -> v2 -> orc filter

     v2 csv: v2->v1, new UnivocityParser
     v1 csv: new UnivocityParser

    v2 Json: v2->v1, new JacksonParser
    v1 Json: new JacksonParser

    csv and Json don't have real filters, they just use filter references, should be OK to use either v1 and v2. Easier to use
    v1 because no need to change.

    I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have
    v2 parquet: v2 filter -> parquet filter
    v1 parquet: v1 -> v2 -> parquet filter

    Problem with this approach:
    1. It's not easy to implement V1->V2 because V2 filters have `LiteralValue` and need type info. We already lost the type information when we converted the Expression filter to a v1 filter.
    2. parquet is OK
        Use Timestamp as example, parquet filter takes long for timestamp
        v2 parquet: v2 filter -> parquet filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       V1 parquet: v1 -> v2 -> parquet filter
       timestamp
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       but we have problem for orc because orc filter takes java Timestamp
       v2 orc: v2 filter -> orc filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)->  parquet filter (Timestamp)

       V1 orc: v1 -> v2 -> orc filter
       Expression (Long) ->  v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Timestamp)
      This defeats the purpose of implementing v2 filters.
3.  keep what we have for file source v1: v1 filter -> parquet/orc filter
     file source v2: v2 filter -> v1 filter -> parquet/orc filter
     We will need V2 -> V1
     we have similar problem as approach 2.

So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter;
    for file source v2 we will implement v2 filter -> parquet/orc filter) is better, but there is a lot of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both the V1 and V2 file sources can use them.

### Why are the changes needed?
Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added new UT

Closes #34001 from huaxingao/v2filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-22 16:58:13 +08:00
Yufei Gu 688b95b136 [SPARK-36814][SQL] Make class ColumnarBatch extendable
### What changes were proposed in this pull request?
Change class ColumnarBatch to a non-final class

### Why are the changes needed?
To support better vectorized reading in multiple data sources, `ColumnarBatch` needs to be extendable. For example, to support row-level deletes (https://github.com/apache/iceberg/issues/3141) in Iceberg's vectorized read, we need to filter out deleted rows in a batch, which requires `ColumnarBatch` to be extendable.
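
As a rough illustration (not code from this PR), here is a minimal sketch of the kind of subclass that becomes possible once the class is non-final; the subclass name and the extra deletion-mask field are assumptions, not part of Spark or Iceberg:

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical subclass that carries a per-row deletion mask alongside the batch.
class DeleteAwareColumnarBatch(
    columns: Array[ColumnVector],
    val isDeleted: Array[Boolean])
  extends ColumnarBatch(columns) {

  // Number of rows that survive the row-level delete filter.
  def numLiveRows: Int = isDeleted.count(!_)
}
```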

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed.

Closes #34054 from flyrain/columnarbatch-extendable.

Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-21 15:24:55 +00:00
Max Gekk d2340f8e1c [SPARK-36807][SQL] Merge ANSI interval types to a tightest common type
### What changes were proposed in this pull request?
In the PR, I propose to modify `StructType` to support merging of ANSI interval types with different fields.

### Why are the changes needed?
This will allow merging of schemas from different datasource files.
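
For illustration only, a hedged sketch of how this could surface through datasource schema merging, assuming Parquet files that carry ANSI interval columns with different fields (paths are placeholders):

```scala
// Hypothetical paths; assume one file stores an `interval year` column `i` and the
// other an `interval month` column `i`.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/intervals/file_year", "/tmp/intervals/file_month")

merged.printSchema()
// expected merged (tightest common) type:
// root
//  |-- i: interval year to month (nullable = true)
```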

### Does this PR introduce _any_ user-facing change?
No, the ANSI interval types haven't been released yet.

### How was this patch tested?
Added new test to `StructTypeSuite`.

Closes #34049 from MaxGekk/merge-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-21 10:20:16 +03:00
Yuto Akutsu 30d17b6333 [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC
### What changes were proposed in this pull request?

Add new built-in SQL functions: secant and cosecant, and add them as Scala and Python functions.
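
For illustration, a hedged usage sketch assuming the SQL function names follow the title (`sec` and `csc`):

```scala
// sec(x) = 1 / cos(x), csc(x) = 1 / sin(x)
spark.sql("SELECT sec(0.0), csc(radians(90))").show()
// expected: 1.0 for both, since cos(0) = 1 and sin(pi/2) = 1
```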

### Why are the changes needed?

Cotangent is already supported in Spark SQL, but secant and cosecant are missing, though I believe they would be used as much as `cot`.
Related Links: [SPARK-20751](https://github.com/apache/spark/pull/17999) [SPARK-36660](https://github.com/apache/spark/pull/33906)

### Does this PR introduce _any_ user-facing change?

Yes, users can now use these functions.

### How was this patch tested?

Unit tests

Closes #33988 from yutoacts/SPARK-36683.

Authored-by: Yuto Akutsu <yuto.akutsu@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-20 22:38:47 +09:00
Angerszhuuuu 2fc7f2f702 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `ArrayIntersect` now handles equal `NaN` values.

### How was this patch tested?
Added UT

Closes #33995 from AngersZhuuuu/SPARK-36754.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-20 16:48:59 +08:00
PengLei c2881c5ee2 [SPARK-36107][SQL] Refactor first set of 20 query execution errors to use error classes
### What changes were proposed in this pull request?
Refactor some exceptions in `QueryExecutionErrors` to use error classes, as follows:
```
columnChangeUnsupportedError
logicalHintOperatorNotRemovedDuringAnalysisError
cannotEvaluateExpressionError
cannotGenerateCodeForExpressionError
cannotTerminateGeneratorError
castingCauseOverflowError
cannotChangeDecimalPrecisionError
invalidInputSyntaxForNumericError
cannotCastFromNullTypeError
cannotCastError
cannotParseDecimalError
simpleStringWithNodeIdUnsupportedError
evaluateUnevaluableAggregateUnsupportedError
dataTypeUnsupportedError
dataTypeUnsupportedError
failedExecuteUserDefinedFunctionError
divideByZeroError
invalidArrayIndexError
mapKeyNotExistError
rowFromCSVParserNotExpectedError
```

### Why are the changes needed?
[SPARK-36107](https://issues.apache.org/jira/browse/SPARK-36107)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT test cases

Closes #33538 from Peng-Lei/SPARK-36017.

Lead-authored-by: PengLei <peng.8lei@gmail.com>
Co-authored-by: Lei Peng <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-20 10:34:19 +09:00
Angerszhuuuu 69e006dd53 [SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test
### What changes were proposed in this pull request?
Add comments about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN, and add unit tests for this.
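
For illustration, a hedged sketch of the behavior the comments and tests document, assuming Spark's usual ordering in which NaN sorts greater than any other double:

```scala
spark.sql(
  "SELECT array_min(array(cast('nan' as double), 1d)), " +
  "array_max(array(cast('nan' as double), 1d))").show()
// expected: array_min returns 1.0 and array_max returns NaN,
// because NaN is ordered as greater than any other double value
```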

### Why are the changes needed?
Add Unit test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #34008 from AngersZhuuuu/SPARK-36740.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 21:42:08 +08:00
Liang-Chi Hsieh cdd7ae937d [SPARK-36673][SQL] Fix incorrect schema of nested types of union
### What changes were proposed in this pull request?

This patch proposes to fix incorrect schema of `union`.

### Why are the changes needed?

The current `union` result of nested struct columns is incorrect. By definition of the `union` API, it should resolve columns by position, not by name. Right now, when determining the `output` (a.k.a. the schema) of a union plan, we use the `merge` API, which actually merges two structs (simply think of it as concatenating the fields of two structs when they don't overlap). The merging behavior doesn't match the `union` definition.

So currently we get an incorrect schema even though the query result is correct. We should fix the incorrect schema.
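
A hedged illustration of the by-position semantics (the struct contents are made up):

```scala
val df1 = spark.sql("SELECT named_struct('a', 1, 'b', 2) AS s")
val df2 = spark.sql("SELECT named_struct('c', 3, 'd', 4) AS s")

// union resolves columns by position, so the result schema should follow the
// left side's nested field names rather than a merge of both structs' fields.
df1.union(df2).printSchema()
// expected: s: struct<a:int, b:int>
```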

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug of incorrect schema.

### How was this patch tested?

Added unit test.

Closes #34025 from viirya/SPARK-36673.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 21:37:19 +08:00
Wenchen Fan 651904a2ef [SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions
### What changes were proposed in this pull request?

The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key ideas behind this rule are that evaluating a project is relatively expensive, that expression evaluation is cheap, and that the expression duplication caused by this rule is not a problem. The last assumption is, unfortunately, not always true:
- A user can invoke some expensive UDF, this now gets invoked more often than originally intended.
- A projection is very cheap in whole stage code generation. The duplication caused by `CollapseProject` does more harm than good here.

This PR addresses this problem, by only collapsing projects when it does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or when its evaluation does not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category).
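
A minimal sketch of the kind of duplication being avoided (the UDF and column names are hypothetical):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Stand-in for a costly user-defined function.
val expensive = udf((s: String) => s.length)

val df = spark.range(10).selectExpr("cast(id as string) AS s")
val result = df
  .select(expensive(col("s")).as("x"))
  .select((col("x") + 1).as("y1"), (col("x") * 2).as("y2"))
// Collapsing the two projects would rewrite this to
//   select(expensive(s) + 1, expensive(s) * 2)
// evaluating the UDF twice per row; after this change such projects are kept
// separate when an input reference is consumed more than once by non-cheap expressions.
```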

### Why are the changes needed?

We have seen multiple complaints about `CollapseProject` in the past, because it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 .

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT and existing test

Closes #33958 from cloud-fan/collapse.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 21:34:21 +08:00
Angerszhuuuu e356f6aa11 [SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
This PR fixes it based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `ArrayDistinct` no longer shows duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 20:48:17 +08:00
Wenchen Fan 4145498826 [SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions
### What changes were proposed in this pull request?

In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.

### Why are the changes needed?

Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 16:49:54 +09:00
Wenchen Fan dfd5237c0c [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project
### What changes were proposed in this pull request?

`ScanOperation` collects adjacent Projects and Filters. The caller side always assumes that the collected Filters should run before the collected Projects, which means `ScanOperation` effectively pushes Filter through Project.

Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule.
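
For example (a hedged sketch; the table and column names are made up), a filter must not be pushed below a nondeterministic projection:

```scala
// If the filter on `r` were pushed below the rand() projection, rand() would be
// re-evaluated in a different position and the selected rows could change.
spark.sql("""
  SELECT * FROM (
    SELECT a, rand() AS r FROM t
  ) tmp
  WHERE r > 0.5
""")
```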

### Why are the changes needed?

Fix a bug that violates the semantic of nondeterministic expressions.

### Does this PR introduce _any_ user-facing change?

Most likely no change, but in some cases, this is a correctness bug fix which changes the query result.

### How was this patch tested?

existing tests

Closes #34023 from cloud-fan/scan.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 10:51:15 +08:00
Josh Rosen 3ae6e6775b [SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite
### What changes were proposed in this pull request?

This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`.

The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`.

In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary.

### Why are the changes needed?

Previously, `SparkSubmitSuite` tests would fail with messages like:

```
[info] - launch simple application with spark-submit *** FAILED *** (1 second, 832 milliseconds)
[info]   Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command.

After this change, those tests will fail with detailed error messages that include the text of the failed command plus timestamped logs captured from the failed process:

```
[info] - launch simple application with spark-submit *** FAILED *** (2 seconds, 800 milliseconds)
[info]   spark-submit returned with exit code 101.
[info]   Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar'
[info]
[info]   2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[info]   2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually ran the affected test suites.

Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2021-09-16 14:28:47 -07:00
Huaxin Gao fb11c466ae [SPARK-36587][SQL][FOLLOWUP] Remove unused CreateNamespaceStatement
### What changes were proposed in this pull request?
remove `CreateNamespaceStatement`

### Why are the changes needed?
remove unused code

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes #34015 from huaxingao/rm_create_ns_stmt.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-16 19:56:45 +08:00
Yannis Sismanis afd406e4d0 [SPARK-36745][SQL] ExtractEquiJoinKeys should return the original predicates on join keys
### What changes were proposed in this pull request?

This PR updates `ExtractEquiJoinKeys` to return an extra field for the join condition with join keys.

### Why are the changes needed?

Sometimes we need to restore the original join condition. Before this PR, we had to build `EqualTo` expressions from the join keys, which is not always the original join condition. E.g. `EqualNullSafe(a, b)` will become `EqualTo(Coalesce(a, lit), Coalesce(b, lit))`. After this PR, we can simply use the newly returned field.
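
A hedged example of a query where this matters (table and column names are assumptions):

```scala
// The join condition `t1.a <=> t2.b` is EqualNullSafe; rebuilding it from the
// extracted join keys would yield EqualTo over Coalesce instead of the original
// predicate, which is why the extractor now also returns the original predicates.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a <=> t2.b")
```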

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33985 from YannisSismanis/SPARK-36475-fix.

Authored-by: Yannis Sismanis <yannis.sismanis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-16 13:16:16 +08:00
Angerszhuuuu b665782f0d [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns false, but it should return true.
This issue is caused by `scala.mutable.HashSet` not being able to handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `arrays_overlap` now handles equal `NaN` values.

### How was this patch tested?
Added UT

Closes #34006 from AngersZhuuuu/SPARK-36755.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:31:46 +08:00
Angerszhuuuu 638085953f [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515, use a normalized `NaN`.

### Why are the changes needed?
Use normalized NaN for duplicated NaN value

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:04:09 +08:00
Kousuke Saruta e43b9e8520 [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields
### What changes were proposed in this pull request?

This PR fixes a perf issue in `SchemaPruning` when a struct has many fields (e.g. >10K fields).
The root cause is that `SchemaPruning.sortLeftFieldsByRight` performs an O(N * M) search:
```
 val filteredRightFieldNames = rightStruct.fieldNames
    .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
```

To fix this issue, this PR proposes to use a `HashMap` so that lookups take constant time.
This PR also adds `case _ if left == right => left` to the method as a short circuit.

### Why are the changes needed?

To fix a perf issue.

### Does this PR introduce _any_ user-facing change?

No. The logic should be identical.

### How was this patch tested?

I confirmed that the following micro benchmark finishes within a few seconds.
```
import org.apache.spark.sql.catalyst.expressions.SchemaPruning
import org.apache.spark.sql.types._

var struct1 = new StructType()
(1 to 50000).foreach { i =>
  struct1 = struct1.add(new StructField(i + "", IntegerType))
}

var struct2 = new StructType()
(50001 to 100000).foreach { i =>
  struct2 = struct2.add(new StructField(i + "", IntegerType))
}

SchemaPruning.sortLeftFieldsByRight(struct1, struct2)
SchemaPruning.sortLeftFieldsByRight(struct2, struct2)
```

The correctness should be checked by existing tests.

Closes #33981 from sarutak/improve-schemapruning-performance.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 10:33:58 +09:00
Angerszhuuuu f71f37755d [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
In this PR we add a wrapper for `OpenHashSet` that can handle `null`, `Double.NaN`, and `Float.NaN` together.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `ArrayUnion` no longer shows duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-14 18:25:47 +08:00
Fu Chen 52c5ff20ca [SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf
### What changes were proposed in this pull request?

Fix an `InferFiltersFromGenerate` bug: the rule should not infer filters for `Generate` when its children contain an expression that is an instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`.
Before this PR, the following case throws an exception.

```scala
spark.udf.register("vec", (i: Int) => (0 until i).toArray)
sql("select explode(vec(8)) as c1").show
```

```
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

java.lang.RuntimeException:
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

	at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
	at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
	at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)
	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259)
	at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2755)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2962)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
```

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

Unit test.

Closes #33956 from cfmcgrady/SPARK-36715.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:26:11 +09:00
Lukas Rytz 1a62e6a2c1 [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR suggests working around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested using the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:50 -05:00
Max Gekk bd62ad9982 [SPARK-36736][SQL] Support ILIKE (ALL | ANY | SOME) - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL | ANY | SOME)` expression - `ILIKE`. In this way, Spark users can match strings against one or more patterns in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_example(subject varchar(20));
spark-sql> insert into ilike_example values
         > ('jane doe'),
         > ('Jane Doe'),
         > ('JANE DOE'),
         > ('John Doe'),
         > ('John Smith');
spark-sql> select *
         > from ilike_example
         > where subject ilike any ('jane%', '%SMITH')
         > order by subject;
JANE DOE
Jane Doe
John Smith
jane doe
```

The syntax of `ILIKE` is similar to `LIKE`:
```
str NOT? ILIKE (ANY | SOME | ALL) (pattern+)
```

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test to test parsing of `ILIKE`:
```
$ build/sbt "test:testOnly *.ExpressionParserSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql"
```

Closes #33966 from MaxGekk/ilike-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 22:51:49 +08:00
Kousuke Saruta e858cd568a [SPARK-36724][SQL] Support timestamp_ntz as a type of time column for SessionWindow
### What changes were proposed in this pull request?

This PR proposes to support `timestamp_ntz` as a type of time column for `SessionWindow`, like `TimeWindow` does.
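
A hedged usage sketch (assuming `java.time.LocalDateTime` values map to a `TIMESTAMP_NTZ` column, with made-up data):

```scala
import java.time.LocalDateTime
import org.apache.spark.sql.functions.{col, count, session_window}
import spark.implicits._

val events = Seq(
  (LocalDateTime.parse("2021-09-13T10:00:00"), 1),
  (LocalDateTime.parse("2021-09-13T10:03:00"), 1)
).toDF("ts", "v")  // `ts` is inferred as timestamp_ntz

events
  .groupBy(session_window(col("ts"), "5 minutes"))
  .agg(count("*"))
  .show(truncate = false)
```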

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33965 from sarutak/session-window-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-13 21:47:43 +08:00
Yuto Akutsu 3747cfdb40 [SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API
### What changes were proposed in this pull request?

Fixed wrong documentation on Cot API

### Why are the changes needed?

[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #33978 from yutoacts/SPARK-36738.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-13 21:51:29 +09:00
ulysses-you 4a6b2b9fc8 [SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle
### What changes were proposed in this pull request?

- move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase.
- run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase.
- add `SkewJoinAwareCost` to support estimating the skewed join cost
- add a new config to decide whether to force-optimize skewed joins
- in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with lower cost.

### Why are the changes needed?

In general, a skewed join has more impact on performance than one more shuffle. It makes sense to force-optimize skewed joins even if that introduces an extra shuffle.

A common case:
```
HashAggregate
  SortMergeJoin
    Sort
      Exchange
    Sort
      Exchange
```
and after this PR, the plan looks like:
```
HashAggregate
  Exchange
    SortMergeJoin (isSkew=true)
      Sort
        Exchange
      Sort
        Exchange
```

Note that, the new introduced shuffle also can be optimized by AQE.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

* Add new test
* pass exists test `SPARK-30524: Do not optimize skew join if introduce additional shuffle`
* pass exists test `SPARK-33551: Do not use custom shuffle reader for repartition`

Closes #32816 from ulysses-you/support-extra-shuffle.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 17:21:27 +08:00
Huaxin Gao 1f679ed8e9 [SPARK-36556][SQL] Add DSV2 filters
Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating a Catalyst `Expression` to a V1 filter, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversion from Catalyst types to Scala types and back is avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-11 10:12:21 -07:00
dgd-contributor 711577e238 [SPARK-36687][SQL][CORE] Rename error classes with _ERROR suffix
### What changes were proposed in this pull request?
Remove the redundant `_ERROR` suffix in error-classes.json.

### Why are the changes needed?
Clean up error classes  to reduce clutter

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33944 from dgd-contributor/SPARK-36687.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-10 10:00:28 +09:00
Max Gekk b74a1ba69f [SPARK-36674][SQL] Support ILIKE - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `like` expression - `ilike`. In this way, Spark users can match strings to a single pattern in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_ex(subject varchar(20));
spark-sql> insert into ilike_ex values
         > ('John  Dddoe'),
         > ('Joe   Doe'),
         > ('John_down'),
         > ('Joe down'),
         > (null);
spark-sql> select *
         >     from ilike_ex
         >     where subject ilike '%j%h%do%'
         >     order by 1;
John  Dddoe
John_down
```

The syntax of `ilike` is similar to `like`:
```
str ILIKE pattern[ ESCAPE escape]
```

#### Implementation details
`ilike` is implemented as a runtime replaceable expression to `Like(Lower(left), Lower(right), escapeChar)`. Such replacement is acceptable because `ilike`/`like` recognise only `_` and `%` as special characters but not special character classes.

**Note:** The PR aims to support `ilike` in SQL only. Others APIs can be updated separately on demand.

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [ClickHouse](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#ilike)
    - [Vertica](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm)
    - [Impala](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_operators.html#ilike)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test:
```
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
$ build/sbt "test:testOnly *SQLKeywordSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```

Closes #33919 from MaxGekk/ilike-single-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:55:20 +08:00
Andrew Liu 9b633f2075 [SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe
### What changes were proposed in this pull request?

Fix `SimplifyConditionalsInPredicate` to be null-safe.

Reproducible:

```
import org.apache.spark.sql.types.{StructField, BooleanType, StructType}
import org.apache.spark.sql.Row

val schema = List(
  StructField("b", BooleanType, true)
)
val data = Seq(
  Row(true),
  Row(false),
  Row(null)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// cartesian product of true / false / null
val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal"))
df2.createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// actual:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// +-----+--------+
spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// expected:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// | null|    true|
// +-----+--------+
```

### Why are the changes needed?

This is a regression that leads to incorrect results.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate.

Authored-by: Andrew Liu <andrewlliu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:32:40 +08:00
Hyukjin Kwon 34f80ef313 [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of `spark.sql.timestampType` configuration

### Why are the changes needed?

To complete `TimestampNTZ` support.

In more details:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). This naive `datetime` interpretation is up to the program to decide how to interpret, e.g.) whether a local time vs UTC time as an example. Although some Python built-in APIs assume they are local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

    > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
    - that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
    - if you want to handle them as if a local timezone, this interpretation is matched to `TimestamType` (local time)

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
    - `Timestamp(..., timezone=sparkLocalTimezone)` ->  `TimestamType`
    - `Timestamp(..., timezone=null)` ->  `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time).

    One example is that pandas allows to convert these naive `datetime` as if they are in UTC by default:

    ```python
    >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
    0    0
    ```

    whereas in Spark:

    ```python
    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +------+
    |    _1|
    +------+
    |-32400|
    +------+

    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +---+
    | _1|
    +---+
    |  0|
    +---+
    ```

    In contrast, some APIs like `pandas.fromtimestamp` assume they are local times:

    ```python
    >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
    Timestamp('1970-01-01 09:00:00')
    ```

    For native Python, users can decide how to interpret native `datetime` so it's fine. The problem is that pandas API on Spark case would require to have two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.

    As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, now pandas API on Spark can handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:57:38 +09:00
Huaxin Gao 23794fb303 [SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference
### What changes were proposed in this pull request?
Currently, we use `FieldReference` as the aggregate column type; it should be `NamedReference` instead.

### Why are the changes needed?
`FieldReference` is a private class; we should use `NamedReference` instead.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #33927 from huaxingao/agg_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 14:05:44 +08:00
Venkata Sai Akhil Gudesa 2ed6e7bc5d [SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections
### What changes were proposed in this pull request?

This PR filters out `ExtractValue`s that contain any aggregation function in the `NestedColumnAliasing` rule, to prevent cases where aggregations are pushed down into projections.

### Why are the changes needed?

To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.

Consider the following schema:
```
root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`

Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:

```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 18:15:48 -07:00
yangjie01 bdb73bbc27 [SPARK-36613][SQL][SS] Use EnumSet as the implementation of Table.capabilities method return value
### What changes were proposed in this pull request?
The `Table.capabilities` method returns a `java.util.Set` of the `TableCapability` enumeration type, which is currently implemented with `java.util.HashSet`. Such a set can be replaced with `java.util.EnumSet`, because `EnumSet` implementations can be much more efficient than other sets.
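
For illustration, a hedged sketch of the difference; the capability values are real `TableCapability` members, but the surrounding code is not taken from the PR:

```scala
import java.util
import org.apache.spark.sql.connector.catalog.TableCapability

// Previous style: a general-purpose HashSet of enum values.
val hashCaps: util.Set[TableCapability] = new util.HashSet[TableCapability]()
hashCaps.add(TableCapability.BATCH_READ)
hashCaps.add(TableCapability.BATCH_WRITE)

// EnumSet style: backed by a bit vector, so creation and contains() are cheaper.
val enumCaps: util.Set[TableCapability] =
  util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE)

assert(enumCaps.contains(TableCapability.BATCH_READ))
```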

### Why are the changes needed?
Use more appropriate data structures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Added a new benchmark to compare the `create` and `contains` operations between `EnumSet` and `HashSet`

Closes #33867 from LuciferYang/SPARK-36613.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:23:05 -05:00
yangjie01 35848385ae [SPARK-36602][CORE][SQL] Clean up redundant asInstanceOf casts
### What changes were proposed in this pull request?
This PR removes redundant `asInstanceOf` casts from the Spark code base.
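
A hedged, generic example of the kind of cast being removed (not an actual snippet from the PR):

```scala
// The value is already statically typed as Seq[String], so the cast is a no-op.
val names: Seq[String] = Seq("a", "b", "c")

val before = names.asInstanceOf[Seq[String]].mkString(",")  // redundant cast
val after  = names.mkString(",")                            // equivalent, cast removed

assert(before == after)
```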

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.

Closes #33852 from LuciferYang/cleanup-asInstanceof.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:22:28 -05:00
Senthil Kumar 6bd491ecb8 [SPARK-36643][SQL] Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
### What changes were proposed in this pull request?

This PR adds more information to the ERROR log produced when a SparkConf entry is modified while `spark.sql.legacy.setCommandRejectsSparkCoreConfs` is set.

### Why are the changes needed?

Right now, `spark.sql.legacy.setCommandRejectsSparkCoreConfs` defaults to true in Spark 3.x in order to prevent changing Spark core configs. However, the current error message leaves users confused about whether Spark core configs can be modified in Spark 3.x at all.
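
A hedged sketch of the scenario; the config key in the SET command is only an example of a Spark core config:

```scala
// spark.sql.legacy.setCommandRejectsSparkCoreConfs defaults to true in Spark 3.x, so
// changing a Spark core config through the SET command fails. This PR makes the
// resulting error message point users to the migration guide instead of leaving
// them guessing whether the config can be changed at all.
spark.sql("SET spark.executor.memory=4g")   // expected to throw an AnalysisException
```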

### Does this PR introduce _any_ user-facing change?

Yes. A trivial change to the error message is included.

### How was this patch tested?

A new test was added: "SPARK-36643: Show migration guide when attempting SparkConf".

Closes #33894 from senthh/1st_Sept_2021.

Lead-authored-by: Senthil Kumar <senthh@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-03 23:49:45 -07:00
Kousuke Saruta cf3bc65e69 [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
### What changes were proposed in this pull request?

This PR fixes an issue where the `sequence` builtin function throws `ArrayIndexOutOfBoundsException` when the arguments satisfy `start == stop && step < 0`.
Here is an example:
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.
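
A hedged sketch of the expected behavior once the fix is in place; the exact result formatting is an assumption:

```scala
// With start == stop, the step sign should not matter: instead of throwing
// ArrayIndexOutOfBoundsException, the call is expected to return a single-element sequence.
spark.sql(
  "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month) AS seq"
).show(truncate = false)
// expected (assumption): [2021-08-31 00:00:00]
```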

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33895 from sarutak/fix-sequence-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-03 23:25:18 +09:00
Kent Yao 7f1ad7be18 [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### What changes were proposed in this pull request?

Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

### Why are the changes needed?

`spark.sql.execution.topKSortFallbackThreshold` is currently an internal config hidden from users, with `Integer.MAX_VALUE - 15` as its default. In many real-world cases, if K is very large, there can be performance issues.

It's better to leave this choice to users
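
A hedged usage sketch; the threshold and query sizes are arbitrary:

```scala
// When an ORDER BY ... LIMIT k query has k above this threshold, Spark falls back from
// the top-K TakeOrderedAndProject path to a full sort followed by a limit, avoiding
// keeping k rows per task in memory.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")
spark.range(1000000).orderBy("id").limit(50000).explain()
```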

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.topKSortFallbackThreshold` is now user-facing.

### How was this patch tested?

passing GA

Closes #33904 from yaooqinn/SPARK-36659.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-09-03 19:11:37 +08:00
Cheng Su 9054a6ac00 [SPARK-36652][SQL] AQE dynamic join selection should not apply to non-equi join
### What changes were proposed in this pull request?

Currently `DynamicJoinSelection` has two features: (1) demote broadcast hash join and (2) promote shuffled hash join. Both are achieved by adding a join hint to the query plan, and both only work for equi joins. However, [the rule currently matches on the `Join` operator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala#L71), so it would add a hint to a non-equi join by mistake (see the added test query in `JoinHintSuite.scala` for an example).

This PR is to fix `DynamicJoinSelection` to only apply to equi-join, and improve `checkHintNonEquiJoin` to check we should not add `PREFER_SHUFFLE_HASH` for non-equi join.
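
A hedged illustration of a non-equi join that the rule should now leave alone (not the exact query added to `JoinHintSuite.scala`):

```scala
import org.apache.spark.sql.functions.col

// A non-equi join: there is no equality condition, so it cannot be planned as a
// broadcast hash join or shuffled hash join, and DynamicJoinSelection should not
// attach those hints to it.
val left  = spark.range(1000).toDF("l")
val right = spark.range(1000).toDF("r")
left.join(right, col("l") > col("r")).explain()
```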

### Why are the changes needed?

Improve the rule so that it does not add join hints to non-equi joins by mistake.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinHintSuite.scala`.

Closes #33899 from c21/aqe-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-02 20:04:49 -07:00
Huaxin Gao 38b6fbd9b8 [SPARK-36351][SQL] Refactor filter push down in file source v2
### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push all the filters (partition filters + data filters) to the file source and then pass all of them as post-scan filters to the v2 `Scan`. Later, in `PruneFileSourcePartitions`, we separate partition filters from data filters and set them on the file source as `Expression`s.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, we already have the partition column information, so we separate partition filters from data filters there.

The benefit of doing this:
- we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters in `V2ScanRelationPushDown`; otherwise there is no way to tell which filters are partition filters, and we cannot push down aggregates for Parquet even when we only have partition filters.
- by separating the filters early in `V2ScanRelationPushDown`, we only need to check the data filters to find out which ones should be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we check all the filters (both partition filters and data filters).
- similarly, we only need to pass data filters as post-scan filters to the v2 `Scan`, because partition filters are used for partition pruning only and there is no need to pass them as post-scan filters.

To do this, the PR makes the following changes:

- add `pushFilters` to file source v2 (a rough sketch of its assumed shape follows this list). In this method:
    - push both the partition filters and the data filters to the file source as `Expression`s. `Expression` filters are needed because we use them for partition pruning.
    - data filters are used for filter push down. If the file source needs to push down data filters, it translates them from `Expression` to `sources.Filter` and then decides which filters to push down.
    - partition filters are used for partition pruning.
- file source v2 no longer needs to implement `SupportsPushDownFilters`, because once we separate the two types of filters, we have already set them on the file data sources; using `SupportsPushDownFilters` to set the filters again would be redundant.
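
A rough sketch of the assumed shape of the new hook, for illustration only; the method name comes from the description above, but the exact signature and enclosing type in the PR may differ:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.sources.Filter

// Illustration only: assumed shape of the new hook on the file-source v2 scan builder.
trait FileSourcePushFilters {
  // Partition filters stay as Expressions and are used for partition pruning only.
  // Data filters are candidates for translation to sources.Filter (e.g. Parquet/ORC
  // predicates) and push down; filters that cannot be pushed become post-scan filters.
  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit

  // Data filters that were translated and accepted by the underlying format (assumed name).
  def pushedDataFilters: Array[Filter] = Array.empty
}
```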

### Why are the changes needed?

See the first section above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-02 19:11:43 -07:00
Angerszhuuuu 568ad6aa44 [SPARK-36637][SQL] Provide proper error message when use undefined window frame
### What changes were proposed in this pull request?
The two cases below, which use an undefined window specification, should produce a proper error message.

1. Using an undefined window specification with a window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing: the query does use a window specification `w`; the real problem is that `w` is not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

2. Using an undefined window specification with an aggregate function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting the global aggregate, the `UnresolvedWindowExpression` should be skipped.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
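
For contrast, a hedged sketch of the queries once `w` is actually defined; the WINDOW clauses below are illustrative and not part of the PR:

```scala
// Defining w in a WINDOW clause makes both queries resolve normally.
spark.sql("""
  SELECT nth_value(employee_name, 2) OVER w AS second_highest_salary
  FROM basic_pays
  WINDOW w AS (ORDER BY salary DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
""").show()

spark.sql("""
  SELECT SUM(salary) OVER w AS sum_salary
  FROM basic_pays
  WINDOW w AS (ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
""").show()
```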

### Why are the changes needed?
Provide proper error message

### Does this PR introduce _any_ user-facing change?
Yes, the error messages are improved as described above.

### How was this patch tested?
Added UT

Closes #33892 from AngersZhuuuu/SPARK-36637.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-02 22:32:31 +08:00
Kazuyuki Tanimura 799a0116a8 [SPARK-36607][SQL] Support BooleanType in UnwrapCastInBinaryComparison
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer, which currently supports `NumericType` only.

The main idea is to treat `BooleanType` as a 1-bit integer so that we can reuse all the optimizations already defined in `UnwrapCastInBinaryComparison`.

This work is an extension of SPARK-24994 and SPARK-32858

### Why are the changes needed?
The current implementation of Spark, without this PR, cannot properly optimize the filter in the following case:
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The cast prevents the filter from being pushed down. With this PR, the comparison is instead folded to a `false` filter and returns early, since there cannot be any matching rows anyway (empty result).
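
A hedged sketch of what the optimization enables; table and column names follow the example above, and the expected-plan comment is an assumption:

```scala
// Before: the predicate stays as cast(boolean_field as int) = 2, which data sources
// cannot push down. After: since a boolean can only be 0 or 1 when viewed as an
// integer, the comparison against 2 can be folded to false and the scan returns early.
spark.sql("SELECT * FROM t WHERE boolean_field = 2").explain(true)
```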

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite   -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite  -- -z SPARK-36607"
```

Closes #33865 from kazuyukitanimura/SPARK-36607.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-01 14:27:30 +08:00
Bo Zhang e33cdfb317 [SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches
### What changes were proposed in this pull request?

This change creates a new type of Trigger: Trigger.AvailableNow for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.

To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.

This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.

For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class, `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps such sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
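
A hedged usage sketch of the new trigger; the formats and paths are illustrative:

```scala
import org.apache.spark.sql.streaming.Trigger

// Process everything that is available when the query starts, possibly across several
// micro-batches, and then stop -- like Trigger.Once but without the single huge batch.
val query = spark.readStream
  .format("text")
  .load("/tmp/streaming-input")            // illustrative path
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/streaming-ckpt")
  .option("path", "/tmp/streaming-output")
  .trigger(Trigger.AvailableNow())
  .start()

query.awaitTermination()
```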

### Why are the changes needed?

Currently, streaming queries with Trigger.Once always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver may run out of memory.

### Does this PR introduce _any_ user-facing change?

Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.

### How was this patch tested?

Added unit tests.

Closes #33763 from bozhang2820/new-trigger.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-01 15:02:21 +09:00
Hyukjin Kwon 4ed2dab5ee [SPARK-36608][SQL] Support TimestampNTZ in Arrow
### What changes were proposed in this pull request?

This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
With this change, `TimestampNTZType` values can be written through Arrow as a Timestamp with a `null` timezone.

### Why are the changes needed?

To complete the support of `TimestampNTZType` in Apache Spark.

### Does this PR introduce _any_ user-facing change?

Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`

### How was this patch tested?

Unittests were added.

Closes #33875 from HyukjinKwon/SPARK-36608-arrow.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-01 10:23:42 +09:00
Gengliang Wang 8a52ad9f82 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
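
Hedged examples of the non-numeric input combinations listed above (outputs omitted; literal values are arbitrary):

```scala
// date, number / date, interval / timestamp, interval / interval, interval for try_add,
// and interval, number for try_divide.
spark.sql("SELECT try_add(date'2021-01-01', 3)").show()
spark.sql("SELECT try_add(date'2021-01-01', INTERVAL 1 MONTH)").show()
spark.sql("SELECT try_add(timestamp'2021-01-01 00:00:00', INTERVAL 2 HOURS)").show()
spark.sql("SELECT try_add(INTERVAL 1 MONTH, INTERVAL 2 MONTHS)").show()
spark.sql("SELECT try_divide(INTERVAL 6 MONTHS, 2)").show()
```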

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:04 +09:00