Commit graph

5681 commits

Author SHA1 Message Date
Angerszhuuuu 337a1979d2 [SPARK-36754][SQL] ArrayIntersect handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value

### How was this patch tested?
Added UT

Closes #33995 from AngersZhuuuu/SPARK-36754.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f702)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-20 16:51:31 +08:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Angerszhuuuu 61d7f1da1b [SPARK-36767][SQL] ArrayMin/ArrayMax/SortArray/ArraySort add comment and Unit test
### What changes were proposed in this pull request?
Add comment about how ArrayMin/ArrayMax/SortArray/ArraySort handle NaN and add Unit test for this

### Why are the changes needed?
Add Unit test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #34008 from AngersZhuuuu/SPARK-36740.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 69e006dd53)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 21:42:21 +08:00
Liang-Chi Hsieh 895218996a [SPARK-36673][SQL] Fix incorrect schema of nested types of union
### What changes were proposed in this pull request?

This patch proposes to fix incorrect schema of `union`.

### Why are the changes needed?

The current `union` result of nested struct columns is incorrect. By definition of `union` API, it should resolve columns by position, not by name. Right now when determining the `output` (aka. the schema) of union plan, we use `merge` API which actually merges two structs (simply think it as concatenate fields from two structs if not overlapping). The merging behavior doesn't match the `union` definition.

So currently we get incorrect schema but the query result is correct. We should fix the incorrect schema.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug of incorrect schema.

### How was this patch tested?

Added unit test.

Closes #34025 from viirya/SPARK-36673.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit cdd7ae937d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 21:37:40 +08:00
Angerszhuuuu a78c4c44ed [SPARK-36741][SQL] ArrayDistinct handle duplicated Double.NaN and Float.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6aa11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 20:48:39 +08:00
Wenchen Fan 16215755b7 [SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions
### What changes were proposed in this pull request?

In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.

### Why are the changes needed?

Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 4145498826)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 16:50:01 +09:00
Wenchen Fan c1bfe1a5c4 [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project
### What changes were proposed in this pull request?

`ScanOperation` collects adjacent Projects and Filters. The caller side always assume that the collected Filters should run before collected Projects, which means `ScanOperation` effectively pushes Filter through Project.

Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule.

### Why are the changes needed?

Fix a bug that violates the semantic of nondeterministic expressions.

### Does this PR introduce _any_ user-facing change?

Most likely no change, but in some cases, this is a correctness bug fix which changes the query result.

### How was this patch tested?

existing tests

Closes #34023 from cloud-fan/scan.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dfd5237c0c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 10:51:29 +08:00
Josh Rosen 3502fda783 [SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite
### What changes were proposed in this pull request?

This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`.

The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`.

In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary.

### Why are the changes needed?

Previously, `SparkSubmitSuite` tests would fail with messages like:

```
[info] - launch simple application with spark-submit *** FAILED *** (1 second, 832 milliseconds)
[info]   Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command.

After this change, those tests will fail with detailed error messages that include the text of failed command plus timestamped logs captured from the failed proces:

```
[info] - launch simple application with spark-submit *** FAILED *** (2 seconds, 800 milliseconds)
[info]   spark-submit returned with exit code 101.
[info]   Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar'
[info]
[info]   2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[info]   2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually ran the affected test suites.

Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit 3ae6e6775b)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2021-09-16 14:31:22 -07:00
Angerszhuuuu 75bffd972d [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
This issue is caused by `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
arrays_overlap won't handle equal `NaN` value

### How was this patch tested?
Added UT

Closes #34006 from AngersZhuuuu/SPARK-36755.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b665782f0d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:32:18 +08:00
Angerszhuuuu e64155691f [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized  NaN

### Why are the changes needed?
Use normalized NaN for duplicated NaN value

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Exiting UT

Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 638085953f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:04:24 +08:00
Angerszhuuuu a472612eb8 [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr we add a wrap for OpenHashSet that can handle `null`, `Double.NaN`, `Float.NaN` together

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
ArrayUnion won't show duplicated `NaN` value

### How was this patch tested?
Added UT

Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f71f37755d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-14 18:26:02 +08:00
Fu Chen 303590b3e9 [SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf
### What changes were proposed in this pull request?

Fix InferFiltersFromGenerate bug, InferFiltersFromGenerate should not infer filter for generate when the children contain an expression which is instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`.
Before this pr, the following case will throw an exception.

```scala
spark.udf.register("vec", (i: Int) => (0 until i).toArray)
sql("select explode(vec(8)) as c1").show
```

```
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

java.lang.RuntimeException:
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

	at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
	at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
	at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)
	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259)
	at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2755)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2962)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
```

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

Unit test.

Closes #33956 from cfmcgrady/SPARK-36715.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 52c5ff20ca)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:26:21 +09:00
Lukas Rytz 2e7583799e [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `devspark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR suggests to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3, the changes compared to 0.2.0 are minor.
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 1a62e6a2c1)
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:58 -05:00
Yuto Akutsu b043ee4de7 [SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API
### What changes were proposed in this pull request?

Fixed wrong documentation on Cot API

### Why are the changes needed?

[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #33978 from yutoacts/SPARK-36738.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3747cfdb40)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-13 21:51:43 +09:00
Andrew Liu 6cb23c163c [SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe
### What changes were proposed in this pull request?

fix SimplifyConditionalsInPredicate to be null-safe

Reproducible:

```
import org.apache.spark.sql.types.{StructField, BooleanType, StructType}
import org.apache.spark.sql.Row

val schema = List(
  StructField("b", BooleanType, true)
)
val data = Seq(
  Row(true),
  Row(false),
  Row(null)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// cartesian product of true / false / null
val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal"))
df2.createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// actual:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// +-----+--------+
spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// expected:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// | null|    true|
// +-----+--------+
```

### Why are the changes needed?

is a regression that leads to incorrect results

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate.

Authored-by: Andrew Liu <andrewlliu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9b633f2075)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:32:59 +08:00
Huaxin Gao 7e8860751c [SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference
### What changes were proposed in this pull request?
Currently, we have `FieldReference` for aggregate column type, should be `NamedReference` instead

### Why are the changes needed?
`FieldReference` is a private class, should use `NamedReference` instead

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #33927 from huaxingao/agg_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 23794fb303)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 14:06:26 +08:00
Venkata Sai Akhil Gudesa 4a92b0e278 [SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections
### What changes were proposed in this pull request?

This PR filters out `ExtractValues`s that contains any aggregation function in the `NestedColumnAliasing` rule to prevent cases where aggregations are pushed down into projections.

### Why are the changes needed?

To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.

Consider the following schema:
```
root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`

Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:

```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 2ed6e7bc5d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 18:16:02 -07:00
Kent Yao aa96a374b2 [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### What changes were proposed in this pull request?

Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

### Why are the changes needed?

spark.sql.execution.topKSortFallbackThreshold now is an internal config hidden from users Integer.MAX_VALUE - 15 as its default. In many real-world cases, if the K is very big,  there would be performance issues.

It's better to leave this choice to users

### Does this PR introduce _any_ user-facing change?

 spark.sql.execution.topKSortFallbackThreshold is now user-facing

### How was this patch tested?

passing GA

Closes #33904 from yaooqinn/SPARK-36659.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 7f1ad7be18)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-03 10:27:10 -07:00
Kousuke Saruta a3901ed384 [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
### What changes were proposed in this pull request?

This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33895 from sarutak/fix-sequence-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit cf3bc65e69)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-03 23:25:33 +09:00
Angerszhuuuu 8b4cc90c44 [SPARK-36637][SQL] Provide proper error message when use undefined window frame
### What changes were proposed in this pull request?
Two case of using undefined window frame as below should provide proper error message

1. For case using undefined window frame with window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
origin error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
It's confused that in use use a window frame `w` but it's not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

2. For case using undefined window frame with aggregation function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
origin error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when convert GlobalAggregate, should skip UnresolvedWindowExpression
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

### Why are the changes needed?
Provide proper error message

### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described in desc

### How was this patch tested?
Added UT

Closes #33892 from AngersZhuuuu/SPARK-36637.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 568ad6aa44)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-02 22:32:47 +08:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Gengliang Wang 3719d87668 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a52ad9f82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:13 +09:00
Gengliang Wang 9b28c2b09e [SPARK-36597][DOCS][3.2] Fix issues in SQL function docs
### What changes were proposed in this pull request?

* the functions make_dt_interval and make_ym_interval should make it clear that some of the fields are optional
* remove the `|` symbol from the doc of `bit_get` https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#bit_get
* Address one missing comment in https://github.com/apache/spark/pull/33824#discussion_r695405699

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and preview:
![image](https://user-images.githubusercontent.com/1097932/130996918-8c1fff88-ef5a-434b-8445-df7140bad3ba.png)
![image](https://user-images.githubusercontent.com/1097932/130996954-0ced28e7-fb90-4fcc-857e-6ccc31dc3c09.png)

![image](https://user-images.githubusercontent.com/1097932/130955106-5ae32dfc-6e89-4e28-bb8a-6c1b5213051c.png)

![image](https://user-images.githubusercontent.com/1097932/130922351-2f0f262d-5624-4d08-ba83-dfa3ed0b646b.png)

Closes #33857 from gengliangwang/SPARK-36597-3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-27 13:00:12 -07:00
Jungtaek Lim 118a53d87f [SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc
### What changes were proposed in this pull request?

This PR proposes to document `window` & `session_window` function in SQL API doc page.

Screenshot of functions:

> window

![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)

> session_window

![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)

### Why are the changes needed?

Description is missing in both `window` / `session_window` functions for SQL API page.

### Does this PR introduce _any_ user-facing change?

Yes, the description of `window` / `session_window` functions will be available in SQL API page.

### How was this patch tested?

Only doc changes.

Closes #33846 from HeartSaVioR/SPARK-36595.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit bc32144a91)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-27 12:39:21 +09:00
Gengliang Wang eca81cc0ae [SPARK-36457][DOCS][3.2] Review and fix issues in Scala/Java API docs
### What changes were proposed in this pull request?

Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:

- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?

Improve API docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33845 from gengliangwang/SPARK-36457-3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 10:51:27 +08:00
Gengliang Wang c25f1e4347 [SPARK-36227][SQL][FOLLOWUP][3.2] Remove unused import in TimestampNTZType.scala
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/33837
It is to fix compilation error: https://github.com/apache/spark/runs/3431646840

### Why are the changes needed?

Fix a compilation error

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass existing UTs

Closes #33851 from gengliangwang/fixCompile.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 02:18:24 +08:00
Gengliang Wang 52b3b2d5bc [SPARK-36227][SQL][DOCS][3.2] Remove TimestampNTZ from API docs
### What changes were proposed in this pull request?

Although we try to remove TimestampNTZ from Branch 3.2 in , it still shows up in our API doc:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/scala/org/apache/spark/sql/types/TimestampNTZType.html
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/java/org/apache/spark/sql/types/DataType.html

This PR is to clean it up in the API docs by
* making the TimestampNTZ type private
* remove TimestampNTZ in DataTypes

The changes are only for branch 3.2.

### Why are the changes needed?

Fix API doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually check generated docs

Closes #33837 from gengliangwang/privateNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-26 18:27:33 +08:00
Max Gekk 0c364e607d [SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone
In the PR, I propose to use the session time zone ( see the SQL config `spark.sql.session.timeZone`) instead of JVM default time zone while converting of special timestamp_ntz strings such as "today", "tomorrow" and so on.

Current implementation is based on the system time zone, and it controverses to other functions/classes that use the session time zone. For example, Spark doesn't respects user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```

Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```

By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #33838 from MaxGekk/fix-ts_ntz-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:11:37 +08:00
Gengliang Wang 464841224c [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation. the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`

### Why are the changes needed?

Correct the SQL doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:49 +08:00
Max Gekk a4c5140242 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
at Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. Previous release 3.1 has supported the feature already till it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:09:13 +08:00
Hyukjin Kwon 26ae9e93da [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:03:00 +09:00
Gengliang Wang a313082d67 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`
SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.
### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5b4c216478)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:24:06 -07:00
Huaxin Gao e48de7884d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit cd2342691d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:35 -07:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
gengjiaan 7041c0f9f0 [SPARK-36428][TESTS][FOLLOWUP] Revert mistake change to DateExpressionsSuite
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 commits the debug code mistakely.
This PR revert the test path.

### Why are the changes needed?
Revoke debug code.

### Does this PR introduce _any_ user-facing change?
 'No'.
Just adjust test.

### How was this patch tested?
Revert non-ansi test path.

Closes #33787 from beliefer/SPARK-36428-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 462aa7cd3c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:33:39 +08:00
Shixiong Zhu 54cca7f82e [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries
### What changes were proposed in this pull request?

RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decide to rollback. But in the old Spark version, the old RocksDB version cannot read the new format.

In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can rollback their Spark version if needed.

We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint.

### Why are the changes needed?

Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.

### Does this PR introduce _any_ user-facing change?

No. RocksDB state store is a new feature in Spark 3.2, which has not yet released.

### How was this patch tested?

The new unit tests.

Closes #33749 from zsxwing/SPARK-36519.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit ea4919801a)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-19 00:23:52 -07:00
gengjiaan 3d69d0d003 [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp
### What changes were proposed in this pull request?
The implement of https://github.com/apache/spark/pull/33665 make `make_timestamp` could accepts integer type as the seconds parameter.
This PR let `make_timestamp` accepts `decimal(16,6)` type as the seconds parameter and cast integer to `decimal(16,6)` is safe, so we can simplify the code.

### Why are the changes needed?
Simplify `make_timestamp`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
New tests.

Closes #33775 from beliefer/SPARK-36428-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 707eefa3c7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-18 22:57:27 +08:00
Gengliang Wang 70635b4b26 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature.  Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:24:09 +08:00
Max Gekk 07c6976f79 [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType`

### Why are the changes needed?
To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance:
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 82a31508af)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:28:07 +03:00
Gengliang Wang 41e5144b53 [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains interval type and string type, we can't decide which
interval type the string should be promoted as. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 26d6b952dc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:25 +03:00
Liang-Chi Hsieh 3aa933b162 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows.

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:16 +09:00
Gengliang Wang c898a940e2 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal
### What changes were proposed in this pull request?

If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The related binary operation for Interval contains
- Add
- Subtract
- Comparisions

Note that `Interval Multiple/Divide StringLiteral` is valid as them is not binary operators(the left and right are not of the same type). This PR also add tests for them.

### Why are the changes needed?

Avoid ambiguously implicit casting string literals to interval types.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion is not released yet.

### How was this patch tested?

New tests.

Closes #33737 from gengliangwang/disallowStringAndInterval.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ecdea91602)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-14 10:50:43 +08:00
gengjiaan eb840578f7 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly casted to DECIMAL(8,6) under ANSI mode.

```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```

We should update the function `make_timestamp` to allow integer type 'seconds' parameter.

### Why are the changes needed?
Make `make_timestamp` could accepts integer as 'seconds' parameter.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` could accepts integer as 'seconds' parameter.

### How was this patch tested?
New tests.

Closes #33665 from beliefer/SPARK-36428.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7d82336734)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 13:13:15 +08:00