Commit graph

4373 commits

Author SHA1 Message Date
Kent Yao 5945d46c11 [SPARK-31225][SQL] Override sql method of OuterReference
### What changes were proposed in this pull request?

OuterReference is one LeafExpression, so it's children is Nil, which makes its SQL representation always be outer(). This makes our explain-command and error msg unclear when OuterReference exists.
e.g.

```scala
org.apache.spark.sql.AnalysisException:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(in.`value` = max(outer()))]
Invalid expressions: [max(outer())];;
```
This PR override its `sql` method with its `prettyName` and single argment `e`'s `sql` methond

### Why are the changes needed?

improve err message

### Does this PR introduce any user-facing change?

yes, the err msg caused by OuterReference has changed
### How was this patch tested?

modified ut results

Closes #27985 from yaooqinn/SPARK-31225.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 15:21:19 +08:00
DB Tsai cb0db21373 [SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL][TEST-HIVE1.2] Nested Column Predicate Pushdown for Parquet
### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.

### Why are the changes needed?
Better performance for handling nested predicate pushdown.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New tests are added.

Closes #27728 from dbtsai/SPARK-17636.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 14:28:57 +08:00
Wenchen Fan 05498af72e [SPARK-31201][SQL] Add an individual config for skewed partition threshold
### What changes were proposed in this pull request?

Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so.

Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold.

This PR adds a new config for the threshold and set default value as 256 MB.

### Why are the changes needed?

Avoid skew join handling that may introduce a  perf regression.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27967 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-26 22:57:01 +09:00
HyukjinKwon 3bd10ce007 [SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type
### What changes were proposed in this pull request?

This PR targets for non-nullable null type not to coerce to nullable type in complex types.

Non-nullable fields in struct, elements in an array and entries in map can mean empty array, struct and map. They are empty so it does not need to force the nullability when we find common types.

This PR also reverts and supersedes d7b97a1d0d

### Why are the changes needed?

To make type coercion coherent and consistent. Currently, we correctly keep the nullability even between non-nullable fields:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array(lit(1)).cast(ArrayType(IntegerType, false))).printSchema()
spark.range(1).select(array(lit(1)).cast(ArrayType(DoubleType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(1), array(1)) as arr").printSchema()
```

### Does this PR introduce any user-facing change?

Yes.

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array().cast(ArrayType(IntegerType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(), array(1)) as arr").printSchema()
```

**Before:**

```
org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<null> to array<int>;;
'Project [cast(array() as array<int>) AS array()#68]
+- Range (0, 1, step=1, splits=Some(12))

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:149)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:330)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
```

```
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = true)
```

**After:**

```
root
 |-- array(): array (nullable = false)
 |    |-- element: integer (containsNull = false)
```

```
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = false)
```

### How was this patch tested?

Unittests were added and manually tested.

Closes #27991 from HyukjinKwon/SPARK-31227.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-26 15:42:54 +08:00
Maxim Gekk cec9604eae [SPARK-31237][SQL][TESTS] Replace 3-letter time zones by zone offsets
### What changes were proposed in this pull request?
In the PR, I propose to add a few `ZoneId` constant values to the `DateTimeTestUtils` object, and reuse the constants in tests. Proposed the following constants:
- PST = -08:00
- UTC = +00:00
- CEST = +02:00
- CET = +01:00
- JST = +09:00
- MIT = -09:30
- LA = America/Los_Angeles

### Why are the changes needed?
All proposed constant values (except `LA`) are initialized by zone offsets according to their definitions. This will allow to avoid:
- Using of 3-letter time zones that have been already deprecated in JDK, see _Three-letter time zone IDs_ in https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html
- Incorrect mapping of 3-letter time zones to zone offsets, see SPARK-31237. For example, `PST` is mapped to `America/Los_Angeles` instead of the `-08:00` zone offset.

Also this should improve stability and maintainability of test suites.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running affected test suites.

Closes #28001 from MaxGekk/replace-pst.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-26 13:36:00 +08:00
Wenchen Fan 4f274a4de9
[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables
### What changes were proposed in this pull request?

Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables.

However, this leads to confusing behaviors

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

### Why are the changes needed?

avoid confusing/wrong behavior

### Does this PR introduce any user-facing change?

yes, now users can't create/alter non-Hive tables with CHAR type.

### How was this patch tested?

new tests

Closes #27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-25 09:25:55 -07:00
Takeshi Yamamuro da49f50621
[SPARK-25121][SQL][FOLLOWUP] Add more unit tests for multi-part identifiers in join strategy hints
### What changes were proposed in this pull request?

This pr intends to add unit tests for the other join hints (`MERGEJOIN`, `SHUFFLE_HASH`, and `SHUFFLE_REPLICATE_NL`). This is a followup PR of #27935.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28013 from maropu/SPARK-25121-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-25 08:37:28 -07:00
Maxim Gekk 27d53de10f [SPARK-31232][SQL][DOCS] Specify formats of spark.sql.session.timeZone
### What changes were proposed in this pull request?
In the PR, I propose to update the doc for `spark.sql.session.timeZone`, and restrict format of config's values to 2 forms:
1. Geographical regions, such as `America/Los_Angeles`.
2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`.

### Why are the changes needed?
Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`, and manual testing.

Closes #27999 from MaxGekk/doc-session-time-zone.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-25 16:32:28 +08:00
samsetegne 44431d4b1a [SPARK-30822][SQL] Remove semicolon at the end of a sql query
# What changes were proposed in this pull request?
This change proposes ignoring a terminating semicolon from queries submitted by the user (if included) instead of raising a parse exception.

# Why are the changes needed?
When a user submits a directly executable SQL statement terminated with a semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. SQL-92 describes a direct SQL statement as having the format of `<directly executable statement> <semicolon>` and the majority of SQL implementations either require the semicolon as a statement terminator, or make it optional (meaning not raising an exception when it's included, seemingly in recognition that it's a common behavior).

# Does this PR introduce any user-facing change?
No

# How was this patch tested?
Unit test added to `PlanParserSuite`
```
sbt> project catalyst
sbt> testOnly *PlanParserSuite
[info] - case insensitive (565 milliseconds)
[info] - explain (9 milliseconds)
[info] - set operations (41 milliseconds)
[info] - common table expressions (31 milliseconds)
[info] - simple select query (47 milliseconds)
[info] - hive-style single-FROM statement (11 milliseconds)
[info] - multi select query (32 milliseconds)
[info] - query organization (41 milliseconds)
[info] - insert into (12 milliseconds)
[info] - aggregation (24 milliseconds)
[info] - limit (11 milliseconds)
[info] - window spec (11 milliseconds)
[info] - lateral view (17 milliseconds)
[info] - joins (62 milliseconds)
[info] - sampled relations (11 milliseconds)
[info] - sub-query (11 milliseconds)
[info] - scalar sub-query (9 milliseconds)
[info] - table reference (2 milliseconds)
[info] - table valued function (8 milliseconds)
[info] - SPARK-20311 range(N) as alias (2 milliseconds)
[info] - SPARK-20841 Support table column aliases in FROM clause (3 milliseconds)
[info] - SPARK-20962 Support subquery column aliases in FROM clause (4 milliseconds)
[info] - SPARK-20963 Support aliases for join relations in FROM clause (3 milliseconds)
[info] - inline table (23 milliseconds)
[info] - simple select query with !> and !< (5 milliseconds)
[info] - select hint syntax (34 milliseconds)
[info] - SPARK-20854: select hint syntax with expressions (12 milliseconds)
[info] - SPARK-20854: multiple hints (4 milliseconds)
[info] - TRIM function (16 milliseconds)
[info] - OVERLAY function (16 milliseconds)
[info] - precedence of set operations (18 milliseconds)
[info] - create/alter view as insert into table (4 milliseconds)
[info] - Invalid insert constructs in the query (10 milliseconds)
[info] - relation in v2 catalog (3 milliseconds)
[info] - CTE with column alias (2 milliseconds)
[info] - statement containing terminal semicolons (3 milliseconds)
[info] ScalaTest
[info] Run completed in 3 seconds, 129 milliseconds.
[info] Total number of tests run: 36
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 36, Failed 0, Errors 0, Passed 36
```

### Current behavior:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+

scala> val df = sql("select 1;")
// org.apache.spark.sql.catalyst.parser.ParseException:
// extraneous input ';' expecting <EOF>(line 1, pos 8)

// == SQL ==
// select 1;
// --------^^^

//   at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
//   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
//   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
//   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
//   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
//   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
//   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
//   ... 47 elided
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#|  1|
#+---+
#|  1|
#+---+

df = spark.sql('select 1;')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/ssetegne/spark/python/pyspark/sql/session.py", line 646, in sql
#     return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
#   File "/Users/ssetegne/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in # __call__
#   File "/Users/ssetegne/spark/python/pyspark/sql/utils.py", line 102, in deco
#     raise converted
# pyspark.sql.utils.ParseException:
# extraneous input ';' expecting <EOF>(line 1, pos 8)

# == SQL ==
# select 1;
# --------^^^
```

### Behavior after proposed fix:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+

scala> val df = sql("select 1;")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#|    1 |
#+---+
#|    1 |
#+---+

df = spark.sql('select 1;')
df.show()
#+---+
#|    1 |
#+---+
#|    1 |
#+---+
```

Closes #27567 from samsetegne/semicolon.

Authored-by: samsetegne <samuelsetegne@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-25 15:00:15 +08:00
yi.wu f6ff7d0cf8 [SPARK-30127][SQL] Support case class parameter for typed Scala UDF
### What changes were proposed in this pull request?

To support  case class parameter for typed Scala UDF, e.g.

```
case class TestData(key: Int, value: String)
val f = (d: TestData) => d.key * d.value.toInt
val myUdf = udf(f)
val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2")
checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil)
```

### Why are the changes needed?

Currently, Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if user want to apply an operation on one column, and the column is struct type. You must access data from a Row object, instead of domain object like Dataset operations. It will be great if UDF can work on types that are supported by Dataset, e.g. case class.

And here's benchmark result of using case class comparing to row:

```scala

// case class:  58ms 65ms 59ms 64ms 61ms
// row:         59ms 64ms 73ms 84ms 69ms
val f1 = (d: TestData) => s"${d.key}, ${d.value}"
val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}"
val udf1 = udf(f1)
// set spark.sql.legacy.allowUntypedScalaUDF=true
val udf2 = udf(f2, StringType)

val df = spark.range(100000).selectExpr("cast (id as int) as id")
    .select(struct('id, lit("str")).as("col"))
df.cache().collect()

// warmup to exclude some extra influence
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()

start = System.currentTimeMillis()
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

start = System.currentTimeMillis()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

```

### Does this PR introduce any user-facing change?

Yes. User now could be able to use typed Scala UDF with case class as input parameter.

### How was this patch tested?

Added unit tests.

Closes #27937 from Ngone51/udf_caseclass_support.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 23:03:57 +08:00
Maxim Gekk 1fd4607d84 [SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.

### Why are the changes needed?
The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.

### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 21:14:25 +08:00
HyukjinKwon bd324007d5 [SPARK-31229][SQL][TESTS] Add unit tests TypeCoercion.findTypeForComplex and Cast.canCast in null <> complex types
### What changes were proposed in this pull request?

This PR (SPARK-31229) is rather a followup of https://github.com/apache/spark/pull/27926 (SPARK-31166). It adds unittests for `TypeCoercion.findTypeForComplex` and `Cast.canCast` about struct, map and array with the respect to null types.

### Why are the changes needed?

To detect which scope was broken in the future easily.

### Does this PR introduce any user-facing change?

No, it's a test-only.

### How was this patch tested?

Unittests were added.

Closes #27990 from HyukjinKwon/SPARK-31166-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-24 14:10:59 +09:00
Wenchen Fan 1d0f54951e [SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding.

### Why are the changes needed?

avoid breaking changes

### Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

### How was this patch tested?

new tests.

Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 12:07:22 +08:00
yi.wu 5c4d44bb83 [SPARK-31190][SQL] ScalaReflection should not erasure user defined AnyVal type
### What changes were proposed in this pull request?

Improve `ScalaReflection` to only don't erasure non user defined `AnyVal` type, but still erasure other types, e.g. `Any`. And this brings two benefits:

1. Give better encode error message for some unsupported types, e.g. `Any`

2. Won't miss the walk path for the `AnyVal` type

### Why are the changes needed?

Firstly, PR #15284 added encode(serializeFor/deserializeFor) support for value class, which extends `AnyVal`, by not erasure types. But, this also introduce a problem that when user try to encoder unsupported types, e.g. `Any`, it will fail on `java.lang.ClassNotFoundException: scala.Any` due to the reason that `scala.Any` doesn't erasure to `java.lang.Object`.

Also, in current `getClassNameFromType()`, it always erasure types which could missing walked path for user defined `AnyVal` types.

### Does this PR introduce any user-facing change?

Yes. For the test below:

```
case class Bar(i: Any)
case class Foo(i: Bar) extends AnyVal

test() {
  implicitly[ExpressionEncoder[Foo]]
}
```

Before:

```
java.lang.ClassNotFoundException: scala.Any
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
 ...
````

After:
```
java.lang.UnsupportedOperationException: No Encoder found for Any
 - field (class: "java.lang.Object", name: "i")
 - field (class: "org.apache.spark.sql.catalyst.encoders.Bar", name: "i")
 - root class: "org.apache.spark.sql.catalyst.encoders.Foo"
 at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:561)
```

### How was this patch tested?

Added unit test and test manually.

Closes #27959 from Ngone51/impr_anyval.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 16:28:34 +08:00
Maxim Gekk db6247faa8 [SPARK-31211][SQL] Fix rebasing of 29 February of Julian leap years
### What changes were proposed in this pull request?
In the PR, I propose to fix the issue of rebasing leap years in Julian calendar to Proleptic Gregorian calendar in which the years are not leap years. In the Julian calendar, every four years is a leap year, with a leap day added to the month of February. In Proleptic Gregorian calendar, every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years, if they are exactly divisible by 400. In this ways, the date **1000-02-29** exists in the Julian calendar but not in Proleptic Gregorian calendar.

I modified the `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` by passing 1 as a day number of month while forming `LocalDate` or `LocalDateTime`, and adding the number of days using the `plusDays()` method. For example, **1000-02-29** doesn't exist in Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the `LocalDate.of(1000, 2, 1)` date and add 28 days. The `plusDays(28)` method produces the next valid date after `1000-02-28` which is **1000-03-01**.

### Why are the changes needed?
Before the changes, the `java.time.DateTimeException` exception is raised while loading the date `1000-02-29` from parquet files saved by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```
The parquet files were saved via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-02-29|
+----------+
```

### Does this PR introduce any user-facing change?
Yes, after the fix:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-03-01|
+----------+
```

### How was this patch tested?
Added tests to `DateTimeUtilsSuite`.

Closes #27974 from MaxGekk/julian-date-29-feb.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 14:21:24 +08:00
Kent Yao 88ae6c4481 [SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document
### What changes were proposed in this pull request?

Fix errors and missing parts for datetime pattern document
1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own.
2. Some pattern letters are missing
3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N')
4. the second fraction pattern different logic for parsing and formatting

### Why are the changes needed?

fix and improve doc
### Does this PR introduce any user-facing change?

yes, new and updated doc
### How was this patch tested?

pass Jenkins
viewed locally with `jekyll serve`
![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png)

Closes #27956 from yaooqinn/SPARK-31189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 21:59:26 +08:00
Maxim Gekk 4766a36647 [SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via **Avro** datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+
```

### How was this patch tested?
1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")

scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> :paste
// Entering paste mode (ctrl-D to finish)

  val timestampSchema = s"""
    |  {
    |    "namespace": "logical",
    |    "type": "record",
    |    "name": "test",
    |    "fields": [
    |      {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
    |    ]
    |  }
    |""".stripMargin

// Exiting paste mode, now interpreting.
scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")

```

2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. :
  - `rebasing microseconds timestamps in write`
  - `rebasing milliseconds timestamps in write`
  - `rebasing dates in write`

Closes #27953 from MaxGekk/rebase-avro-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 13:57:49 +08:00
Dongjoon Hyun f1cc86792f [SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases
### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side.

The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.

### Why are the changes needed?

This makes our test case more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 12:28:57 +08:00
Takeshi Yamamuro ca499e9409
[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution
### What changes were proposed in this pull request?

This pr fixed code to respect a database name for broadcast table hint resolution.
Currently, spark ignores a database name in multi-part names;
```
scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
```

This PR comes from https://github.com/apache/spark/pull/22198

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #27935 from maropu/SPARK-25121-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 20:11:04 -07:00
Wenchen Fan ac262cb272 [SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26933

Fraction string like "1.23" is definitely not a valid integral format and we should fail to do the cast under the ANSI mode.

### Why are the changes needed?

correct the ANSI cast behavior from string to integral

### Does this PR introduce any user-facing change?

Yes under ANSI mode, but ANSI mode is off by default.

### How was this patch tested?

new test

Closes #27957 from cloud-fan/ansi.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-20 00:52:09 +09:00
Maxim Gekk bb295d80e3 [SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.parquet.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulted local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does reverse conversion.
2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in Proleptic Gregorian calendar, interprets the resulted date as a local date in the hybrid calendar and gets the number of days since the epoch from the resulted local date. The conversion is performed via the `UTC` time zone because the conversion is independent from time zones, and `UTC` is selected to void round issues of casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` functions does revers conversion.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on.
5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`.
6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`.
7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`.

### Why are the changes needed?
- For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
- It fixes the bug of incorrect saving/loading timestamps of the `INT96` type

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 11:32:20.123456|
+--------------------------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
```

### How was this patch tested?
1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date")

scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96")
```
2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes):
```bash
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Read the saved date/timestamp by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```

Closes #27915 from MaxGekk/rebase-parquet-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-19 12:49:51 +08:00
Wenchen Fan 8643e5d9c5 [SPARK-31171][SQL][FOLLOWUP] update document
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/27936 to update document.

### Why are the changes needed?

correct document

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27950 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:29:31 +09:00
Kent Yao 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
### What changes were proposed in this pull request?

pattern `''` means literal `'`

```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
0946a9514f missed this case and this pr add it back.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #27949 from yaooqinn/SPARK-31150-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:27:06 +09:00
yi.wu 8bfaa62f2f [SPARK-31175][SQL] Avoid creating reverse comparator for each compare in InterpretedOrdering
### What changes were proposed in this pull request?

Prpend `-` to the compare result instead of creating a new reverse comparator for each compare when sorting in DESC order in InterpretedOrdering.

### Why are the changes needed?

Currently, we'll create a new reverse comparator for each compare in InterpretedOrdering, which could generate lots of small and instant object and hurt JVM when there're plenty of data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27938 from Ngone51/reverse_comparator.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 23:56:48 +08:00
Kent Yao 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern charactar
### What changes were proposed in this pull request?

The meaning of 'u' was day number of the week in SimpleDateFormat, it was changed to year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' to 'e' internally and use DateTimeFormatter to parse the pattern string. In DateTimeFormatter, the 'e' and 'c' also represents day-of-week. e.g.

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all goes to `.... eeee` silently. The users may congitive problems of their meanings, so we should mark them as illegal pattern characters to stay the same as before.

This pr move the method `convertIncompatiblePattern` from `DatetimeUtils` to `DateTimeFormatterHelper` object, since it is quite specific for `DateTimeFormatterHelper` class.
And 'e' and 'c' char checking in this method.

Besides,`convertIncompatiblePattern` has a bug that will lose the last `'` if it ends with it, this pr fixes this too. e.g.

```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S

spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?

avoid vagueness
bug fix

### Does this PR introduce any user-facing change?

no, these are not  exposed yet

### How was this patch tested?

add ut

Closes #27939 from yaooqinn/SPARK-31176.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 20:19:50 +08:00
Kent Yao f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
### What changes were proposed in this pull request?

```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>
```
We now only support datetime values as extract source for `extract` expression but it's alternative function `date_part` supports both datetime and interval.

This pr adds interval value support for `extract` expression as extract source

### Why are the changes needed?

For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.

### Does this PR introduce any user-facing change?

yes, in the `extract(abc from xyz)` expression, the `xyz` can be intervals

### How was this patch tested?

add unit tests

Closes #27876 from yaooqinn/SPARK-31119.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 12:29:39 +08:00
Wenchen Fan dc5ebc2d5b
[SPARK-31171][SQL] size(null) should return null under ansi mode
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?

In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:48:54 -07:00
Kent Yao 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.

### Why are the changes needed?

improve timestamp parsing and more compatible with 2.4.x

### Does this PR introduce any user-facing change?

no, the related changes are newly added
### How was this patch tested?

add uts

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 21:53:46 +08:00
Wenchen Fan d7b97a1d0d [SPARK-31166][SQL] UNION map<null, null> and other maps should not fail
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.

The reason is, `TypeCoercion` rules and `Cast` think it's illegal to cast null type map key to other types, as it makes the key nullable, but it's actually legal. This PR fixes it.

### Why are the changes needed?

To avoid breaking queries.

### Does this PR introduce any user-facing change?

Yes, now some queries that work in 2.x can work in 3.0 as well.

### How was this patch tested?

new test

Closes #27926 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 12:01:29 +08:00
HyukjinKwon 6704103499
[SPARK-31146][SQL] Leverage the helper method for aliasing in built-in SQL expressions
### What changes were proposed in this pull request?

This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable.

- `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default.
- When the naming is different in its implementation, it has to be overwritten for the expression specifically. E.g., `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`,
`FormatString` and `XPathDouble`.

This PR fixes the aliases of the functions below:

| class                    | alias            |
|--------------------------|------------------|
|`Rand`                    |`random`          |
|`Ceil`                    |`ceiling`         |
|`Remainder`               |`mod`             |
|`Pow`                     |`pow`             |
|`Signum`                  |`sign`            |
|`Chr`                     |`char`            |
|`Length`                  |`char_length`     |
|`Length`                  |`character_length`|
|`FormatString`            |`printf`          |
|`Substring`               |`substr`          |
|`Upper`                   |`ucase`           |
|`XPathDouble`             |`xpath_number`    |
|`DayOfMonth`              |`day`             |
|`CurrentTimestamp`        |`now`             |
|`Size`                    |`cardinality`     |
|`Sha1`                    |`sha`             |
|`CallMethodViaReflection` |`java_method`     |

Note: `EqualTo`, `=` and `==` aliases were excluded because it's unable to leverage this helper method. It should fix the parser.

Note: this PR also excludes some instances such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive` that needs an explicit name overwritten to make the scope of this PR smaller.

### Why are the changes needed?

To respect expression name.

### Does this PR introduce any user-facing change?

Yes, it will change the output column name.

### How was this patch tested?

Manually tested, and unittests were added.

Closes #27901 from HyukjinKwon/31146.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 11:22:34 -07:00
Wenchen Fan 50a29672e0 [SPARK-30958][SQL] do not set default era for DateTimeFormatter
### What changes were proposed in this pull request?

It's not needed at all as now we replace "y" with "u" if there is no "G". So the era is either explicitly specified (e.g. "yyyy G") or can be inferred from the year (e.g. "uuuu").

### Why are the changes needed?

By default we use "uuuu" as the year pattern, which indicates the era already. If we set a default era, it can get conflicted and fail the parsing.

### Does this PR introduce any user-facing change?

yea, now spark can parse date/timestamp with negative year via the "yyyy" pattern, which will be converted to "uuuu" under the hood.

### How was this patch tested?

new tests

Closes #27707 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-16 16:48:31 +09:00
Wenchen Fan b27b3c91f1 [SPARK-31090][SPARK-25457] Revert "IntegralDivide returns data type of the operands"
### What changes were proposed in this pull request?

This reverts commit 47d6e80a2e.

### Why are the changes needed?

There is no standard requiring that `div` must return the type of the operand, and always returning long type looks fine. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `div` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27835 from cloud-fan/revert2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-13 10:47:36 +09:00
Kent Yao 7b4b29e8d9
[SPARK-31131][SQL] Remove the unnecessary config spark.sql.legacy.timeParser.enabled
### What changes were proposed in this pull request?

spark.sql.legacy.timeParser.enabled should be removed from SQLConf and the migration guide
spark.sql.legacy.timeParsePolicy is the right one

### Why are the changes needed?

fix doc

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Pass the jenkins

Closes #27889 from yaooqinn/SPARK-31131.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-12 09:24:49 -07:00
Wenchen Fan 77c49cb702 [SPARK-31124][SQL] change the default value of minPartitionNum in AQE
### What changes were proposed in this pull request?

AQE has a perf regression when using the default settings: if we coalesce the shuffle partitions into one or few partitions, we may leave many CPU cores idle and the perf is worse than with AQE off (which leverages all CPU cores).

Technically, this is not a bad thing. If there are many queries running at the same time, it's better to coalesce shuffle partitions into fewer partitions. However, the default settings of AQE should try to avoid any perf regression as possible as we can.

This PR changes the default value of minPartitionNum when coalescing shuffle partitions, to be `SparkContext.defaultParallelism`, so that AQE can leverage all the CPU cores.

### Why are the changes needed?

avoid AQE perf regression

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #27879 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-12 21:28:24 +08:00
yi.wu feb9b9e771 [SPARK-31010][SQL][FOLLOW-UP] Give an example for typed Scala UDF in error message
### What changes were proposed in this pull request?

In the error message, adding an example for typed Scala UDF.

### Why are the changes needed?

Help user to know how to migrate to typed Scala UDF.

### Does this PR introduce any user-facing change?

No, it's a new error message in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27884 from Ngone51/spark_31010_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-12 21:16:02 +09:00
Wenchen Fan 8efb71013d
[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066

### Why are the changes needed?

There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-11 09:55:24 -07:00
Wenchen Fan 5be0d04f16 [SPARK-31117][SQL][TEST] reduce the test time of DateTimeUtilsSuite
### What changes were proposed in this pull request?

`DateTimeUtilsSuite.daysToMicros and microsToDays` takes 30 seconds, which is too long for a UT.

This PR changes the test to check random data, to reduce testing time. Now this test takes 1 second.

### Why are the changes needed?

make test faster

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27873 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 23:47:13 +08:00
Maxim Gekk 3d3e366aa8 [SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
### What changes were proposed in this pull request?
In the PR, I propose to change conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before cutover day `1582-10-15` of Gregorian calendar. I propose to construct local date-time from microseconds/days since the epoch. Take each date-time component `year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`, and construct java.sql.Timestamp/Date using the extracted components.

### Why are the changes needed?
This will rebase underlying time/date offset in the way that collected java.sql.Timestamp/Date values will have the same local time-date component as the original values in Gregorian calendar.

Here is the example which demonstrates the issue:
```sql
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

### Does this PR introduce any user-facing change?
Yes, after the changes:
```sql
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 20:53:56 +08:00
Liang-Chi Hsieh 15557a7d05 [SPARK-31071][SQL] Allow annotating non-null fields when encoding Java Beans
### What changes were proposed in this pull request?

When encoding Java Beans to Spark DataFrame, respecting `javax.annotation.Nonnull` and producing non-null fields.

### Why are the changes needed?

When encoding Java Beans to Spark DataFrame, non-primitive types are encoded as nullable fields. Although It works for most cases, it can be an issue under a few situations, e.g. the one described in the JIRA ticket when saving DataFrame to Avro format with non-null field.

We should allow Spark users more flexibility when creating Spark DataFrame from Java Beans. Currently, Spark users cannot create DataFrame with non-nullable fields in the schema from beans with non-nullable properties.

Although it is possible to project top-level columns with SQL expressions like `AssertNotNull` to make it non-null, for nested fields it is more tricky to do it similarly.

### Does this PR introduce any user-facing change?

Yes. After this change, Spark users can use `javax.annotation.Nonnull` to annotate non-null fields in Java Beans when encoding beans to Spark DataFrame.

### How was this patch tested?

Added unit test.

Closes #27851 from viirya/SPARK-31071.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 18:27:48 +08:00
Yuanjian Li 3493162c78 [SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime
### What changes were proposed in this pull request?
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology ). The switching job is completed in SPARK-26651.
But after the switching, there are some patterns not compatible between Java 8 and Java 7, Spark needs its own definition on the patterns rather than depends on Java API.
In this PR, we achieve this by writing the document and shadow the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030)

### Why are the changes needed?
For backward compatibility.

### Does this PR introduce any user-facing change?
No.
After we define our own datetime parsing and formatting patterns, it's same to old Spark version.

### How was this patch tested?
Existing and new added UT.
Locally document test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 14:11:13 +08:00
Kent Yao 3bd6ebff81 [SPARK-30189][SQL] Interval from year-month/date-time string should handle whitespaces
### What changes were proposed in this pull request?

Currently, we parse interval from multi units strings or from date-time/year-month pattern strings, the former handles all whitespace, the latter not or even spaces.

### Why are the changes needed?

behavior consistency

### Does this PR introduce any user-facing change?
yes, interval in date-time/year-month like
```
select interval '\n-\t10\t 12:34:46.789\t' day to second
-- !query 126 schema
struct<INTERVAL '-10 days -12 hours -34 minutes -46.789 seconds':interval>
-- !query 126 output
-10 days -12 hours -34 minutes -46.789 seconds
```
is valid now.

### How was this patch tested?

add ut.

Closes #26815 from yaooqinn/SPARK-30189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 22:08:58 +08:00
Eric Wu 15df2a3f40 [SPARK-31079][SQL] Logging QueryExecutionMetering in RuleExecutor logger
### What changes were proposed in this pull request?
RuleExecutor already support metering for analyzer/optimizer rules. By providing such information in `PlanChangeLogger`, user can get more information when debugging rule changes .

This PR enhanced `PlanChangeLogger` to display RuleExecutor metrics. This can be easily done by calling the existing API `resetMetrics` and `dumpTimeSpent`, but there might be conflicts if user is also collecting total metrics of a sql job. Thus I introduced `QueryExecutionMetrics`, as the snapshot of `QueryExecutionMetering`, to better support this feature.

Information added to `PlanChangeLogger`
```
=== Metrics of Executed Rules ===
Total number of runs: 554
Total time: 0.107756568 seconds
Total number of effective runs: 11
Total time of effective runs: 0.047615486 seconds
```

### Why are the changes needed?
Provide better plan change debugging user experience

### Does this PR introduce any user-facing change?
Only add more debugging info of `planChangeLog`, default log level is TRACE.

### How was this patch tested?
Update existing tests to verify the new logs

Closes #27846 from Eric5553/ExplainRuleExecMetrics.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 19:08:59 +08:00
HyukjinKwon 815c7929c2
[SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
### What changes were proposed in this pull request?

This PR proposes two things:

1. Convert `null` to `string` type during schema inference of `schema_of_json` as JSON datasource does. This is a bug fix as well because `null` string is not the proper DDL formatted string and it is unable for SQL parser to recognise it as a type string. We should match it to JSON datasource and return a string type so `schema_of_json` returns a proper DDL formatted string.

2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference.

### Why are the changes needed?

To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option.

### Does this PR introduce any user-facing change?
Yes, it does.

```scala
import collection.JavaConverters._
import org.apache.spark.sql.functions._

spark.range(1).select(schema_of_json(lit("""{"id": ""}"""))).show()
spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false)
```

**Before:**

```
struct<id:null>
struct<drop:struct<drop:null>,id:string>
```

**After:**

```
struct<id:string>
struct<id:string>
```

### How was this patch tested?

Manually tested, and unittests were added.

Closes #27854 from HyukjinKwon/SPARK-31065.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-10 00:33:32 -07:00
Yuchen Huo a22994333a [SPARK-30902][SQL][FOLLOW-UP] Allow ReplaceTableAsStatement to have none provider
### What changes were proposed in this pull request?

This is a follow up for https://github.com/apache/spark/pull/27650 where allow None provider for create table. Here we are doing the same thing for ReplaceTable.

### Why are the changes needed?

Although currently the ASTBuilder doesn't seem to allow `replace` without `USING` clause. This would allow `DataFrameWriterV2` to use the statements instead of commands directly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #27838 from yuchenhuo/SPARK-30902.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 11:37:31 +08:00
Wenchen Fan 1aa184763a
[SPARK-31053][SQL] mark connector APIs as Evolving
### What changes were proposed in this pull request?

The newly added catalog APIs are marked as Experimental but other DS v2 APIs are marked as Evolving.

This PR makes it consistent and mark all Connector APIs as Evolving.

### Why are the changes needed?

For consistency.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27811 from cloud-fan/tag.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-08 11:41:09 -07:00
beliefer f8a3730fd7 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691, https://github.com/apache/spark/pull/27730 and https://github.com/apache/spark/pull/27770
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.redaction.options.regex | 2.2.2 | SPARK-23850 | 6a55d8b03053e616dcacb79cd2c29a06d219dc32#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.redaction.string.regex | 2.3.0 | SPARK-22791 | 28315714ddef3ddcc192375e98dd5207cf4ecc98#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.concatBinaryAsString | 2.3.0 | SPARK-22771 | f2b3525c17d660cf6f082bbafea8632615b4f58e#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.eltOutputAsString | 2.3.0 | SPARK-22937 | bf853018cabcd3b3abf84bfe534d2981020b4a71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.validatePartitionColumns | 3.0.0 | SPARK-26263 | 5a140b7844936cf2b65f08853b8cfd8c499d4f13#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.epochBacklogQueueSize | 3.0.0 | SPARK-24063 | c4bbfd177b4e7cb46f47b39df9fd71d2d9a12c6d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorQueueSize | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorPollIntervalMs | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.useV1SourceList | 3.0.0 | SPARK-28747 | cb06209fc908bac6ce6a8f20653865489773cbc3#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.disabledV2Writers | 2.3.1 | SPARK-23196 | 588b9694c1967ff45774431441e84081ee6eb515#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.disabledV2MicroBatchReaders | 2.4.0 | SPARK-23362 | 0a73aa31f41c83503d5d99eff3c9d7b406014ab3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.partitionOverwriteMode | 2.3.0 | SPARK-20236 | b96248862589bae1ddcdb14ce4c802789a001306#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.storeAssignmentPolicy | 3.0.0 | SPARK-28730 | 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.ansi.enabled | 3.0.0 | SPARK-30125 | d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.sortBeforeRepartition | 2.1.4 | SPARK-23207 and SPARK-22905 and SPARK-24564 and SPARK-25114 | 4d2d3d47e00e78893b1ecd5a9a9070adc5243ac9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.nestedSchemaPruning.enabled | 2.4.1 | SPARK-4502 | dfcff38394929970fee454c69864d0e10d59f8d4#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.serializer.nestedSchemaPruning.enabled | 3.0.0 | SPARK-26837 | 0f2c0b53e8fb18c86c67b5dd679c006db93f94a5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.expression.nestedPruning.enabled | 3.0.0 | SPARK-27707 | 127bc899ae78d73332a87f0972b5db3c9936c1f1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.topKSortFallbackThreshold | 2.4.0 | SPARK-24193 | 8a837bf4f3f2758f7825d2362cf9de209026651a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.parser.columnPruning.enabled | 2.4.0 | SPARK-24244 and SPARK-24368 | 64fad0b519cf35b8c0a0dec18dd3df9488a5ed25#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.repl.eagerEval.enabled | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.maxNumRows | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.truncate | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.codegen.aggregate.fastHashMap.capacityBit | 2.4.0 | SPARK-24978 | 6193a202aab0271b4532ee4b740318290f2c44a1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.sizeOfNull | 2.4.0 | SPARK-24605 | d08f53dc61f662f5291f71bcbe1a7b9f531a34d2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setopsPrecedence.enabled | 2.4.0 | SPARK-24966 | 73dd6cf9b558f9d752e1f3c13584344257ad7863#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.exponentLiteralAsDecimal.enabled | 3.0.0 | SPARK-29956 | 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowNegativeScaleOfDecimal | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createHiveTableByDefault.enabled | 3.0.0 | SPARK-30098 | 58be82ad4b98fc17e821e916e69e77a6aa36209d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.integralDivide.returnBigint | 3.0.0 | SPARK-25457 | 47d6e80a2e64823fabb596503fb6a6cc6f51f713#diff-9a6b543db706f1a90f790783d6930a13 | Exists in branch-3.0 branch, but the pom.xml file corresponding to the commit log is 2.5.0-SNAPSHOT
spark.sql.legacy.bucketedTableScan.outputOrdering | 3.0.0 | SPARK-28595 | 469423f33887a966aaa33eb75f5e7974a0a97beb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.parser.havingWithoutGroupByAsWhere | 2.4.1 | SPARK-25708 | 3dba5d41f1a66ae5eb08404d103284110c45a351#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.json.allowEmptyString.enabled | 3.0.0 | SPARK-25040 | d3de7568f32e298442f07b0a28b2c906de72c797#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createEmptyCollectionUsingStringType | 3.0.0 | SPARK-30790 | 8ab6ae3ede96adb093347470a5cbbf17fe8c04e9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowUntypedScalaUDF | 3.0.0 | SPARK-26580 | bc30a07ce262840c99a752db4fbd3a423f652017#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.truncateTable.ignorePermissionAcl.enabled | 2.4.6 | SPARK-30312 | 830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue | 3.0.0 | SPARK-26085 | ab2eafb3cdc7631452650c6cac03a92629255347#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.debug.maxToStringFields | 3.0.0 | SPARK-26066 | 81550b38e43fb20f89f529d2127575c71a54a538#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maxPlanStringLength | 3.0.0 | SPARK-26103 | 812ad5546148d2194ab0e4230ee85b8f6a5be2fb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setCommandRejectsSparkCoreConfs | 3.0.0 | SPARK-26060 | 1ab3d3e474ce2e36d58aea8ad09fb61f0c73e5c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.datetime.java8API.enabled | 3.0.0 | SPARK-27008 | 52671d631d2a64ed1cfa0c6e01168908faf92df8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.binaryFile.maxLength | 3.0.0 | SPARK-27588 | 618d6bff71073c8c93501ab7392c3cc579730f0b#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.typeCoercion.datetimeToString.enabled | 3.0.0 | SPARK-27638 | 83d289eef492de8c7f3e5145f9bd75431608b500#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultCatalog | 3.0.0 | SPARK-29753 | 942753a44beeae5f0142ceefa307e90cbc1234c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.catalog.$SESSION_CATALOG_NAME | 3.0.0 | SPARK-29412 | 9407fba0375675d6ee6461253f3b8230e8d67509#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.doLooseUpcast | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.ctePrecedencePolicy | 3.0.0 | SPARK-30829 | 00943be81afbca6be13e1e72b24536cd98a788d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParserPolicy | 3.1.0 | SPARK-30668 | 7db0af578585ecaeee9fd23f8189292289b52a97#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.followThreeValuedLogicInArrayExists | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maven.additionalRemoteRepositories | 3.0.0 | SPARK-29175 | 3d7359ad4202067b26a199657b6a3e1f38be0e4d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.fromDayTimeString.enabled | 3.0.0 | SPARK-29864 and SPARK-29920 | e933539cdd557297daf97ff5e532a3f098896979#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.notReserveProperties | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.addSingleFileInAddFile | 3.0.0 | SPARK-30234 | 8a8d1fbb10af6da481f26831cd519ef46ccbce6c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.mssqlserver.numericMapping.enabled | 2.4.5 | SPARK-28152 | 69de7f31c37a7e0298e66cc814afc1b0aa948bbb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.filterPushdown.enabled | 3.0.0 | SPARK-30323 | 4e50f0291f032b4a5c0b46ed01fdef14e4cbb050#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.addPartitionInBatch.size | 3.0.0 | SPARK-29938 | 5ccbb38a71890b114c707279e7395d1f6284ebfd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParser.enabled | 3.0.0 | SPARK-30668 | 92f57237871400ab9d499e1174af22a867c01988#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowDuplicatedMapKeys | 3.0.0 | SPARK-25829 | 33329caa81827a245b84158b13234b88a4746e56#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27829 from beliefer/add-version-to-sql-config-part-four.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-08 12:33:02 +09:00
iRakson cba17e07e9 [SPARK-30899][SQL] CreateArray/CreateMap's data type should not depend on SQLConf.get
### What changes were proposed in this pull request?
Introduced a new parameter `emptyCollection` for `CreateMap` and `CreateArray` functiion to remove dependency on SQLConf.get.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27657 from iRakson/SPARK-30899.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 16:45:06 +08:00
Takeshi Yamamuro 71c73d58f6 [SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID
### What changes were proposed in this pull request?

This pr intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur to compute grouping IDs;
e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)

For example, the query below generates wrong grouping IDs in the master;
```

scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")

// numCols = 32
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|0            |3     | // Wrong Grouping ID
+-------------+------+

// numCols = 31
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|1            |3     |
+-------------+------+
```
To fix this issue, this pr change code to use long values for `GROUPING_ID` instead of int values.
### Why are the changes needed?

To support more cases in `GROUPING_ID`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #26918 from maropu/FixGroupingIdIssue.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-06 16:57:03 +09:00
Maxim Gekk 59f1e76b82 [SPARK-31020][SPARK-31023][SPARK-31025][SPARK-31044][SQL] Support foldable args by from_csv/json and schema_of_csv/json
### What changes were proposed in this pull request?
In the PR, I propose:

1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` to checking foldable property of the `schema` expression.
2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` to checking foldable property of the `schema` expression.
3. To change checking of the input parameter in the `SchemaOfCsv` expression, and allow foldable `child` expression.
4. To change checking of the input parameter in the `SchemaOfJson` expression, and allow foldable `child` expression.

### Why are the changes needed?
This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
```
and only string literals are acceptable as CSV examples by `schema_of_csv`/`schema_of_json`:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7;
'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 as string))), None)]
+- OneRowRelation
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7;
'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)]
+- OneRowRelation
```

### Does this PR introduce any user-facing change?
Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
```
After change the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
struct<_c0:double,_c1:int>
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
struct<id:bigint,price:double>
```

### How was this patch tested?
Added new test to `CsvFunctionsSuite` and to `JsonFunctionsSuite`.

Closes #27804 from MaxGekk/foldable-arg-csv-json-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 12:29:35 +08:00
Dongjoon Hyun afb84e9d37 [SPARK-30886][SQL] Deprecate two-parameter TRIM/LTRIM/RTRIM functions
### What changes were proposed in this pull request?

This PR aims to show a deprecation warning on two-parameter TRIM/LTRIM/RTRIM function usages based on the community decision.
- https://lists.apache.org/thread.html/r48b6c2596ab06206b7b7fd4bbafd4099dccd4e2cf9801aaa9034c418%40%3Cdev.spark.apache.org%3E

### Why are the changes needed?

For backward compatibility, SPARK-28093 is reverted. However, from Apache Spark 3.0.0, we should give a safe guideline to use SQL syntax instead of the esoteric function signatures.

### Does this PR introduce any user-facing change?

Yes. This shows a directional warning.

### How was this patch tested?

Pass the Jenkins with a newly added test case.

Closes #27643 from dongjoon-hyun/SPARK-30886.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-05 20:09:39 -08:00
maryannxue d705d36c0c [SPARK-31045][SQL] Add config for AQE logging level
### What changes were proposed in this pull request?
This PR adds an internal config for changing the logging level of adaptive execution query plan evolvement.

### Why are the changes needed?
To make AQE debugging easier.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #27798 from maryannxue/aqe-log-level.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 11:41:45 +08:00
beliefer d9254b26f1 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691 and https://github.com/apache/spark/pull/27730
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.execution.useObjectHashAggregateExec | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.jsonGenerator.ignoreNullFields | 3.0.0 | SPARK-29444 | 78b0cbe265c4e8cc3d4d8bf5d734f2998c04d376#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.fileSink.log.deletion | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.compactInterval | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.cleanupDelay | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.deletion | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.compactInterval | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.cleanupDelay | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.schema.forceNullable | 3.0.0 | SPARK-28651 | 5bb69945e4aaf519cd10a5c5083332f618039af0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.fileSource.cleaner.numThreads | 3.0.0 | SPARK-29876 | abf759a91e01497586b8bb6b7a314dd28fd6cff1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.schemaInference | 2.0.0 | SPARK-15458 | 1fb7b3a0a2e3a5c5f784aab662df93fcc1449c36#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.pollingDelay | 2.0.0 | SPARK-16002 | afa14b71b28d788c53816bd2616ccff0c3967f40#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.stopTimeout | 3.0.0 | SPARK-30143 | 4c37a8a3f4a489b52f1919d2db84f6e32c6a05cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.noDataProgressEventInterval | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.noDataMicroBatches.enabled | 2.4.1 | SPARK-24157 | 535bf1cc9e6b54df7059ac3109b8cba30057d040#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.metricsEnabled | 2.0.2 | SPARK-17731 | 881e0eb05782ea74cf92a62954466b14ea9e05b6#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.numRecentProgressUpdates | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.checkpointFileManagerClass | 2.4.0 | SPARK-23966 | cbb41a0c5b01579c85f06ef42cc0585fbef216c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.checkpoint.escapedPathCheck.enabled | 3.0.0 | SPARK-26824 | 77b99af57330cf2e5016a6acc69642d54041b041#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.parallelFileListingInStatsComputation.enabled | 2.4.1 | SPARK-24626 | f11f44548903bbab7ab764574d6bed326cf4cd8d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultSizeInBytes | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.statistics.fallBackToHdfs | 2.0.0 | SPARK-15960 | 5c53442cc098dd618ba1430962727c74b2de2e68#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.statistics.ndv.maxError | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.statistics.histogram.enabled | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.histogram.numBins | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.percentile.accuracy | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.size.autoUpdate.enabled | 2.3.0 | SPARK-21127 | d5202259d9aa9ad95d572af253bf4a722b7b437a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.planStats.enabled | 3.0.0 | SPARK-24690 | 3f3a18fff116a02ff7996d45a1061f48a2de3102#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.joinReorder.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.threshold | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.card.weight | 2.2.0 | SPARK-19915 | c083b6b7dec337d680b54dabeaa40e7a0f69ae69#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.star.filter | 2.2.0 | SPARK-20233 | fbe4216e1e83d243a7f0521b76bfb20c25278281#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starSchemaDetection | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starJoinFTRatio | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.session.timeZone | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.windowExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.windowExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.quotedRegexColumnNames | 2.3.0 | SPARK-12139 | 2cbfc975ba937a4eb761de7a6473b7747941f386#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.rangeExchange.sampleSizePerPartition | 2.3.0 | SPARK-22160 | 323806e68f91f3c7521327186a37ddd1436267d0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.sparkr.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.fallback.enabled | 2.4.0 | SPARK-23380 | d6632d185e147fcbe6724545488ad80dce20277e#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.fallback.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.maxRecordsPerBatch | 2.3.0 | SPARK-13534 | d03aebbe6508ba441dc87f9546f27aeb27553d77#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.pandas.udf.buffer.size | 3.1.0 | SPARK-27870 | 692e3ddb4e517638156f7427ade8b62fb37634a7#diff-9a6b543db706f1a90f790783d6930a13 | Exists in master, not branch-3.0
spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName | 2.4.1 | SPARK-24324 | 3f203050ac764516e68fb43628bba0df5963e44d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.pandas.convertToArrowArraySafely | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.replaceExceptWithFilter | 2.3.0 | SPARK-22181 | 01f6ba0e7a12ef818d56e7d5b1bd889b79f2b57c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.decimalOperations.allowPrecisionLoss | 2.3.1 | SPARK-22036 | 8a98274823a4671cee85081dd19f40146e736325#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.literal.pickMinimumPrecision | 2.3.3 | SPARK-25454 | 26d893a4f64de18222942568f7735114447a6ab7#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27770 from beliefer/add-version-to-sql-config-part-three.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 11:19:08 +09:00
HyukjinKwon fc12165f48 [SPARK-31036][SQL] Use stringArgs in Expression.toString to respect hidden parameters
### What changes were proposed in this pull request?

This PR proposes to respect hidden parameters by using `stringArgs`  in `Expression.toString `. By this, we can show the strings properly in some cases such as `NonSQLExpression`.

### Why are the changes needed?

To respect "hidden" arguments in the string representation.

### Does this PR introduce any user-facing change?

Yes, for example, on the top of https://github.com/apache/spark/pull/27657,

```scala
val identify = udf((input: Seq[Int]) => input)
spark.range(10).select(identify(array("id"))).show()
```

shows hidden parameter `useStringTypeWhenEmpty`.

```
+---------------------+
|UDF(array(id, false))|
+---------------------+
|                  [0]|
|                  [1]|
...
```

whereas:

```scala
spark.range(10).select(array("id")).show()
```

```
+---------+
|array(id)|
+---------+
|      [0]|
|      [1]|
...
```

### How was this patch tested?

Manually tested as below:

```scala
val identify = udf((input: Boolean) => input)
spark.range(10).select(identify(exists(array(col("id")), _ % 2 === 0))).show()
```

Before:

```
+-------------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false), true))|
+-------------------------------------------------------------------------------------+
|                                                                                 true|
|                                                                                false|
|                                                                                 true|
...
```

After:

```
+-------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false)))|
+-------------------------------------------------------------------------------+
|                                                                           true|
|                                                                          false|
|                                                                           true|
...
```

Closes #27788 from HyukjinKwon/arguments-str-repr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 10:33:20 +09:00
DB Tsai fe126a6a05 [SPARK-31058][SQL][TEST-HIVE1.2] Consolidate the implementation of quoteIfNeeded
### What changes were proposed in this pull request?
There are two implementation of quoteIfNeeded.  One is in `org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote` and the other is in `OrcFiltersBase.quoteAttributeNameIfNeeded`. This PR will consolidate them into one.

### Why are the changes needed?
Simplify the codebase.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27814 from dbtsai/SPARK-31058.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-03-06 00:13:57 +00:00
Wenchen Fan ba86524b25 [SPARK-31037][SQL] refine AQE config names
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 00:46:34 +08:00
Maxim Gekk 1fd9a91c66 [SPARK-31005][SQL] Support time zone ids in casting strings to timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of input string. After the changes, the function accepts zone ids in the formats:
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starts with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).

### Why are the changes needed?
- To use `stringToTimestamp` as a substitution of removed `stringToTime`, see https://github.com/apache/spark/pull/27710#discussion_r385020173
- Improve UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets that can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?
Yes. After the changes, casting strings to timestamps allows time zone id at the end of the strings:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?
- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Run `CastSuite` and `AnsiCastSuite`.

Closes #27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 20:49:43 +08:00
Wenchen Fan 807ea413b4 [SPARK-31019][SQL] make it clear that people can deduplicate map keys
### What changes were proposed in this pull request?

rename the config and make it non-internal.

### Why are the changes needed?

Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate it. However, we must provide a way to get users out of this situation, instead of just rejecting to run the query. This exit strategy should always be there, while legacy config indicates that it may be removed someday.

### Does this PR introduce any user-facing change?

no, just rename a config which was added in 3.0

### How was this patch tested?

add more tests for the fail behavior.

Closes #27772 from cloud-fan/map.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-05 20:43:52 +09:00
Kent Yao f45ae7f2c5 [SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone
### What changes were proposed in this pull request?

The `spark.sql.session.timeZone` config can accept any string value including invalid time zone ids, then it will fail other queries that rely on the time zone. We should do the value checking in the set phase and fail fast if the zone value is invalid.

### Why are the changes needed?

improve configuration
### Does this PR introduce any user-facing change?

yes, will fail fast if the value is a wrong timezone id
### How was this patch tested?

add ut

Closes #27792 from yaooqinn/SPARK-31038.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 19:38:20 +08:00
Terry Kim 66b4fd040e [SPARK-31024][SQL] Allow specifying session catalog name spark_catalog in qualified column names for v1 tables
### What changes were proposed in this pull request?

Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i`.

This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id
```

This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.

### Why are the changes needed?

Fixing an inconsistent behavior.

### Does this PR introduce any user-facing change?

Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```

### How was this patch tested?

Added new tests.

Closes #27776 from imback82/spark_catalog_col_name_resolution.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:33:59 +08:00
Yuanjian Li 7db0af5785 [SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter
### What changes were proposed in this pull request?
This is a follow-up work for #27441. For the cases of new TimestampFormatter return null while legacy formatter can return a value, we need to throw an exception instead of silent change. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid silent result change for new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect legacy formatter can parse the string and the new formatter return null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 15:29:39 +08:00
Terry Kim b30278107f [SPARK-30885][SQL][FOLLOW-UP] Fix issues where some V1 commands allow tables that are not fully qualified
### What changes were proposed in this pull request?

There are few V1 commands such as `REFRESH TABLE` that still allow `spark_catalog.t` because they run the commands with parsed table names without trying to load them in the catalog. This PR addresses this issue.

The PR also addresses the issue brought up in https://github.com/apache/spark/pull/27642#discussion_r382402104.

### Why are the changes needed?

To fix a bug where for some V1 commands, `spark_catalog.t` is allowed.

### Does this PR introduce any user-facing change?

Yes, a bug is fixed and `REFRESH TABLE spark_catalog.t` is not allowed.

### How was this patch tested?

Added new test.

Closes #27718 from imback82/fix_TempViewOrV1Table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 18:09:48 +08:00
Wenchen Fan e4c61e35da [SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser
### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser has a special behavior that, when the parser fails to parse a timestamp/date, fallback to another way to parse it, to support some legacy format. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switch the API to do datetime operations, we can't be exactly the same as before. Here we add back the support of the legacy formats that are common (examples of Spark 2.4):
1. the fields can have one or two letters
```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22]                      |
+-------------------------------------------+
```
2. the separator between data and time can be "T" as well
```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12]                        |
+---------------------------------------------+
```
3. the second fraction can be arbitrary length
```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123]                                 |
+----------------------------------------------------------+
```
4. date string can end up with any chars after "T" or space
```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22]                            |
+----------------------------------------+
```
5. remove "GMT" from the string before parsing
```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123]                         |
+--------------------------------------------------+
```
### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

yes, to make 3.0 support parsing most of the csv/json values that were supported before.

### How was this patch tested?

new tests

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 18:27:44 +09:00
Takeshi Yamamuro 4a1d273a4a [SPARK-30997][SQL] Fix an analysis failure in generators with aggregate functions
### What changes were proposed in this pull request?

We have supported generators in SQL aggregate expressions by SPARK-28782.
But, the generator(explode) query with aggregate functions in DataFrame failed as follows;

```
// SPARK-28782: Generator support in aggregate expressions
scala> spark.range(3).toDF("id").createOrReplaceTempView("t")
scala> sql("select explode(array(min(id), max(id))) from t").show()
+---+
|col|
+---+
|  0|
|  2|
+---+

// A failure case handled in this pr
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
org.apache.spark.sql.AnalysisException:
The query operator `Generate` contains one or more unsupported
expression types Aggregate, Window or Generate.
Invalid expressions: [min(`id`), max(`id`)];;
Project [col#46L]
+- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L]
   +- Range (0, 3, step=1, splits=Some(4))

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129)
```

The root cause is that `ExtractGenerator` wrongly replaces a project w/ aggregate functions
before `GlobalAggregates` replaces it with an aggregate as follows;

```
scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn")
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#72L), max(id#72L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Project [explode(array(min(id#72L), max(id#72L))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                         +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Result of Batch Resolution ===
!'Project [explode(array(min('id), max('id))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                   +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                             +- Range (0, 3, step=1, splits=Some(4))

// the analysis failed here...
```

To avoid the case in `ExtractGenerator`, this pr addes a condition to ignore generators having aggregate functions.
A correct sequence of rules is as follows;

```
20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [explode(array(min(id#27L), max(id#27L))) AS List()]   'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                         +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]   'Project [explode(_gen_input_0#31) AS List()]
!+- Range (0, 3, step=1, splits=Some(4))                           +- Aggregate [array(min(id#27L), max(id#27L)) AS _gen_input_0#31]
!                                                                     +- Range (0, 3, step=1, splits=Some(4))

```

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27749 from maropu/ExplodeInAggregate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-03 12:25:12 -08:00
Terry Kim c263c15408 [SPARK-31015][SQL] Star(*) expression fails when used with qualified column names for v2 tables
### What changes were proposed in this pull request?

For a v2 table created with `CREATE TABLE testcat.ns1.ns2.tbl (id bigint, name string) USING foo`, the following works as expected
```
SELECT testcat.ns1.ns2.tbl.id FROM testcat.ns1.ns2.tbl
```
, but a query with qualified column name with star(*)
```
SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl
[info]   org.apache.spark.sql.AnalysisException: cannot resolve 'testcat.ns1.ns2.tbl.*' given input columns 'id, name';
```
fails to resolve. And this PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug as describe above.

### Does this PR introduce any user-facing change?

Yes, now `SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl` works as expected.

### How was this patch tested?

Added new test.

Closes #27766 from imback82/fix_star_expression.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 00:55:26 +08:00
Takeshi Yamamuro 313e62c376 [SPARK-30998][SQL] ClassCastException when a generator having nested inner generators
### What changes were proposed in this pull request?

A query below failed in the master;

```
scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show()
20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    ...
```

This pr modified the `hasNestedGenerator` code in `ExtractGenerator` for correctly catching nested inner generators.

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27750 from maropu/HandleNestedGenerators.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-03 19:00:33 +09:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()`then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
Jungtaek Lim (HeartSaVioR) f24a46011c [SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable
### What changes were proposed in this pull request?

This patch fixes the bug of UnsafeRow which misses to handle the UDT specifically, in `isFixedLength` and `isMutable`. These methods don't check its SQL type for UDT, always treating UDT as variable-length, and non-mutable.

It doesn't bring any issue if UDT is used to represent complicated type, but when UDT is used to represent some type which is matched with fixed length of SQL type, it exposes the chance of correctness issues, as these informations sometimes decide how the value should be handled.

We got report from user mailing list which suspected as mapGroupsWithState looks like handling UDT incorrectly, but after some investigation it was from GenerateUnsafeRowJoiner in shuffle phase.

0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)

Here updating position should not happen on fixed-length column, but due to this bug, the value of UDT having fixed-length as sql type would be modified, which actually corrupts the value.

### Why are the changes needed?

Misclassifying of the type of length for UDT can corrupt the value when the row is presented to the input of GenerateUnsafeRowJoiner, which brings correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #27747 from HeartSaVioR/SPARK-30993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 22:33:11 +08:00
Jiaan Geng a429ac83e4 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.compression.codec | 2.3.0 | SPARK-21839 | d8f45408635d4fccac557cb1e877dfe9267fb326#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.filterPushdown | 1.4.0 | SPARK-2883 | 65d71bd9fbfe6fe1b741c80fed72d6ae3d22b028#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.orc.mergeSchema | 3.0.0 | SPARK-11412 | 73183b3c8c2022846587f08e8dea5c387ed3b8d5#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.verifyPartitionPath | 1.4.0 | Spark-5068 | 1f39a61118184e136f38381a9f3ba0b2d5d589d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.metastorePartitionPruning | 1.5.0 | SPARK-9386 | ce89ff477aea6def68265ed218f6105680755c9a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.manageFilesourcePartitions | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.filesourcePartitionFileCacheSize | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.caseSensitiveInferenceMode | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.optimizer.metadataOnly | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.columnNameOfCorruptRecord | 1.2.0 | SPARK-3339 | 1c7f0ab302de9f82b1bd6da852d133823bc67c66#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.scheduler.pool | 1.1.1 | SPARK-3025 | 496f62d9a98067256d8a51fd1e7a485ff6492fa8#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftServer.incrementalCollect | 2.0.3 | SPARK-18857 | c94288b57b5ce2232e58e35cada558d8d5b8ec6e#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.thriftserver.ui.retainedStatements | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.ui.retainedSessions | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.default | 1.3.0 | SPARK-5658 | a21090ebe1ef7a709709300712de7d928a923244#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.convertCTAS | 2.0.0 | SPARK-15646 | 5a835b99f9852b0c2a35f9c75a51d493474994ea#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.hive.gatherFastStats | 2.0.1 | SPARK-17063 | 3d283f6c9d9daef53fa4e90b0ead2a94710a37a7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.partitionColumnTypeInference.enabled | 1.5.0 | SPARK-7939 | 03ef6be9ce61a13dcd9d8c71298fb4be39119411#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.bucketing.enabled | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.bucketing.maxBuckets | 2.4.0 | SPARK-23997 | de46df549acee7fda56bb0871f444d2f3b49e582#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.crossJoin.enabled | 2.0.0 | SPARK-15425 | 4462da7071462084c5b55cc414c7faa0e1396a18#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.orderByOrdinal | 2.0.0 | SPARK-12789 | 2c5b18fb0fdeabd378dd97e91f72d1eac4e21cc7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByOrdinal | 2.0.0 | SPARK-13957 | 05f652d6c2bbd764a1dd5a45301811e14519486f#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByAliases | 2.2.0 | SPARK-14471 | af3a1411a28796d4d9a100eefb093b1d91532754#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.outputCommitterClass | 1.4.0 | SPARK-7567 | a385f4b8dd22e0e056569cffc4fa63047cb7c8f2#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.commitProtocolClass | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.parallelPartitionDiscovery.threshold | 1.5.0 | SPARK-8125 | a1064df0ee3daf496800be84293345a10e1497d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.parallelPartitionDiscovery.parallelism | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.ignoreDataLocality | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.selfJoinAutoResolveAmbiguity | 1.4.0 | SPARK-6231 | e61083ccab7764d1929248490a3d2e83987241e0#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.analyzer.failAmbiguousSelfJoin | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.retainGroupColumns | 1.4.0 | SPARK-7462 | 9c35f02b35fda80d6558573466735e79b3dd9124#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.pivotMaxValues | 1.6.0 | SPARK-8992 | 5940fc71d2a245cc6e50edb455c3dd3dbb8de43a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.runSQLOnFiles | 1.6.0 | SPARK-11197 | f8c6bec65784de89b47e96a367d3f9790c1b3115#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.codegen.wholeStage | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.useIdInClassName | 2.3.1 | SPARK-23032 | 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.maxFields | 2.0.0 | SPARK-14224 and SPARK-14223 and SPARK-14310 | 5a4b11a901703464b9261dea0642d80cf8d4856c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.factoryMode | 2.4.0 | SPARK-23711 | a40ffc656d62372da85e0fa932b67207839e7fde#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.fallback | 2.0.0 | SPARK-15759 | f0fa0a8946fb4bdf0f4697a8e389f49e98422871#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.logging.maxLines | 2.3.0 | SPARK-20871 | 2a53fbfce72b3faef020e39a1e8628d68bc95beb#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.hugeMethodLimit | 2.3.0 | SPARK-21871 | 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.methodSplitThreshold | 3.0.0 | SPARK-25850 | e017cb39642a5039abd8ce8127ad41712901bdbc#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.splitConsumeFuncByOperator | 2.3.1 | SPARK-21717 | c79e771f8952e6773c3a84cc617145216feddbcf#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.ignoreCorruptFiles | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.ignoreMissingFiles | 2.3.0 | SPARK-22366 | 8e9863531bebbd4d83eafcbc2b359b8bd0ac5734#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxRecordsPerFile | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.exchange.reuse | 2.0.0 | SPARK-13523 | 3dc9ae2e158e5b51df6f799767946fe1d190156b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.execution.reuseSubquery | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.providerClass | 2.3.0 | SPARK-20883 and SPARK-20376 | fa757ee1d41396ad8734a3f2dd045bb09bc82a2e#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.minDeltasForSnapshot | 2.0.0 | SPARK-13809 | 8c826880f5eaa3221c4e9e7d3fece54e821a0b98#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion | 2.4.0 | SPARK-22187 | b3d88ac02940eff4c867d3acb79fe5ff9d724e83#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.checkpointLocation | 2.0.0 | SPARK-13985 | caea15214571d9b12dcf1553e5c1cc8b83a8ba5b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.forceDeleteTempCheckpointLocation | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.minBatchesToRetain | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.maxBatchesToRetainInMemory | 2.4.0 | SPARK-24717 | 8b7d4f842fdc90b8d1c37080bdd9b5e1d070f5c0#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.aggregation.stateFormatVersion | 2.4.0 | SPARK-24763 | 6c5cb85856235efd464b109558896f81ae2c4c75#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stopActiveRunOnRestart | 3.0.0 | SPARK-29568 | 363af16c72abe19fc5cc5b5bdf9d8dc34975f2ba#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.join.stateFormatVersion | 3.0.0 | SPARK-26154 | c941362cb94b24bdf48d4928a1a4dff1b13a1484#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.unsupportedOperationCheck | 2.0.0 | SPARK-14473 | 775cf17eaaae1a38efe47b282b1d6bbdb99bd759#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.variable.substitute | 2.0.0 | SPARK-14769 | 334c293ec0bcc2195d502c574ca40dbc4769d666#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.aggregate.map.twolevel.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.map.vectorized.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.splitAggregateFunc.enabled | 3.0.0 | SPARK-21870 | cb0cddffe9452937033e0e6b1fc0e600d2c787ad#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.view.maxNestedViewDepth | 2.2.0 | SPARK-19877 | ee36bc1c9043ead3c3ba4fba7e68c6c47ad7ae7a#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.commitProtocolClass | 2.1.0 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.multipleWatermarkPolicy | 2.4.0 | SPARK-24730 | 6078b891da8fe7fc36579699473168ae7443284c#diff-9a6b543db706f1a90f790783d6930a13

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27730 from beliefer/add-version-to-sql-config-part-two.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:13:58 +09:00
iRakson 92a5ae2ae4 [SPARK-30234][SQL][FOLLOWUP] Rename spark.sql.legacy.addDirectory.recursive.enabled to spark.sql.legacy.addSingleFileInAddFile
### What changes were proposed in this pull request?
Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile`

### Why are the changes needed?
To follow the naming convention

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27725 from iRakson/SPARK-30234_CONFIG.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-01 10:55:41 +09:00
iRakson a40a2f8338 [SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType'
### What changes were proposed in this pull request?
Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`.

### Why are the changes needed?
Better readability of configuration.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27719 from iRakson/SPARK-27619_FOLLOWUP.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 22:57:50 +08:00
Wenchen Fan f21894e5fa [SPARK-30902][SQL] Default table provider should be decided by catalog implementations
### What changes were proposed in this pull request?

When `CREATE TABLE` SQL statement does not specify the provider, leave it to the catalog implementations to decide.

### Why are the changes needed?

It's super weird if we set the default provider to parquet when creating a table in a JDBC catalog.

### Does this PR introduce any user-facing change?

Yes, v2 catalog will not see a "provider" property in table properties if it's not specified in `CREATE TABLE` SQL statement. V2 catalog is new in 3.0.

### How was this patch tested?

new tests

Closes #27650 from cloud-fan/create_table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 15:14:23 +09:00
Liang-Chi Hsieh ba032acf95 [SPARK-30955][SQL] Exclude Generate output when aliasing in nested column pruning
### What changes were proposed in this pull request?

When aliasing in nested column pruning in Project on top of Generate, we should exclude Generate outputs.

### Why are the changes needed?

Right now we would prune nested columns in Project on top of Generate. It is possible that referred nested columns are from Generate's outputs, not from its child. To address that case, we should exclude Generate outputs when aliasing in nested column pruning.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27702 from viirya/fix-nested-pruning.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:29:46 +09:00
Kent Yao 2d2706cb86 [SPARK-30956][SQL][TESTS] Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
### What changes were proposed in this pull request?

In this PR, I addressed the comment from https://github.com/apache/spark/pull/27672#discussion_r383719562 to use `intercept` instead of `try-catch` block to assert  failures in the IntervalUtilsSuite

### Why are the changes needed?

improve tests
### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Nah

Closes #27700 from yaooqinn/intervaltest.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-27 23:12:35 +09:00
beliefer 1515d45b8d [SPARK-27924][SQL][FOLLOW-UP] Improve ANSI SQL Boolean-Predicate
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/25074 and improves the implement.

### Why are the changes needed?
Improve code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27699 from beliefer/improve-boolean-test.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 13:42:02 +08:00
beliefer 825d3dc11b [SPARK-30841][SQL][DOC] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
Add version information to the configuration of Spark SQL.
Note: Because SQLConf has a lot of configuration items, I split the items into two PR. Another PR will follows this PR.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.analyzer.maxIterations | 3.0.0 | SPARK-30138 | c2f29d5ea58eb4565cc5602937d6d0bb75558513#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.excludedRules | 2.4.0 | SPARK-24802 | 434319e73f8cb6e080671bdde42a72228bd814ef#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.maxIterations | 2.0.0 | SPARK-14677 | f4be0946af219379fb2476e6f80b2e50463adeb2#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetConversionThreshold | 2.0.0 | SPARK-14796 | 3647120a5a879edf3a96a5fd68fb7aa849ad57ef#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetSwitchThreshold | 3.0.0 | SPARK-26205 | 0c23a39384b7ae5fb4aeb4f7f6fe72007b84bbd2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.planChangeLog.level | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.rules | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.batches | 3.0.0 | SPARK-27088 | 074533334d01afdd7862a1ac6c5a7a672bcce3f8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.enabled | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.useStats | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly | 3.0.0 | SPARK-30528 | 59a13c9b7bc3b3aa5b5bc30a60344f849c0f8012#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.partitionPruning | 1.2.0 | SPARK-2961 | 248067adbe90f93c7d5e23aa61b3072dfdf48a8a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryTableScanStatistics.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.enableVectorizedReader | 2.3.1 | SPARK-23312 | e5e9f9a430c827669ecfe9d5c13cc555fc89c980#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.columnVector.offheap.enabled | 2.3.0 | SPARK-20101 | 572af5027e45ca96e0d283a8bf7c84dcf476f9bc#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.join.preferSortMergeJoin | 2.0.0 | SPARK-13977 | 9c23c818ca0175c8f2a4a66eac261ec251d27c97#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.sort.enableRadixSort | 2.0.0 | SPARK-14724 | e2b5647ab92eb478b3f7b36a0ce6faf83e24c0e5#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.limit.scaleUpFactor | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.hive.advancedPartitionPredicatePushdown.enabled | 2.3.0 | SPARK-20331 | d8cada8d1d3fce979a4bc1f9879593206722a3b9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.enabled | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |
spark.sql.adaptive.forceApply | 3.0.0 | SPARK-30719 | b29cb1a82b1a1facf1dd040025db93d998dad4cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.reducePostShufflePartitions | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.minNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.localShuffleReader.enabled | 3.0.0 | SPARK-29893 | 6e581cf164c3a2930966b270ac1406dc1195c942#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.enabled | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.skewedPartitionFactor | 3.0.0 | SPARK-30812 | 5b36cdbbfef147e93b35eaa4f8e0bea9690b6d06#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin | 3.0.0 | SPARK-9853 and SPARK-29002 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 and b2f06608b785f577999318c00f2c315f39d90889#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.subexpressionElimination.enabled | 1.6.0 | SPARK-10371 | f38509a763816f43a224653fe65e4645894c9fc4#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.caseSensitive | 1.4.0 | SPARK-4699 | 21bd7222e55b9cf684c072141998a0623a69f514#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.constraintPropagation.enabled | 2.2.0 | SPARK-19846 | e011004bedca47be998a0c14fe22a6f9bb5090cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.escapedStringLiterals | 2.2.1 | SPARK-20399 | 3d1908fd58fd9b1970cbffebdb731bfe4c776ad9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.fileCompressionFactor | 2.3.1 | SPARK-22790 | 0fc5533e53ad03eb67590ddd231f40c2713150c3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.respectSummaryFiles | 1.5.0 | SPARK-8838 | 6175d6cfe795fbd88e3ee713fac375038a3993a8#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96TimestampConversion | 2.3.0 | SPARK-12297 | acf7ef3154e094875fa89f30a78ab111b267db91#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.outputTimestampType | 2.3.0 | SPARK-10365 | 21a7bfd5c324e6c82152229f1394f26afeae771c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown.date | 2.4.0 | SPARK-23727 | b02e76cbffe9e589b7a4e60f91250ca12a4420b2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.filterPushdown.timestamp | 2.4.0 | SPARK-24718 | 43e4e851b642bbee535d22e1b9e72ec6b99f6ed4#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.decimal | 2.4.0 | SPARK-24549 | 9549a2814951f9ba969955d78ac4bd2240f85989#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.string.startsWith | 2.4.0 | SPARK-24638 | 03545ce6de08bd0ad685c5f59b73bc22dfc40887#diff-9a6b543db706f1a90f790783d6930a13 | 
spark.sql.parquet.pushdown.inFilterThreshold | 2.4.0 | SPARK-17091 | e1de34113e057707dfc5ff54a8109b3ec7c16dfb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.output.committer.class | 1.5.0 | SPARK-8139 | 111d6b9b8a584b962b6ae80c7aa8c45845ce0099#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.enableVectorizedReader | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.parquet.recordLevelFilter.enabled | 2.3.0 | SPARK-17310 | 673c67046598d33b9ecf864024ca7a937c1998d6#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Exists UT

Closes #27691 from beliefer/add-version-to-sql-config-part-one.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:58:44 +09:00
iRakson c913b9d8b5 [SPARK-27619][SQL] MapType should be prohibited in hash expressions
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour.

When `spark.sql.legacy.useHashOnMapType` is set to false:

```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```

when `spark.sql.legacy.useHashOnMapType` is set to true :

```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]

```

### Why are the changes needed?

As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users.
Code snippet from JIRA :
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```

Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236.

### Does this PR introduce any user-facing change?
Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true.

### How was this patch tested?
UT added.

Closes #27580 from iRakson/SPARK-27619.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 01:48:12 +08:00
Terry Kim 73305475c1 [SPARK-30782][SQL] Column resolution doesn't respect current catalog/namespace for v2 tables
### What changes were proposed in this pull request?

This PR proposes to fix an issue where qualified columns are not matched for v2 tables if current catalog/namespace are used.

For v1 tables, you can currently perform the following:
```SQL
SELECT default.t.id FROM t;
```

For v2 tables, the following fails:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;

org.apache.spark.sql.AnalysisException: cannot resolve '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7;
```

### Why are the changes needed?

It is a bug since qualified column names cannot match if current catalog/namespace are used.

### Does this PR introduce any user-facing change?

Yes, now the following works:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;
```

### How was this patch tested?

Added new tests

Closes #27532 from imback82/qualifed_col_respect_current.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 00:21:38 +08:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
Wenchen Fan 8f247e5d36 [SPARK-30918][SQL] improve the splitting of skewed partitions
### What changes were proposed in this pull request?

Use the average size of the non-skewed partitions as the target size when splitting skewed partitions, instead of ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD

### Why are the changes needed?

The goal of skew join optimization is to make the data distribution move even. So it makes more sense the use the average size of the non-skewed partitions as the target size.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27669 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-25 14:10:29 -08:00
Maxim Gekk ffc0935e64 [SPARK-30869][SQL] Convert dates to/from timestamps in microseconds precision
### What changes were proposed in this pull request?
In the PR, I propose to replace:

1. `millisToDays()` by `microsToDays()` which accepts microseconds since the epoch and returns days since the epoch in the specified time zone. The last one is the internal representation of Catalyst's DateType.
2. `daysToMillis()` by `daysToMicros()` which accepts days since the epoch in some time zone and returns the number of microseconds since the epoch. The last one is internal representation of Catalyst's TimestampType.
3. `fromMillis()` by `millisToMicros()`
4. `toMillis()` by `microsToMillis()`

### Why are the changes needed?
Spark stores timestamps in microseconds precision, so, there is no actual need to convert dates to milliseconds, and then to microseconds. As examples, look at DateTimeUtils functions `monthsBetween()` and `truncTimestamp()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites UnivocityParserSuite, DateExpressionsSuite, ComputeCurrentTimeSuite, DateTimeUtilsSuite, DateFunctionsSuite, JsonSuite, StreamSuite.

Closes #27618 from MaxGekk/replace-millis-by-micros.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 23:05:28 +08:00
Kent Yao 761209c1f2 [SPARK-30919][SQL] Make interval multiply and divide's overflow behavior consistent with other operations
### What changes were proposed in this pull request?

The current behavior of interval multiply and divide follows the ANSI SQL standard when overflow, it is compatible with other operations when `spark.sql.ansi.enabled` is true, but not compatible when `spark.sql.ansi.enabled` is false.

When `spark.sql.ansi.enabled` is false, as the factor is a double value, so it should use java's rounding or truncation behavior for casting double to integrals. when divided by zero, it returns `null`.  we also follow the natural rules for intervals as defined in the Gregorian calendar, so we do not add the month fraction to days but add days fraction to microseconds.

### Why are the changes needed?

Make interval multiply and divide's overflow behavior consistent with other interval operations

### Does this PR introduce any user-facing change?

no, these are new features in 3.0

### How was this patch tested?

add uts

Closes #27672 from yaooqinn/SPARK-30919.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 22:19:24 +08:00
Josh Rosen f152d2a0a8 [SPARK-30944][BUILD] Update URL for Google Cloud Storage mirror of Maven Central
### What changes were proposed in this pull request?

This PR is a followup to #27307: per https://travis-ci.community/t/maven-builds-that-use-the-gcs-maven-central-mirror-should-update-their-paths/5926, the Google Cloud Storage mirror of Maven Central has updated its URLs: the new paths are updated more frequently. The new paths are listed on https://storage-download.googleapis.com/maven-central/index.html

This patch updates our build files to use these new URLs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing build + tests.

Closes #27688 from JoshRosen/update-gcs-mirror-url.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-25 17:04:13 +09:00
Peter Toth 1a4e2423b2 [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure
### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.

### Why are the changes needed?
To fix a bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-24 13:46:21 -08:00
beliefer 621e37e2ab [SPARK-28880][SQL] Support ANSI nested bracketed comments
### What changes were proposed in this pull request?
Spark SQL support single comments and bracketed comments now. This PR will support nested bracketed comments.

There are some mainstream database support the syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Expressions/Comments.htm?zoom_highlight=comments

Note: Because Spark SQL not exists UT for single comments and bracketed comments, so I add some UT for them.

### Why are the changes needed?
nested bracketed comments is ANSI standard.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27495 from beliefer/nested-brancket-comments.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-24 00:28:46 -08:00
Peter Toth a372f76cbd [SPARK-30898][SQL] The behavior of MakeDecimal should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `nullOnOverflow` parameter to `MakeDecimal` so as to avoid its value depending on `SQLConf.get` and change during planning.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27656 from peter-toth/SPARK-30898.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:48:48 +09:00
Peter Toth 612f63f39e [SPARK-30897][SQL] The behavior of ArrayExists should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `followThreeValuedLogic` parameter to `ArrayExists` so as to avoid its value depending on `SQLConf.get` and change during planning.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27655 from peter-toth/SPARK-30897.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:47:08 +09:00
Burak Yavuz 4ff2718d54 [SPARK-30924][SQL][3.0] Add additional checks to Merge Into
### What changes were proposed in this pull request?

Merge Into is currently missing additional validation around:

 1. The lack of any WHEN statements
 2. The first WHEN MATCHED statement needs to have a condition if there are two WHEN MATCHED statements.
 3. Single use of UPDATE/DELETE

This PR introduces these validations.
(1) is required, because otherwise the MERGE statement is useless.
(2) is required, because otherwise the second WHEN MATCHED condition becomes dead code
(3) is up for debate, but the idea there is that a single expression should be sufficient to specify when you would like to update or delete your records. We restrict it for now to reduce surface area and ambiguity.

### Why are the changes needed?

To ease DataSource developers when building implementations for MERGE

### Does this PR introduce any user-facing change?

Adds additional validation checks

### How was this patch tested?

Unit tests

Closes #27677 from brkyvz/mergeChecks.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 15:16:37 +08:00
jiake f4696ba252 [SPARK-30922][SQL] remove the max splits config in skewed join
### What changes were proposed in this pull request?
When skewed join optimization split more skewed readers, the plan may be very large and can not be shown in ui quickly. The config `spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits`  is to resolve the above ui shown issue. And after [PR#27493](https://github.com/apache/spark/pull/27493) combined the skewed readers into one, we not need this config.

### Why are the changes needed?
remove the unnecessary config

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing test

Closes #27673 from JkSelf/removeMaxSplitNum.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:29:25 +08:00
Maxim Gekk c41ef39819 [SPARK-30925][SQL] Prevent overflow/round errors in conversions of milliseconds to/from microseconds
### What changes were proposed in this pull request?
- Use `Math.multiplyExact()` in `DateTimeUtils.fromMillis()` to prevent silent overflow in conversion milliseconds to microseconds.
- Use `DateTimeUtils.fromMillis()` in all places where milliseconds are converted to microseconds
- Use `DateTimeUtils.toMillis()` in all places where microseconds are converted to milliseconds

### Why are the changes needed?

1. To prevent silent arithmetic overflow while multiplying by 1000 in `fromMillis()`. Instead of it, `new ArithmeticException("long overflow")` will be thrown, and handled accordantly.
2. To correctly round microseconds in conversion to milliseconds. For example, `1965-01-01 10:11:12.123456` is represented as `-157700927876544` in micro precision. In milliseconds precision the above needs to be represented as `-157700927877` or `1965-01-01 10:11:12.123`.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `TimestampFormatterSuite`, `CastSuite`, `DateExpressionsSuite`, `IntervalExpressionsSuite`, `ExpressionParserSuite`, `ExpressionParserSuite`, `DateTimeUtilsSuite`, `IntervalUtilsSuite`

Closes #27676 from MaxGekk/millis-2-micros-overflow.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:06:25 +08:00
Maxim Gekk 310c14ac8d [MINOR][SQL] Add a comment for removedSQLConfigs
### What changes were proposed in this pull request?
In the PR, I propose to explain in the description of `removedSQLConfigs` when removed SQL configs should NOT be placed to the map.

### Why are the changes needed?
To make the cases when SQL configs should be added to `removedSQLConfigs` more clear. Recently, `spark.sql.variable.substitute.depth` was removed from the map by #27646 because it contradicts to the condition described by the PR.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `./dev/scalastyle`

Closes #27653 from MaxGekk/removedSQLConfigs-comment.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:48:10 +09:00
beliefer 59d6d5cbb0 [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
### What changes were proposed in this pull request?
Spark `ConfigEntry` and `ConfigBuilder` missing Spark version information of each configuration at release. This is not good for Spark user when they visiting the page of spark configuration.
http://spark.apache.org/docs/latest/configuration.html
The new Spark SQL config docs looks like:
![sql配置截屏](https://user-images.githubusercontent.com/8486025/74604522-cb882f00-50f9-11ea-8683-57a90f9e3347.png)

```
> SET -v
spark.sql.adaptive.enabled      false   When true, enable adaptive query execution.
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin       0.2     The relation with a non-empty partition ratio lower than this config will not be considered as the build side of a broadcast-hash join in adaptive execution regardless of its size.This configuration only has an effect when 'spark.sql.adaptive.enabled' is enabled.
spark.sql.adaptive.optimizeSkewedJoin.enabled   true    When true and adaptive execution is enabled, a skewed join is automatically handled at runtime.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor     10      A partition is considered as a skewed partition if its size is larger than this factor multiple the median partition size and also larger than  spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionMaxSplits  5       Configures the maximum number of task to handle a skewed partition in adaptive skewedjoin.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold      64MB    Configures the minimum size in bytes for a partition that is considered as a skewed partition in adaptive skewed join.
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled    true    Whether to fetch the continuous shuffle blocks in batch. Instead of fetching blocks one by one, fetching continuous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note, multiple continuous blocks exist in single fetch request only happen when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled, this feature also depends on a relocatable serializer, the concatenation support codec in use and the new version shuffle fetch protocol.
spark.sql.adaptive.shuffle.localShuffleReader.enabled   true    When true and 'spark.sql.adaptive.enabled' is enabled, this enables the optimization of converting the shuffle reader to local shuffle reader for the shuffle exchange of the broadcast hash join in probe side.
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions  <undefined>     The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled.
```

**Note**: Because there are so many configuration items that are exposed and require a lot of finishing, I will add the version numbers of these configuration items in another PR.

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Exists UT

Closes #27592 from beliefer/add-version-to-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:46:42 +09:00
Eric Wu 1f0300fb16 [SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style
### What changes were proposed in this pull request?
The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in
https://github.com/apache/spark/pull/27368#discussion_r376694496
https://github.com/apache/spark/pull/27368#discussion_r376927143

Observations/Ideas:
1. Using comma as the separator is not clear, especially commas are used inside the expressions too.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated, this need to refined.
4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between
`EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`.
5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability.

### Why are the changes needed?
The readability of `EXPLAIN FORMATTED` need to be improved, which will help user better understand the query plan.

### Does this PR introduce any user-facing change?
Yes, `EXPLAIN FORMATTED` output style changed.

### How was this patch tested?
Update expect results of test cases in explain.sql

Closes #27509 from Eric5553/ExplainFormattedRefine.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 23:36:14 +08:00
Yuanjian Li a5efbb284e [SPARK-30809][SQL] Review and fix issues in SQL API docs
### What changes were proposed in this pull request?
- Add missing `since` annotation.
- Don't show classes under `org.apache.spark.sql.dynamicpruning` package in API docs.
- Fix the scope of `xxxExactNumeric` to remove it from the API docs.

### Why are the changes needed?
Avoid leaking APIs unintentionally in Spark 3.0.0.

### Does this PR introduce any user-facing change?
No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.

### How was this patch tested?
Manually generated the API docs and verified the above issues have been fixed.

Closes #27560 from xuanyuanking/SPARK-30809.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 17:03:22 +08:00
Maxim Gekk abe0821ee9 [SPARK-30894][SQL] Make Size's nullable independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `legacySizeOfNull ` parameter to the `Size` expression, and pass the value of `spark.sql.legacy.sizeOfNull` if `legacySizeOfNull` is not provided on creation of `Size`.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `CollectionExpressionsSuite`.

Closes #27658 from MaxGekk/Size-SQLConf-get-deps.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 15:32:11 +08:00
yi.wu 82ce4753aa [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default
### What changes were proposed in this pull request?

This PR proposes to throw exception by default when user use untyped UDF(a.k.a `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

And user could still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?

According to #23498, since Spark 3.0, the untyped UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will  return 0 in Spark 3.0 but null in Spark 2.4. And the behavior change is introduced due to Spark3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and may cause correctness issue if user still expect `null` in some cases. Thus, we'd better to encourage user to use typed UDF to avoid this problem.

### Does this PR introduce any user-facing change?

Yeah. User will hit exception now when use untyped UDF.

### How was this patch tested?

Added test and updated some tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 14:46:54 +08:00
yi.wu 4d356554a6 [MINOR][SQL] Fix error position of NOSCAN
### What changes were proposed in this pull request?

Point to correct position when miswrite `NOSCAN` detects.

### Why are the changes needed?

Before:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 0)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ^^^
```

After:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 78)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ------------------------------------------------------------------------------^^^
```

### Does this PR introduce any user-facing change?

Yes, user will see better error message.

### How was this patch tested?

Manually test.

Closes #27662 from Ngone51/fix_noscan_reference.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-21 15:21:53 +09:00
wuyi 5eb004f4bb Revert "[SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue"
### What changes were proposed in this pull request?

This reverts commit bef5d9d6c3.

### Why are the changes needed?

Revert it according to https://github.com/apache/spark/pull/24902#issuecomment-584511167.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27540 from Ngone51/revert_spark_28093.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 12:55:32 +08:00
Maxim Gekk bb40ab09f4 [SPARK-30892][SQL] Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
### What changes were proposed in this pull request?

Exclude the SQL config `spark.sql.variable.substitute.depth` from `SQLConf.removedSQLConfigs`

### Why are the changes needed?
By the #27169, the config was placed to `SQLConf.removedSQLConfigs`. And as a consequence of that when an user set it non-default value (1 for example),  he/she will get an exception. It is acceptable for SQL configs that could impact on the behavior but not for this particular config. Raising of such exception will just make migration to Spark 3.0 more difficult.

### Does this PR introduce any user-facing change?
Yes, before the changes users get an exception when he/she set `spark.sql.variable.substitute.depth` to a value different from `40`.

### How was this patch tested?
Run `spark.conf.set("spark.sql.variable.substitute.depth", 1)` in `spark-shell`.

Closes #27646 from MaxGekk/remove-substitute-depth-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:44:09 +08:00
Maxim Gekk a551715fd2 [SPARK-29930][SPARK-30416][SQL][FOLLOWUP] Move deprecated/removed config checks from RuntimeConfig to SQLConf
### What changes were proposed in this pull request?
- Output warnings for deprecated SQL configs in `SQLConf. setConfWithCheck()` and in `SQLConf. unsetConf()`
- Throw an exception for removed SQL configs in `SQLConf. setConfWithCheck()` when they set to non-default values
- Remove checking of deprecated and removed SQL configs from RuntimeConfig

### Why are the changes needed?
Currently, warnings/exceptions are printed only when a SQL config is set dynamically, for instance via `spark.conf.set()`. After the changes, removed/deprecated SQL configs will be checked when they set statically. For example:
```
$ bin/spark-shell --conf spark.sql.fromJsonForceNullableSchema=false
scala> spark.emptyDataFrame
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
...
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value.
```
```
$ bin/spark-shell --conf spark.sql.hive.verifyPartitionPath=false
scala> spark.emptyDataFrame
20/02/20 02:10:26 WARN SQLConf: The SQL config 'spark.sql.hive.verifyPartitionPath' has been deprecated in Spark v3.0 and may be removed in the future. This config is replaced by 'spark.files.ignoreMissingFiles'.
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `SQLConfSuite`

Closes #27645 from MaxGekk/remove-sql-configs-followup-2.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:00:48 +08:00
Maxim Gekk 4248b7fbb9 [SPARK-30858][SQL] Make IntegralDivide's dataType independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

OptionsAttachments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `ArithmeticExpressionSuite`.

Closes #27628 from MaxGekk/integral-divide-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-20 21:26:20 +08:00