Commit graph

9509 commits

Author SHA1 Message Date
Max Gekk 5539ecfdac [SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default
### What changes were proposed in this pull request?
Set default time zone and locale in the default constructor of `SparkFunSuite`:
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US` (see the sketch below)
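
A minimal sketch of the approach (assuming ScalaTest's `AnyFunSuite` as the base class; the real `SparkFunSuite` carries more machinery):

```scala
import java.util.{Locale, TimeZone}
import org.scalatest.funsuite.AnyFunSuite

abstract class SparkFunSuite extends AnyFunSuite {
  // Pin JVM-wide defaults in the constructor so every suite sees the same settings.
  TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
  Locale.setDefault(Locale.US)
}
```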

### Why are the changes needed?
1. To deduplicate code by moving common time zone and locale settings to one place, `SparkFunSuite`
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running all affected test suites

Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:26:00 +00:00
Kent Yao 1d66085a93 [SPARK-31289][TEST][TEST-HIVE1.2] Eliminate org.apache.spark.sql.hive.thriftserver.CliSuite flakiness
### What changes were proposed in this pull request?

CliSuite seems to be flaky while using metastoreDir per test.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/org.apache.spark.sql.hive.thriftserver/CliSuite/

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/history/

According to the error stack trace in the failed test, the test failed to instantiate a hive metastore client because of derby requirements.
```
Caused by: ERROR XBM0A:
 The database directory '/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-9249ce52-0a06-42b6-a3df-e6295e880df0' exists.
However, it does not contain the expected 'service.properties' file.
Perhaps Derby was brought down in the middle of creating this database.
You may want to delete this directory and try creating the database again.
```

Derby requires that the metastore directory not exist, but it does exist, probably because the preceding test case failed to clean up its metastore directory.

In this PR, the metastore is shared across the tests of CliSuite, except those that explicitly request a separate metastore environment.
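
A hedged sketch of the sharing pattern (helper names assumed, not the actual CliSuite code):

```scala
import java.io.File
import java.nio.file.Files

// One metastore directory created once for the whole suite, instead of a
// fresh directory per test; tests reuse it unless they ask for their own.
lazy val sharedMetastore: File = Files.createTempDirectory("cli_suite_metastore").toFile

def metastoreJdbcUrl(dir: File): String =
  s"jdbc:derby:;databaseName=${dir.getAbsolutePath}/metastore_db;create=true"
```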

### Why are the changes needed?

CliSuite is flaky when each test uses its own metastoreDir; sharing the metastore eliminates this flakiness.
### Does this PR introduce any user-facing change?

no
### How was this patch tested?

modified test

Closes #28055 from yaooqinn/clisuite.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 09:11:21 +00:00
Yuanjian Li 86bd37f37e [SPARK-31663][SQL] Grouping sets with having clause returns the wrong result
### What changes were proposed in this pull request?
- Resolve the having condition together with expanding the GROUPING SETS/CUBE/ROLLUP expressions in `ResolveGroupingAnalytics`:
    - Change the resolving direction of these operations to top-down.
    - Try resolving the condition of the filter as though it were in the aggregate clause, by reusing the function in `ResolveAggregateFunctions`.
    - Push the aggregate expressions into the aggregate which contains the expanded operations.
- Use UnresolvedHaving for all having clauses (an illustrative query follows).
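
An illustrative query of the affected shape (an assumed example; see SPARK-31663 for the actual demo):

```scala
// HAVING over a GROUPING SETS aggregate; before this fix the having condition
// could be resolved against the wrong, pre-expansion plan. `t(k, v)` is hypothetical.
spark.sql("""
  SELECT k, SUM(v) AS total
  FROM t
  GROUP BY GROUPING SETS ((k), ())
  HAVING SUM(v) > 10
""").show()
```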

### Why are the changes needed?
Correctness bug fix. See the demo and analysis in SPARK-31663.

### Does this PR introduce _any_ user-facing change?
Yes, correctness bug fix for HAVING with GROUPING SETS.

### How was this patch tested?
New UTs added.

Closes #28501 from xuanyuanking/SPARK-31663.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 04:37:18 +00:00
yi.wu d8b001fa87 [SPARK-31620][SQL] Fix reference binding failure in case a final agg contains a subquery
### What changes were proposed in this pull request?

Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpressions`.

### Why are the changes needed?

When planning aggregates, the partial aggregate uses the agg funcs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105

For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348

This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy.

Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`.
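
A hedged illustration of a plan shape that can trigger the issue (assumed, not the exact regression test): a final aggregate whose plan contains a scalar subquery, so `PlanSubqueries` copies the aggregate function after planning.

```scala
// Runs in spark-shell; `t(k, v)` is a hypothetical table.
spark.sql("""
  SELECT SUM(v)
  FROM t
  GROUP BY k
  HAVING SUM(v) > (SELECT AVG(v) FROM t)
""").show()
```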

### Does this PR introduce _any_ user-facing change?

Yes, users previously hit an error but can run the query successfully after this change.

### How was this patch tested?

Added a regression test.

Closes #28496 from Ngone51/spark-31620.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 15:36:28 +00:00
Dongjoon Hyun 5d90886523
[SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?

This PR aims to provide a fallback version instead of `Nil` in `HiveExternalCatalogVersionsSuite`. The provided fallback Spark versions let Jenkins jobs recover instead of failing.

### Why are the changes needed?

Currently, `HiveExternalCatalogVersionsSuite` is aborted in all Jenkins jobs except JDK11 Jenkins jobs which don't have old Spark releases supporting JDK11.
```
HiveExternalCatalogVersionsSuite:
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - Fail to get the lates Spark versions to test. (HiveExternalCatalogVersionsSuite.scala:180)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins

Closes #28536 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-15 00:30:25 -07:00
Kent Yao 503faa24d3 [SPARK-31715][SQL][TEST] Fix flaky SparkSQLEnvSuite that sometimes violates the single Derby instance restriction
### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122622/testReport/junit/org.apache.spark.sql.hive.thriftserver/SparkSQLEnvSuite/SPARK_29604_external_listeners_should_be_initialized_with_Spark_classloader/history/?start=25

According to the test report history of SparkSQLEnvSuite, this test fails frequently because of the single Derby instance restriction.

```java
Caused by: sbt.ForkMain$ForkError: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/metastore_db.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
	at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
	at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)
	at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)
	at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)
	... 138 more
```

This PR adds a separate directory to locate the metastore_db for this test which runs in a dedicated JVM.

Besides, this PR disables the UI to avoid a potential race on `spark.ui.port`, which may also make the test flaky.
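
A hedged sketch of the two changes (the Derby/Hive property and Spark conf names are the standard ones; the exact test wiring is assumed):

```scala
import java.nio.file.Files

// Point the embedded Derby metastore at a directory dedicated to this JVM.
val metastorePath = Files.createTempDirectory("metastore_db").toFile.getAbsolutePath
System.setProperty("javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=$metastorePath;create=true")

// Disable the UI so concurrent suites cannot race on spark.ui.port.
System.setProperty("spark.ui.enabled", "false")
```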

### Why are the changes needed?
test fix

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
SparkSQLEnvSuite itself.

Closes #28537 from yaooqinn/SPARK-31715.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 06:36:34 +00:00
Max Gekk c7ce37dfa7 [SPARK-31712][SQL][TESTS] Check casting timestamps before the epoch to Byte/Short/Int/Long types
### What changes were proposed in this pull request?
Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ansi and non-ansi modes.
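
For instance, a pre-epoch cast of the kind covered (an assumed spark-shell example):

```scala
// In non-ANSI mode the cast yields the floor of seconds since the epoch,
// which is negative for timestamps before 1970-01-01 00:00:00Z.
spark.sql("SELECT CAST(TIMESTAMP '1969-12-31 23:59:59' AS LONG)").show()
```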

### Why are the changes needed?
To improve test coverage and prevent errors while modifying the CAST expression code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ ./build/sbt "test:testOnly *CastSuite"
```

Closes #28531 from MaxGekk/test-cast-timestamp-to-byte.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 04:24:58 +00:00
Dongjoon Hyun bbb62c5405
Revert "[SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener"
This reverts commit 6994c64efd.
2020-05-14 12:01:03 -07:00
sunke.03 ddbce4edee [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
### What changes were proposed in this pull request?

This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. This bug appeared in our online cluster. `ScriptTransformationExec` should throw an exception when a user's Python script contains a parse error, but the current implementation may miss this failure case.

### Why are the changes needed?

When a user's Python script contains a parse error, there is no output, so `scriptOutputReader.next(scriptOutputWritable) <= 0` matches and we call `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot detect this failure case. In the end, the Spark SQL job runs successfully and returns no result, when in fact it should fail and surface the exception properly.

For example, a faulty Python script is shown below.
``` python
# encoding: utf8
import unknow_module
import sys

for line in sys.stdin:
    print line
```
The bug can be reproduced by running the following code in our cluster.
```scala
spark.range(100*100).toDF("index").createOrReplaceTempView("test")
spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UT

Closes #27724 from slamke/transformation.

Authored-by: sunke.03 <sunke.03@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-14 13:55:24 +00:00
Karuppayya Rajendran 72601460ad
[SPARK-31692][SQL] Pass Hadoop confs specified via Spark confs to URLStreamHandlerFactory
### What changes were proposed in this pull request?
Pass Hadoop confs specified via Spark confs to `URLStreamHandlerFactory`.

### Why are the changes needed?

**BEFORE**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5793cd84

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@22846025

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem@5a930c03
```

**AFTER**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5c24a636

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream@2ba8f528

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem
```
The type of the FileSystem object created (see the last statement in each snippet above) differs between the two cases, which should not have been the case.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested locally.
Added Unit test

Closes #28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 23:18:38 -07:00
Wenchen Fan fd2d55c991 [SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files
### What changes were proposed in this pull request?

When reading/writing datetime values that are before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase or not.

### Why are the changes needed?

Rebasing or not rebasing gives different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception, as it only affects ancient datetime values.

### Does this PR introduce _any_ user-facing change?

Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config.
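
A hedged example of choosing a mode explicitly after hitting the error (the config key is the Parquet read-side one as named around Spark 3.0; treat the exact key as an assumption):

```scala
// "LEGACY" rebases like Spark 2.4 did; "CORRECTED" reads the stored values as-is.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
```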

### How was this patch tested?

updated tests

Closes #28477 from cloud-fan/rebase.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-14 12:32:40 +09:00
Ali Smesseim 6994c64efd [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
### What changes were proposed in this pull request?

The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.

Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed.
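
A minimal sketch of the guarded-update pattern (names and types simplified, not the actual listener code):

```scala
import scala.collection.mutable

case class ExecutionInfo(var state: String)
val executionList = mutable.Map.empty[String, ExecutionInfo]

def onOperationClosed(opId: String): Unit = executionList.get(opId) match {
  case Some(info) => info.state = "CLOSED"
  case None => println(s"WARN: onOperationClosed called with unknown operation: $opId")
}
```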

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this interferes with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-12 09:14:34 -07:00
Max Gekk a3fafddf39 [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
### What changes were proposed in this pull request?
Generates java.time.LocalDate for DateType and java.time.Instant for TimestampType in `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.

### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.

Closes #28502 from MaxGekk/random-java8-datetime.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 14:05:31 +00:00
Javier Fuentes 178ca961fe [SPARK-31102][SQL] Spark-sql fails to parse when the input contains a comment
### What changes were proposed in this pull request?

This PR resets the insideComment flag to false on a newline, fixing the issue introduced by SPARK-30049.

### Why are the changes needed?

Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:
```
spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```

This happened because the splitSemiColon method had no flag for comment sections in which quotes should be ignored. SPARK-30049 added that flag and fixed the issue, but introduced the following problem:
```
spark-sql> select
         >   1,
         >   -- two
         >   2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
  1,
--^^^
```
This issue is caused by failing to turn off the insideComment flag on a newline.
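
A minimal sketch of the fix (the flag name follows the description; the real `splitSemiColon` also tracks quotes and escapes):

```scala
def splitSemiColon(input: String): Seq[String] = {
  val parts = Seq.newBuilder[String]
  val current = new StringBuilder
  var insideComment = false
  for (ch <- input) ch match {
    case '\n' if insideComment =>
      insideComment = false // a single-line comment ends at the newline
      current += ch
    case '-' if !insideComment && current.lastOption.contains('-') =>
      insideComment = true // "--" starts a comment
      current += ch
    case ';' if !insideComment =>
      parts += current.toString; current.clear()
    case _ =>
      current += ch
  }
  if (current.nonEmpty) parts += current.toString
  parts.result()
}
```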

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- For existing tests that use line continuation (`\`), a line-continuation rule was added to the SqlBase.g4 file to bring that functionality to the SQL context.
- A new test for inline comments was added.

Closes #27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:46:24 +00:00
Kent Yao ce714d8189 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
### What changes were proposed in this pull request?

When I was finding the root cause of SPARK-31675, I noticed that it was very difficult to see what was actually going on, since the CLI printed nothing but
```
Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1
```

It is really hard for us to find causes through such a simple error message without a certain amount of experience.

In this PR, I propose to print the full stack trace when an AnalysisException with an underlying root cause occurs; this can be suppressed via the `-S` (silent) option.

### Why are the changes needed?

In SPARK-11188,

>For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query.

But nowadays, some `AnalysisException`s do have useful information for us to debug, e.g. the `AnalysisException` below may contain exceptions from hive or Hadoop side.

a28ed86a38/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (L97-L112)

```scala
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468)
	at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626)
	at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593)
```

### Does this PR introduce _any_ user-facing change?

Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed.

#### before

```scala
Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10;
```

#### After
```scala
Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
	at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:486)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:480)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:2093)
	at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:289)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1221)
	at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2607)
	at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2892)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:927)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
	at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
	... 52 more
```

You can use the `-S` option to restore the old behavior if you find the error too verbose.

### How was this patch tested?

Existing CliSuite -
`SPARK-11188 Analysis error reporting`

Add new test and verify manually

Closes #28499 from yaooqinn/SPARK-31678.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:37:13 +00:00
beliefer a89006aba0 [SPARK-31393][SQL] Show the correct alias in schema for expression
### What changes were proposed in this pull request?
Some expression aliases are not displayed correctly in the schema. This PR fixes them:
- `TimeWindow`
- `MaxBy`
- `MinBy`
- `UnaryMinus`
- `BitwiseCount`

This PR also fix a typo issue, please look at b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)

Note:

1. `MaxBy` and `MinBy` extend `MaxMinBy`, and the latter adds an unneeded method `funcName`. We can reuse `prettyName` to replace `funcName`.
2. Some Spark SQL functions have inelegant implementations. For example, `BitwiseCount` overrides the sql method as shown below:
`override def sql: String = s"bit_count(${child.sql})"`
I don't think this is elegant enough, because `Expression` gives the following definition:
```scala
  def sql: String = {
    val childrenSQL = children.map(_.sql).mkString(", ")
    s"$prettyName($childrenSQL)"
  }
```
By this definition, `BitwiseCount` should override the `prettyName` method instead.
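
A standalone sketch of that pattern (simplified classes, not Spark's actual hierarchy): derive `sql` from `prettyName` once, and let each expression override only `prettyName`.

```scala
trait Expr {
  def children: Seq[Expr]
  def prettyName: String = getClass.getSimpleName.toLowerCase
  def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")})"
}
case class Col(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  override def sql: String = name
}
case class BitwiseCount(child: Expr) extends Expr {
  def children: Seq[Expr] = Seq(child)
  override def prettyName: String = "bit_count" // instead of overriding sql
}
// BitwiseCount(Col("c")).sql == "bit_count(c)"
```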

### Why are the changes needed?
Improve the implementation of some expressions.

### Does this PR introduce any user-facing change?
'Yes'. This PR lets users see the correct alias in the schema.

### How was this patch tested?
Jenkins test.

Closes #28164 from beliefer/elegant-pretty-name-for-function.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-12 10:25:04 +09:00
Gabor Somogyi 5a5af46a94 [SPARK-31575][SQL] Synchronise global JVM security configuration modification
### What changes were proposed in this pull request?
There is a race in secure JDBC connection providers: multiple providers can read and/or write the exact same JVM security configuration at the same time. In this PR I've synchronised them with a lock object. Since reading and writing the configuration takes a couple of milliseconds, this won't cause performance degradation.
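
A minimal sketch of the synchronisation (the object name is assumed; provider internals are elided):

```scala
object SecurityConfigurationLock {
  // Serialize all reads/writes of the global JVM security configuration.
  def withModification[T](body: => T): T = synchronized(body)
}

// A provider would wrap its configuration read-modify-write like:
// SecurityConfigurationLock.withModification { /* read + update JAAS config */ }
```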

### Why are the changes needed?
There is a race in secure JDBC connection providers.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing unit + integration tests.

Closes #28368 from gaborgsomogyi/SPARK-31575.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:10:58 -05:00
Max Gekk 32a5398b65 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date/timestamp values with 0.5 probability. This will trigger dictionary encoding in the Parquet datasource test HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.

### Why are the changes needed?
To extend test coverage. Currently, probability of testing of dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed in wide range, and the chance of generating the same values is pretty low. In this way, parquet datasource cannot apply dictionary encoding for such column types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.

Closes #28481 from MaxGekk/test-random-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 12:59:41 +00:00
Max Gekk 5d5866be12 [SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to specially handle `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianMicros()`.

### Why are the changes needed?
This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala>
Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```
The expected values **must be 1001-01-01 01:02:03.123**, not 1001-01-07 00:32:20.123.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to check reading dictionary-encoded timestamps.

Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 04:58:08 +00:00
Max Gekk 9f768fa991 [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
### What changes were proposed in this pull request?
Shift non-existing dates in the Proleptic Gregorian calendar by 1 day. The reason is that `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some of them don't exist in the Proleptic Gregorian calendar, like 1000-02-29, because 1000 is not a leap year in the Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite and setting a non-existing date manually:
```scala
import java.sql.Date
import scala.util.Try
val date = new Date(1000 - 1900, 1, 29) // 1000-02-29 exists only in the hybrid Julian calendar
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + 24L * 3600 * 1000)) // i.e. MILLIS_PER_DAY
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 14:22:12 -05:00
Max Gekk ce63bef1da [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to specially handle `DateType` when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.

### Why are the changes needed?
This fixes the bug of loading dates before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
The expected values **must be 1001-01-01**, not 1001-01-07.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading dictionary-encoded dates.

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-10 13:31:26 +09:00
manuzhang 77c690a725 [SPARK-31658][SQL] Fix SQL UI not showing write commands of AQE plan
### What changes were proposed in this pull request?
Show write commands on SQL UI of an AQE plan

### Why are the changes needed?
Currently the leaf node of an AQE plan is always an `AdaptiveSparkPlan`, which is not true when it is the child of a write command. Hence, the node of the write command as well as its metrics are not shown on the SQL UI.

#### Before

![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png)

#### After

![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add UT.

Closes #28474 from manuzhang/aqe-ui.

Lead-authored-by: manuzhang <owenzhang1990@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-08 10:24:13 -07:00
Max Gekk 272d229005 [SPARK-31361][SQL][TESTS][FOLLOWUP] Check non-vectorized Parquet reader while date/timestamp rebasing
### What changes were proposed in this pull request?
In PR, I propose to modify two tests of `ParquetIOSuite`:
- SPARK-31159: rebasing timestamps in write
- SPARK-31159: rebasing dates in write

to check non-vectorized Parquet reader together with vectorized reader.

### Why are the changes needed?
To improve test coverage and make sure that non-vectorized reader behaves similar to the vectorized reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetIOSuite`:
```
$ ./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28466 from MaxGekk/test-novec-rebase-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-07 07:52:29 +00:00
Kent Yao b31ae7bb0b [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
### What changes were proposed in this pull request?

RuntimeReplaceable expressions are replaced at runtime; thus their original parameters are not resolved to PrettyAttribute and remain debug-style strings if we implement their `sql` methods directly with their parameters' `sql` methods.

This PR is raised with suggestions by maropu and cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable expressions with toPrettySQL.

### Why are the changes needed?

Consistency of schema output between RuntimeReplaceable expressions and normal ones.

For example, compare `date_format` and `to_timestamp`; before this PR, they output differently:

#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")

struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>

```

### Does this PR introduce _any_ user-facing change?

Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example

### How was this patch tested?
regenerate all related tests

Closes #28420 from yaooqinn/SPARK-31615.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:40:26 +09:00
Liang-Chi Hsieh 9bf738724a [SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown
### What changes were proposed in this pull request?

This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document.

### Why are the changes needed?

Make developers less confusing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #28468 from viirya/SPARK-31365-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:57:08 +09:00
Max Gekk 3d38bc2605 [SPARK-31361][SQL][FOLLOWUP] Use LEGACY_PARQUET_REBASE_DATETIME_IN_READ instead of avro config in ParquetIOSuite
### What changes were proposed in this pull request?
Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ` with the Parquet config `LEGACY_PARQUET_REBASE_DATETIME_IN_READ` in `ParquetIOSuite`.

### Why are the changes needed?
Avro config is not relevant to the parquet tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetIOSuite` via
```
./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28461 from MaxGekk/fix-conf-in-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-07 09:46:42 +09:00
HyukjinKwon 5c5dd77d6a [SPARK-31647][SQL] Deprecate 'spark.sql.optimizer.metadataOnly' configuration
### What changes were proposed in this pull request?

This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release.

### Why are the changes needed?

This optimization can cause a potential correctness issue, see also SPARK-26709.
Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590.

It looks like we should let users use `SparkSessionExtensions` instead if they must, and remove this optimization on the Spark side.

### Does this PR introduce _any_ user-facing change?

Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning:

```scala
scala> spark.conf.unset("spark.sql.optimizer.metadataOnly")
```
```
20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
 deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
```scala
scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```
```
20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```

### How was this patch tested?

Manually tested.

Closes #28459 from HyukjinKwon/SPARK-31647.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:00:59 +09:00
yi.wu b16ea8e1ab [SPARK-31650][SQL] Fix wrong UI when AdaptiveSparkPlanExec has unmanaged subqueries
### What changes were proposed in this pull request?

Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages.

### Why are the changes needed?

If there are subqueries that do not belong to any query stage of the main query, the main query could get its final physical plan and update the UI before those subqueries finish. As a result, the UI cannot reflect the changes from the subqueries, e.g. new nodes generated from subqueries.

Before:

<img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png">

After:
<img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png">

### Does this PR introduce _any_ user-facing change?

No (the AQE feature hasn't been released).

### How was this patch tested?

Tested manually.

Closes #28460 from Ngone51/fix_aqe_ui.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 12:52:53 +00:00
Liang-Chi Hsieh 4952f1a03c [SPARK-31365][SQL] Enable nested predicate pushdown per data sources
### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added the nested predicate pushdown feature, which is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all-or-nothing and applies to all data sources.

In order to not introduce an API-breaking change after enabling nested predicate pushdown, we'd like to configure nested predicate pushdown per data source. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.
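
A hedged example of the resulting knob (the key and its comma-separated value format are assumptions based on the constant name above):

```scala
// Enable nested predicate pushdown only for the listed v1 file sources.
spark.conf.set(
  "spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
  "parquet,orc")
```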

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:50:06 +00:00
Daoyuan Wang 53a9bf8fec [SPARK-31595][SQL] Spark sql should allow unescaped quote mark in quoted string
### What changes were proposed in this pull request?
`def splitSemiColon` cannot correctly handle unescaped quote marks like "'" or '"'. When there are unmatched quotes in a string, `splitSemiColon` does not handle the semicolon as expected.

### Why are the changes needed?
Some regex expressions use quote marks inside strings; we should still process the semicolon correctly.
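
An illustrative statement of the affected kind (assumed example):

```scala
// The unescaped single quote inside the double-quoted string must not hide
// the terminating semicolon from splitSemiColon.
val stmt = """SELECT "it's a quote" AS s;"""
```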

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added Unit test and also manual test.

Closes #28393 from adrian-wang/unescaped.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:34:43 +00:00
sychen 588966d696 [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters
### What changes were proposed in this pull request?
Metadata-only queries should not include subqueries in partition filters.

### Why are the changes needed?

Applying the `OptimizeMetadataOnlyQuery` rule to a query whose partition filters include a subquery raises the exception `Cannot evaluate expression: scalar-subquery`.
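
An assumed illustration of the failing shape: a metadata-only aggregate over a partition column whose filter contains a scalar subquery.

```scala
// `t` is a hypothetical table partitioned by `part`.
spark.sql("""
  SELECT MAX(part)
  FROM t
  WHERE part >= (SELECT MAX(part) - 1 FROM t)
""").show()
```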

### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, queries whose partition filters include a subquery now succeed.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-06 10:56:19 +09:00
Max Gekk bd26429931 [SPARK-31641][SQL] Fix days conversions by JSON legacy parser
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.

### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-11|
+----------+
```

After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### How was this patch tested?
Add a test to `JsonSuite`.

Closes #28453 from MaxGekk/json-rebase-legacy-days.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:15:31 +00:00
Max Gekk bef5828e12 [SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold
### What changes were proposed in this pull request?
Skip timestamp rebasing after a global threshold beyond which there is no difference between the Julian and Gregorian calendars. This avoids checking the hash maps of switch points and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.
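
A self-contained sketch of the short-circuit (the threshold and rebase body are stand-ins, not Spark's actual values):

```scala
object RebaseSketch {
  // At/after the last Julian-Gregorian switch point the calendars agree,
  // so rebasing is the identity and the switch-point lookup can be skipped.
  val lastSwitchMicros: Long = 0L                      // stand-in threshold
  private def slowRebase(micros: Long): Long = micros  // stand-in for the map lookup

  def rebaseMicros(micros: Long): Long =
    if (micros >= lastSwitchMicros) micros // fast path: no lookup
    else slowRebase(micros)
}
```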

### Why are the changes needed?
The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`.

Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             376            388          10         13.3          75.2       1.1X
Collect java.sql.Timestamp                         1878           1937          64          2.7         375.6       0.2X
```

After:
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             249            264          24         20.1          49.8       1.7X
Collect java.sql.Timestamp                         1503           1523          24          3.3         300.5       0.3X
```

Average perf improvements:

1. From java.sql.Timestamp: ~34%
2. To java.sql.Timestamp: ~16%

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.

Closes #28441 from MaxGekk/opt-rebase-common-threshold.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:11:53 +00:00
turbofei 8d1f7d2a4a [SPARK-31467][SQL][TEST] Refactor the sql tests to prevent TableAlreadyExistsException
### What changes were proposed in this pull request?
If we add a UT in hive/SQLQuerySuite or other SQL test suites and use a table named `test`, we may hit a TableAlreadyExistsException.

```
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'
```

The reason is that some tests do not clean up their tables/views.
In this PR, I add `withTempViews` for these tests.
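
A hedged sketch of the helper (signature assumed; it follows the loan pattern of Spark's other test helpers):

```scala
// Drop the named temp views after the body runs, even if it fails.
def withTempViews(viewNames: String*)(f: => Unit): Unit = {
  try f finally viewNames.foreach(name => spark.catalog.dropTempView(name))
}
```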

### Why are the changes needed?
To fix the TableAlreadyExistsException issue when adding a UT that uses a table named `test` (or similar) in some SQL test suites, such as hive/SQLQuerySuite.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existed UT.

Closes #28239 from turboFei/SPARK-31467.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-05 15:14:33 +09:00
Max Gekk 735771e7b4 [SPARK-31623][SQL][TESTS] Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write
### What changes were proposed in this pull request?
Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of the INT96 and TIMESTAMP_MILLIS column types. Here are benchmark results for reading timestamps after the year 1582 with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS, and on for INT96):

timestamp type | vectorized off (ns/row) | vectorized on (ns/row)
--|--|--
TIMESTAMP_MICROS| 160.1 | 50.2
INT96 | 215.6 | 117.8
TIMESTAMP_MILLIS | 159.9 | 60.6

### Why are the changes needed?
To compare the default timestamp type, `TIMESTAMP_MICROS`, with the other types, in case a user decides to switch to them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the benchmarks via:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 05:40:15 +00:00
beliefer b9494206a5 [SPARK-31372][SQL][TEST][FOLLOW-UP] Improve ExpressionsSchemaSuite so that it is easy to track the diff
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/28194.
As discussed at https://github.com/apache/spark/pull/28194/files#r418418796.
This PR improves `ExpressionsSchemaSuite` so that it is easy to track the diff.
Although `ExpressionsSchemaSuite` at line
b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L165)
just wants to compare the total sizes of the expected output and the newest output, the scalatest framework outputs extra information that contains the full content of both the expected and the newest output.
This PR will try to avoid this issue.
After this PR, the exception looks like below:
```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info]   340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```
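A hypothetical sketch of the style of check that produces such a concise message (the names are illustrative, not the suite's actual ones):
```scala
// Compare only the block counts and give a short hint, instead of letting
// scalatest print the full contents of both files on mismatch.
assert(expectedOutputs.size == outputs.size,
  s"Expected ${expectedOutputs.size} blocks in result file but got ${outputs.size}. " +
    "Try regenerating the result files.")
```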

### Why are the changes needed?
Make the exception more concise and clear.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-05 10:04:16 +09:00
Max Gekk 372ccba063
[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.
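For reference, users who still want the logical type can opt back in per session via the existing SQL config (INT96 is the default again after this revert):
```scala
// Write TIMESTAMP_MICROS instead of the (now default) INT96 in this session.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
```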

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 17:27:02 -07:00
Burak Yavuz 02a319d7e1 [SPARK-31624] Fix SHOW TBLPROPERTIES for V2 tables that leverage the session catalog
### What changes were proposed in this pull request?

SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties. Hence, if they cannot be set or controlled by the user, they shouldn't be displayed.

### Why are the changes needed?

Without the fix, SHOW TBLPROPERTIES shows incorrect table properties, i.e. only what exists in the Hive MetaStore, for V2 tables that may have table properties outside of the MetaStore.

### Does this PR introduce _any_ user-facing change?

Fixes a bug

### How was this patch tested?

Regression test

Closes #28434 from brkyvz/ddlCommands.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-04 12:22:29 +00:00
Wenchen Fan f72220b8ab [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?

Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
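A minimal sketch of the idea (the rebase helper name follows Spark's `RebaseDateTime`; the surrounding loop shape is hypothetical): hoist the rebase decision out of the per-value loop so the common no-rebase path stays a plain, JIT-vectorizable copy.
```scala
import org.apache.spark.sql.catalyst.util.RebaseDateTime.rebaseJulianToGregorianMicros

def readMicros(raw: Array[Long], rebase: Boolean): Array[Long] = {
  val out = new Array[Long](raw.length)
  var i = 0
  if (!rebase) {
    while (i < raw.length) { out(i) = raw(i); i += 1 } // tight, branch-free copy
  } else {
    while (i < raw.length) { out(i) = rebaseJulianToGregorianMicros(raw(i)); i += 1 }
  }
  out
}
```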

### Why are the changes needed?

The Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades the performance a lot, as it breaks vectorization, even if the datetime values don't need rebasing (which is very likely, as dates before 1582 are rare).

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
[info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
[info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
[info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
[info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X
```

After this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
[info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
[info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
[info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
[info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X
```

Date type is 30% faster if the values don't need to be rebased, and 20% faster if they do.
Timestamp type is 60% faster if the values don't need to be rebased, and shows no difference if they do.

Closes #28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 15:30:10 +09:00
Yuming Wang 7ef0b69a92 [SPARK-31626][SQL] Port HIVE-10415: hive.start.cleanup.scratchdir configuration is not taking effect
### What changes were proposed in this pull request?
This PR ports [HIVE-10415](https://issues.apache.org/jira/browse/HIVE-10415): the `hive.start.cleanup.scratchdir` configuration is not taking effect.

### Why are the changes needed?

I encountered this issue:
![image](https://user-images.githubusercontent.com/5399861/80869375-aeafd080-8cd2-11ea-8573-93ec4b422be1.png)
I'd like to make `hive.start.cleanup.scratchdir` take effect to reduce the impact of this issue.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test

Closes #28436 from wangyum/SPARK-31626.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 14:59:33 +09:00
Tianshi Zhu a222644e1d [SPARK-31267][SQL] Flaky test: WholeStageCodegenSparkSubmitSuite.Generated code on driver should not embed platform-specific constant
### What changes were proposed in this pull request?

Allow customized timeouts for `runSparkSubmit`, which will make flaky tests more likely to pass by using a larger timeout value.

I was able to reproduce the test failure on my laptop; the test took 1.5-2 minutes to finish. After increasing the timeout, the test now passes locally.
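A hypothetical sketch of the API change (the signature and default are illustrative, not the suite's exact code):
```scala
import scala.concurrent.duration._

// The timeout becomes a parameter with the old value as its default,
// so slow tests can pass a larger one.
def runSparkSubmit(args: Seq[String], timeout: Duration = 1.minute): Unit = {
  // launch spark-submit and fail the test if it does not finish in `timeout`
}

// a slow suite can then opt in, e.g.:
// runSparkSubmit(args, timeout = 3.minutes)
```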

### Why are the changes needed?

This allows slow tests to use a larger timeout, so they are more likely to succeed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The test was able to pass on my local env after the change.

Closes #28438 from tianshizz/SPARK-31267.

Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 14:50:38 +09:00
Max Gekk 2fb85f6b68 [SPARK-31527][SQL][TESTS][FOLLOWUP] Fix the number of rows in DateTimeBenchmark
### What changes were proposed in this pull request?
- Changed the number of rows in benchmark cases from 3 to the actual number `N`.
- Regenerated benchmark results in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
The changes are needed to have:
- Correct benchmark results
- A baseline for other perf improvements that can be checked in the same environment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the benchmark and checking its output.

Closes #28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-04 09:39:50 +09:00
Max Gekk 13dddee9a8 [MINOR][SQL][TESTS] Disable UI in SQL benchmarks by default
### What changes were proposed in this pull request?
Set `spark.ui.enabled` to `false` in `SqlBasedBenchmark.getSparkSession`. This disables UI in all SQL benchmarks by default.

### Why are the changes needed?
UI overhead lowers numbers in the `Relative` column and impacts on `Stdev` in benchmark results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Checked by running `DateTimeRebaseBenchmark`.

Closes #28432 from MaxGekk/ui-off-in-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-02 17:54:36 +09:00
Pablo Langa 4fecc20f6e [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements
### What changes were proposed in this pull request?

The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType, this is not the case.

Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```

We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and their equality doesn't compare the arrays' content, e.g. `Array(1, 2, 3) == Array(1, 2, 3)` evaluates to `false`.
The result is that duplicates are not removed in the HashSet.

The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transform them back to the original type.
This transformation is only applied for BinaryType.
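A minimal standalone sketch of the equality problem and the dedup trick (plain Scala, independent of Spark's actual implementation):
```scala
import scala.collection.mutable

// Plain Java arrays hash and compare by reference, so the buffer keeps both:
val buffer = mutable.HashSet(Array[Byte](1, 2), Array[Byte](1, 2))
assert(buffer.size == 2) // duplicates survive

// Wrapping each element in an immutable Seq restores content equality; the
// deduplicated set can then be mapped back to the original Array[Byte] type.
val deduped = buffer.map(_.toSeq).map(_.toArray)
assert(deduped.size == 1)
```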

### Why are the changes needed?
Fix the bug explained

### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.

### How was this patch tested?
Unit testing

Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-01 22:09:04 +09:00
Yuanjian Li aec8b69435 [SPARK-28424][TESTS][FOLLOW-UP] Add test cases for all interval units
### What changes were proposed in this pull request?
Add test cases covering all interval units: MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, YEAR.

### Why are the changes needed?
For test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test only.

Closes #28418 from xuanyuanking/SPARK-28424.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 10:32:37 +09:00
Max Gekk c09cfb9808 [SPARK-31557][SQL] Fix timestamps rebasing in legacy parsers
### What changes were proposed in this pull request?
In the PR, I propose to fix two legacy timestamp formatter `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter` to perform micros rebasing in parsing/formatting from/to strings.

### Why are the changes needed?
Legacy timestamp formatters operate on the hybrid calendar (Julian + Gregorian), so the input micros should be rebased to have the same date-time fields as in the Proleptic Gregorian calendar used by Spark SQL; see SPARK-26651.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `TimestampFormatterSuite`

Closes #28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:45:32 +00:00
Wenchen Fan 636119c54b [SPARK-31607][SQL] Improve the perf of CTESubstitution
### What changes were proposed in this pull request?

In `CTESubstitution`, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.

### Why are the changes needed?

Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

local perf test
```
scala> :pa
// Entering paste mode (ctrl-D to finish)

def test(i: Int): Unit = 1.to(i).foreach { _ =>
  spark.sql("""
    with
    t1 as (select 1),
    t2 as (select 1),
    t3 as (select 1),
    t4 as (select 1),
    t5 as (select 1),
    t6 as (select 1),
    t7 as (select 1),
    t8 as (select 1),
    t9 as (select 1)
    select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}

// Exiting paste mode, now interpreting.

test: (i: Int)Unit

scala> test(10000)

scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)
```

The result before this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            3328796344 / 3924576425                         10000 / 20000
```
The result after this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            1503085936 / 2091992092                         10000 / 20000
```
About 2 times faster.

Closes #28407 from cloud-fan/cte.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:11:16 +00:00
Yuanjian Li 7195a18bf2 [SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests
### What changes were proposed in this pull request?

- Rephrase the API doc for `Column.as`
- Simplify the UTs

### Why are the changes needed?
Address comments in https://github.com/apache/spark/pull/28326

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT added.

Closes #28390 from xuanyuanking/SPARK-27340-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 06:24:00 +00:00
beliefer 1d1bb79bc6 [SPARK-31372][SQL][TEST] Display expression schema for double check
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
We need stronger guarantees that the aliases output by built-in functions are correct.
This PR extracts the SQL from the examples of expressions and outputs the SQL and its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.

### Why are the changes needed?
Ensure that the output alias is correct

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28194 from beliefer/check-expression-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:58:04 +00:00
Kent Yao 9241f8282f [SPARK-31586][SQL][FOLLOWUP] Restore SQL string for datetime - interval operations
### What changes were proposed in this pull request?

Because of ebc8fa50d0 and beec8d535f, the SQL output strings for date/timestamp - interval operations have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`

This PR restores the pretty SQL strings by adding a `RuntimeReplaceable` implementation for both operations.

### Why are the changes needed?

restore the SQL string for datetime operations

### Does this PR introduce any user-facing change?

No, we are restoring the previous behavior here.
### How was this patch tested?

added unit tests

Closes #28402 from yaooqinn/SPARK-31586-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:31:29 +00:00
Max Gekk 91648654da [SPARK-31553][SQL][TESTS][FOLLOWUP] Tests for collection elem types of isInCollection
### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?
To prevent regressions like introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:20:10 +00:00
Max Gekk 73eac7565d [SPARK-31557][SQL][TESTS][FOLLOWUP] Check rebasing in all legacy formatters
### What changes were proposed in this pull request?
- Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345
- Check dates rebasing in legacy parsers for only one direction either days -> string or string -> days.

### Why are the changes needed?
Round trip tests can hide issues in date rebasing. For example, if we removed rebasing from the legacy parsers (from both `parse()` and `format()`), the tests would still pass.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `DateFormatterSuite`.

Closes #28398 from MaxGekk/test-rebasing-in-legacy-date-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:19:34 +00:00
Takeshi Yamamuro 97f2c03d3b
[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema
### What changes were proposed in this pull request?

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (the column name changes from run to run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (the column name is fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed is given, it is still shown in the column name
// (the same as the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

### Why are the changes needed?

To make output schema deterministic.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-29 00:14:50 -07:00
Terry Kim 36803031e8 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views
### What changes were proposed in this pull request?

This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (2.4 behavior) instead of throwing `AnalysisException`.

### Why are the changes needed?

It's a bug.

### Does this PR introduce any user-facing change?

Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key                              |value        |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2            |
|view.query.out.col.0             |c1           |
|view.query.out.numCols           |1            |
|p2                               |v2           |
|view.catalogAndNamespace.part.0  |spark_catalog|
|p1                               |v1           |
|view.catalogAndNamespace.part.1  |default      |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

### How was this patch tested?

Added tests.

Closes #28375 from imback82/show_tblproperties_followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:06:45 +00:00
Kent Yao ea525fe8c0 [SPARK-31597][SQL] extracting day from intervals should be interval.days + days in interval.microsecond
### What changes were proposed in this pull request?

With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933

I checked with both Presto and PostgreSQL; one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. They both add the whole days carried in the interval's time part to the total days when extracting the day from interval values.

```sql

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)

Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)

Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

```

```sql

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13

```

```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```

In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.
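A minimal sketch of the proposed semantics (`CalendarInterval` is Spark's interval type; the helper name and the constant are illustrative):
```scala
import org.apache.spark.unsafe.types.CalendarInterval

// A day is exactly 24 hours for extraction purposes (ANSI), so whole days
// carried in the microseconds part are added to the interval's day field.
val MICROS_PER_DAY: Long = 24L * 60 * 60 * 1000 * 1000

def extractDay(i: CalendarInterval): Long =
  i.days + i.microseconds / MICROS_PER_DAY
```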

### Why are the changes needed?

This satisfies both the ANSI standard and common use cases in modern SQL platforms.

### Does this PR introduce any user-facing change?

No, this behavior is new in 3.0.
### How was this patch tested?

Add more UTs.

Closes #28396 from yaooqinn/SPARK-31597.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:56:33 +00:00
angerszhu 6bc8d84130 [SPARK-29492][SQL] Reset HiveSession's SessionState conf's ClassLoader when sync mode
### What changes were proposed in this pull request?
When we run SQL in the Spark Thrift Server, each session's Thrift Server methods are called in one thread, but when running a query statement we have two modes:
 1. sync
 2. async
 5a482e7209/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (L205-L238)

In sync mode, we submit the query in the current session's corresponding thread and wait for Spark to run the query and return the result; the query method always waits for the query to return.
In async mode, SparkExecuteStatementOperation submits the query to a backend thread pool and updates the operation state. After the query is submitted, the ExecuteStatement method returns an OperationHandle to the client side, and the client polls the operation status continuously. After the backend thread runs the SQL and returns, it updates the corresponding operation status; once the client sees a final status, it either gets the error or starts fetching the result of the operation.

When we use pyhive to connect to the Spark Thrift Server, it runs statements in sync mode.
When we query data of a Hive table, it checks the SerDe class in HiveTableScanExec#addColumnMetadataToConf:

5a482e7209/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala (L123)

```
  public Class<? extends Deserializer> getDeserializerClass() {
    try {
      return Class.forName(this.getSerdeClassName(), true, Utilities.getSessionSpecifiedClassLoader());
    } catch (ClassNotFoundException var2) {
      throw new RuntimeException(var2);
    }
  }

 public static ClassLoader getSessionSpecifiedClassLoader() {
    SessionState state = SessionState.get();
    if (state != null && state.getConf() != null) {
      ClassLoader sessionCL = state.getConf().getClassLoader();
      if (sessionCL != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace("Use session specified class loader");
        }

        return sessionCL;
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Session specified class loader not found, use thread based class loader");
        }

        return JavaUtils.getClassLoader();
      }
    } else {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Hive Conf not found or Session not initiated, use thread based class loader instead");
      }

      return JavaUtils.getClassLoader();
    }
  }
```
Since we run the statement in sync mode, it uses the HiveSession's SessionState and its conf's ClassLoader; then this error happens:
```
Current operation state RUNNING_STATE,
java.lang.RuntimeException: java.lang.ClassNotFoundException:
xxx.xxx.xxxJsonSerDe
  at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader$lzycompute(HiveTableScanExec.scala:110)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader(HiveTableScanExec.scala:105)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
```
We should reset it when we start running SQL in sync mode.
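A hypothetical sketch of the reset (the exact place and loader chosen in the PR are assumptions; `SessionState.getConf.setClassLoader` is standard Hive/Hadoop API):
```scala
import org.apache.hadoop.hive.ql.session.SessionState

// Before executing a statement in sync mode, point the Hive SessionState
// conf's ClassLoader at the loader that carries the jars added to this
// Spark session.
val state = SessionState.get()
if (state != null) {
  state.getConf.setClassLoader(Thread.currentThread().getContextClassLoader)
}
```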
### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
UT

Closes #26141 from AngersZhuuuu/add_jar_in_sync_mode.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:48:46 +00:00
Max Gekk 86761861c2 [SPARK-31563][SQL][FOLLOWUP] Create literals directly from Catalyst's internal value in InSet.sql
### What changes were proposed in this pull request?
In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor.
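A minimal sketch of the simplification (`Literal`, `StringType` and `UTF8String` are existing Spark classes; the snippet itself is illustrative):
```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// The default Literal constructor takes an internal value together with its
// DataType, so no round trip through external Scala types is needed.
val lit = Literal(UTF8String.fromString("a"), StringType)
// lit.sql is expected to render as 'a' (assumption about the pretty form)
```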

### Why are the changes needed?
This simplifies code and avoids unnecessary conversions to external types.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`.

Closes #28399 from MaxGekk/fix-InSet-sql-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:44:22 +00:00
Kent Yao 295d866969 [SPARK-31596][SQL][DOCS] Generate SQL Configurations from hive module to configuration doc
### What changes were proposed in this pull request?

This PR adds the `-Phive` profile to the pre-build phase to build the hive module into the dev classpath.
It then reflects the HiveUtils object to dump all configurations in the class.

### Why are the changes needed?

Supply SQL configurations from the hive module to the configuration doc.

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Passing Jenkins, and verified locally:

![image](https://user-images.githubusercontent.com/8326978/80492333-6fae1200-8996-11ea-99fd-595ee18c67e5.png)

Closes #28394 from yaooqinn/SPARK-31596.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-29 15:34:45 +09:00
Kent Yao 54996be4d2 [SPARK-31527][SQL][TESTS][FOLLOWUP] Add a benchmark test for datetime add/subtract interval operations
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/28310, the operation of date +/- interval(m, d, 0) has been improved a lot.

According to the benchmark results, about 75% of the time cost is reduced because dates are no longer cast to timestamps and back.

In this PR, we add a benchmark for these operations, plus timestamp +/- interval operations as accessories.

### Why are the changes needed?

Performance test coverage, since these operations are missing in the DateTimeBenchmark.

### Does this PR introduce any user-facing change?

No, just test

### How was this patch tested?

regenerated benchmark results

Closes #28369 from yaooqinn/SPARK-31527-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 15:39:28 +00:00
Max Gekk b7cabc80e6 [SPARK-31553][SQL] Revert "[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection"
### What changes were proposed in this pull request?
This reverts commit 5631a96367.

Closes #28328

### Why are the changes needed?
The PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 14:10:50 +00:00
Kent Yao beec8d535f [SPARK-31586][SQL] Replace expression TimeSub(l, r) with TimeAdd(l -r)
### What changes were proposed in this pull request?

The implementation of TimeSub for the timestamp-minus-interval operation is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent.

Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239

Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer useful, so this PR removes them as well, since they are touched here anyway.
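A minimal sketch of the equivalence in Catalyst terms (`TimeAdd` and `UnaryMinus` are existing Catalyst expressions; the wrapper function is illustrative):
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd, UnaryMinus}

// Wherever TimeSub(l, r) would have been built, build the equivalent
// TimeAdd(l, -r) instead, so only one implementation has to be maintained.
def timestampMinusInterval(l: Expression, r: Expression): Expression =
  TimeAdd(l, UnaryMinus(r))
```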

### Why are the changes needed?

remove redundant and useless code for easy maintenance

### Does this PR introduce any user-facing change?

Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)`.
### How was this patch tested?

modified existing unit tests.

Closes #28381 from yaooqinn/SPARK-31586.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 14:01:07 +00:00
Yuanjian Li 6ed2dfbba1 [SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result
### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
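A sketch of what such a node can look like (the node name comes from this PR; the body is a simplified assumption):
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, UnaryNode}

// HAVING stays in its own unresolved wrapper until the analyzer can resolve
// the condition against the child Aggregate, instead of being parsed
// straight into Filter(..., Aggregate(...)).
case class AggregateWithHaving(
    havingCondition: Expression,
    child: Aggregate) extends UnaryNode {
  override lazy val resolved: Boolean = false
  override def output: Seq[Attribute] = child.output
}
```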

### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```

### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.

### How was this patch tested?
New UT added.

Closes #28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 08:11:41 +00:00
jiake 079b3623c8 [SPARK-31524][SQL] Add metric to the split task number for skew optimization
### What changes were proposed in this pull request?
This is a followup of [#28022](https://github.com/apache/spark/pull/28022), to add the metric info of the split task number for the skew optimization.
With this PR, we can see the number of splits for the skewed partitions as follows:
![image](https://user-images.githubusercontent.com/11972570/80294583-ff886c00-879c-11ea-813c-2db302f99f04.png)

### Why are the changes needed?
make UI more friendly

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing ut

Closes #28109 from JkSelf/addSplitNumer.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 07:21:00 +00:00
Dongjoon Hyun 79eaaaf6da
[SPARK-31580][BUILD] Upgrade Apache ORC to 1.5.10
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.5.10.

### Why are the changes needed?

Apache ORC 1.5.10 is a maintenance release with the following patches.

- [ORC-621](https://issues.apache.org/jira/browse/ORC-621) Need reader fix for ORC-569
- [ORC-616](https://issues.apache.org/jira/browse/ORC-616) In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte
- [ORC-613](https://issues.apache.org/jira/browse/ORC-613) OrcMapredRecordReader mis-reuse struct object when actual children schema differs
- [ORC-610](https://issues.apache.org/jira/browse/ORC-610) Updated Copyright year in the NOTICE file

The following is release note.
- https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12346912

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing ORC tests and a newly added test case.

- The first commit is already tested in `hive-2.3` profile with both native ORC implementation and Hive 2.3 ORC implementation. (https://github.com/apache/spark/pull/28373#issuecomment-620265114)
- The latest run is about to make the test case disable in `hive-1.2` profile which doesn't use Apache ORC.
  - `hive-1.2`: https://github.com/apache/spark/pull/28373#issuecomment-620325906

Closes #28373 from dongjoon-hyun/SPARK-ORC-1.5.10.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 18:56:30 -07:00
Wenchen Fan 2f4f38b6f1
[SPARK-31577][SQL] Fix case-sensitivity and forward name conflict problems when check name conflicts of CTE relations
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable by adding some comments to explain the trick and simplifying the code to use a boolean flag instead of 2 string sets.

This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` are not real conflicts, and we shouldn't fail.

### Why are the changes needed?

correct the behavior

### Does this PR introduce any user-facing change?

Yes, it fixes the aforementioned behaviors.

### How was this patch tested?

new tests

Closes #28371 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 16:47:39 -07:00
Yuanjian Li ba7adc4949
[SPARK-27340][SS] Alias on TimeWindow expression cause watermark metadata lost
Credit to LiangchangZ: this PR reuses the UT as well as the integration test in #24457. Thanks Liangchang for your solid work.

### What changes were proposed in this pull request?
Make metadata propagatable between Aliases.
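A hypothetical sketch of the propagation (`Alias` and `NamedExpression` are Catalyst classes; whether the PR wires it exactly this way is an assumption):
```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}

// When re-aliasing an already named expression, carry its metadata (which
// includes the watermark marker) into the new Alias unless explicit
// metadata was supplied by the caller.
def aliasKeepingMetadata(ne: NamedExpression, name: String): Alias =
  Alias(ne, name)(explicitMetadata = Some(ne.metadata))
```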

### Why are the changes needed?
In Structured Streaming, we added an Alias for TimeWindow by default.
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L3272-L3273)
For some cases, like a stream join with watermark and window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` loses the watermark metadata
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/Column.scala (L1049-L1054)
and finally causes the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```

### Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.

### How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.

Closes #28326 from xuanyuanking/SPARK-27340.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 15:07:52 -07:00
Kousuke Saruta 7d4d05c684
[SPARK-31565][WEBUI][FOLLOWUP] Add font color setting of DAG-viz for query plan
### What changes were proposed in this pull request?

This PR adds a font color setting of DAG-viz for query plan.

### Why are the changes needed?

#28352 aimed to unify the font color of all DAG-viz in WebUI but there is one part left over.

Before this change applied, the appearance of a query plan is like as follows.
<img width="456" alt="plan-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80325600-ca4d4e00-8870-11ea-945c-64971dbb752c.png">
The color of `WholeStageCodegen (1)` and its following text (`duration: total...`) is slightly darker than `SerializeFromObject`.

After this change, those color is unified as `#333333`.
<img width="450" alt="plan-graph-fixed2" src="https://user-images.githubusercontent.com/4736016/80325651-fb2d8300-8870-11ea-8ed8-178c124d224c.png">

### Does this PR introduce any user-facing change?

Slightly yes.

### How was this patch tested?

I confirmed the style of `fill` and `color` is `#333333` by debug console of Chrome.
<img width="321" alt="fill" src="https://user-images.githubusercontent.com/4736016/80325760-6c6d3600-8871-11ea-82e7-e789bf741f2a.png">
<img width="316" alt="color" src="https://user-images.githubusercontent.com/4736016/80325765-70995380-8871-11ea-8976-7020205d585c.png">

Closes #28355 from sarutak/followup-SPARK-31565.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 13:34:43 -07:00
Kent Yao 5ba467ca1d [SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc
### What changes were proposed in this pull request?

```scala
spark.sql.session.timeZone

spark.sql.warehouse.dir
```
These two configs are nondeterministic and vary with the environment.

Besides, this reflects the review comments for `gen-sql-config-docs.py` (https://github.com/apache/spark/pull/28274#discussion_r412893096) and `configuration.md` (https://github.com/apache/spark/pull/28274#discussion_r412894905).
### Why are the changes needed?

doc fix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

verify locally
![image](https://user-images.githubusercontent.com/8326978/80179099-5e7da200-8632-11ea-803f-d47a93151869.png)

Closes #28322 from yaooqinn/SPARK-31550.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-27 17:08:52 +09:00
yi.wu 7df658414b [SPARK-31529][SQL] Remove extra whitespaces in formatted explain
### What changes were proposed in this pull request?

Remove all the extra whitespaces in the formatted explain.

### Why are the changes needed?

The number of extra whitespaces in the formatted explain differs between master and branch-3.0. This causes a problem: whenever we backport formatted-explain-related tests from master to branch-3.0, they fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them wherever we can.

### Does this PR introduce any user-facing change?

No, formatted explain is newly added in Spark 3.0.

### How was this patch tested?

Updated sql query tests.

Closes #28315 from Ngone51/fix_extra_spaces.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 07:39:24 +00:00
Kent Yao ebc8fa50d0 [SPARK-31527][SQL] date add/subtract interval only allow those day precision in ansi mode
### What changes were proposed in this pull request?

To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0.
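An illustration of the intended behavior (that the second statement fails analysis under `spark.sql.ansi.enabled=true` is the expectation this PR enforces):
```scala
// Day-precision intervals are accepted; sub-day parts are rejected for
// date +/- interval in ANSI mode.
spark.sql("SELECT date'2020-01-01' + interval 2 day").show()  // fine
spark.sql("SELECT date'2020-01-01' + interval 1 hour").show() // should fail analysis
```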

### Why are the changes needed?

Better ANSI compliance

### Does this PR introduce any user-facing change?

No, this PR should target 3.0.0 in which this feature is newly added.

### How was this patch tested?

add more unit tests

Closes #28310 from yaooqinn/SPARK-31527.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 05:28:46 +00:00
Bruce Robbins a911287244 [SPARK-31557][SQL] Legacy time parser should return Gregorian days rather than Julian days
### What changes were proposed in this pull request?

This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.
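A minimal sketch of the fix (the rebase helper and constant follow Spark's `RebaseDateTime` and `DateTimeConstants`; the wrapper function is hypothetical):
```scala
import java.text.SimpleDateFormat
import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY
import org.apache.spark.sql.catalyst.util.RebaseDateTime.rebaseJulianToGregorianDays

// The legacy SimpleDateFormat yields days in the hybrid Julian calendar,
// which are rebased before being returned to callers that expect
// proleptic Gregorian days.
def parseToGregorianDays(fmt: SimpleDateFormat, s: String): Int = {
  val julianDays = Math.toIntExact(Math.floorDiv(fmt.parse(s).getTime, MILLIS_PER_DAY))
  rebaseJulianToGregorianDays(julianDays)
}
```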

### Why are the changes needed?

The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manual testing and modified existing unit tests.

Closes #28345 from bersprockets/SPARK-31557.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 05:00:36 +00:00
Juliusz Sompolski 560bd5401f [SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly
### What changes were proposed in this pull request?

`CliSuite.runCliWithin` did not match expected results correctly: it matched expected lines anywhere in stdout or stderr.
Take the `Single command with --database` test as an example. In
```
    runCliWithin(2.minute)(
      "CREATE DATABASE hive_db_test;"
        -> "",
      "USE hive_test;"
        -> "",
      "CREATE TABLE hive_test(key INT, val STRING);"
        -> "",
      "SHOW TABLES;"
        -> "hive_test"
    )
```
It was looking for lines containing "", "", "" and then "hive_test".
However, the string "hive_test" was contained in "hive_test_db", and hence:
```
2020-04-08 17:53:12,752 INFO  CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO  CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO  CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO  CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO  CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO  CliSuite - Found all expected output.
```
Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually created the table.

I made the output checking more robust so it does not match on unexpected output, by making it check the echo of the query and its output on the CLI. Also, the test now waits for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly.

### Why are the changes needed?

org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests in CliSuite. Ran several times with no flakiness; it was getting flaky results on almost every run before.
```
[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)
```

Closes #28156 from juliuszsompolski/SPARK-31388.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 04:27:55 +00:00
Kousuke Saruta 91ec2eacfa
[SPARK-31565][WEBUI] Unify the font color of label among all DAG-viz
### What changes were proposed in this pull request?

This PR unifies the font color of label as `#333333` among all DAG-viz.

### Why are the changes needed?

For the consistent appearance among all DAG-viz.
There are three types of DAG-viz in the WebUI.
One is for stages, another one is for RDDs and the last one is for query plans.
But the font color of labels are slightly different among them.

For stages, the color is `#333333` (simply 333) which is specified by `spark-dag-viz.css`.
<img width="355" alt="job-graph" src="https://user-images.githubusercontent.com/4736016/80321397-b517f580-8857-11ea-8c8e-cf68f648ab05.png">
<img width="310" alt="job-graph-color" src="https://user-images.githubusercontent.com/4736016/80321399-ba754000-8857-11ea-8708-83bdef4bc1d1.png">

For RDDs, the color is `#212529` which is specified by `bootstrap.min.js`.
<img width="386" alt="stage-graph" src="https://user-images.githubusercontent.com/4736016/80321438-f0b2bf80-8857-11ea-9c2a-13fa0fd1431c.png">
<img width="313" alt="stage-graph-color" src="https://user-images.githubusercontent.com/4736016/80321444-fa3c2780-8857-11ea-81b2-4f1203d47896.png">

For query plans, the color is `black` which is specified by `spark-sql-viz.css`.
<img width="449" alt="plan-graph" src="https://user-images.githubusercontent.com/4736016/80321490-61f27280-8858-11ea-9c3a-2c98d3d4d03b.png">
<img width="316" alt="plan-graph-color" src="https://user-images.githubusercontent.com/4736016/80321496-6ae34400-8858-11ea-8fe8-0d6e4a821608.png">

After the change, the appearance is like as follows (no change for stages).

For RDDs.
<img width="389" alt="stage-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321613-6b300f00-8859-11ea-912f-d92474aa9f47.png">

For query plans.
<img width="456" alt="plan-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321638-9a468080-8859-11ea-974c-33c56a8ffe1a.png">

### Does this PR introduce any user-facing change?

Yes. The unified color is slightly lighter than ever.

### How was this patch tested?

Confirmed that the color code among all DAG-viz are `#333333` using browser's debug console.

Closes #28352 from sarutak/unify-label-color.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 16:57:23 -07:00
Max Gekk bd139bda4a
[SPARK-31489][SQL] Fix pushing down filters with java.time.LocalDate values in ORC
### What changes were proposed in this pull request?
Convert `java.time.LocalDate` to `java.sql.Date` in pushed-down filters to the ORC datasource when the Java 8 time API is enabled.
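A minimal sketch of the conversion applied to pushed-down literal values (context simplified; `java.sql.Date.valueOf(LocalDate)` is standard Java 8 API):
```scala
import java.time.LocalDate

// ORC's SearchArgument only accepts java.sql.Date for DATE leaves, so Java 8
// dates in pushed filters are converted before the predicate leaf is built.
def toOrcDateLiteral(value: Any): Any = value match {
  case ld: LocalDate => java.sql.Date.valueOf(ld)
  case other => other
}
```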

Closes #28272

### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
Wrong value class java.time.LocalDate for DATE.EQUALS leaf
java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
	at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `OrcFilterSuite`.

Closes #28261 from MaxGekk/orc-date-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 15:49:00 -07:00
Peter Toth 4f53bfbbd5
[SPARK-31535][SQL] Fix nested CTE substitution
### What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
```
Before this PR the result was `1`.

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new test case.

Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 15:31:32 -07:00
Takeshi Yamamuro e01125db0d
[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp
### What changes were proposed in this pull request?

This PR intends to add entries for substring, current_date, and current_timestamp in the SQL built-in function documents. Specifically, the entries are as follows;

 - SELECT current_date;
 - SELECT current_timestamp;
 - SELECT substring('abcd' FROM 1);
 - SELECT substring('abcd' FROM 1 FOR 2);

### Why are the changes needed?

To make the SQL (built-in functions) references complete.

### Does this PR introduce any user-facing change?

<img width="1040" alt="Screen Shot 2020-04-25 at 16 51 07" src="https://user-images.githubusercontent.com/692303/80274851-6ca5ee00-8718-11ea-9a35-9ae82008cb4b.png">

<img width="974" alt="Screen Shot 2020-04-25 at 17 24 24" src="https://user-images.githubusercontent.com/692303/80275032-a88d8300-8719-11ea-92ec-95b80169ae28.png">

<img width="862" alt="Screen Shot 2020-04-25 at 17 27 48" src="https://user-images.githubusercontent.com/692303/80275114-36696e00-871a-11ea-8e39-02e93eabb92f.png">

### How was this patch tested?

Added test examples.

Closes #28342 from maropu/SPARK-31562.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 11:46:52 -07:00
Gengliang Wang f59ebdef5b [SPARK-31558][UI] Code clean up in spark-sql-viz.js
### What changes were proposed in this pull request?

1. Remove console.log() calls, which seem unnecessary in releases.
2. Replace double equals with triple equals.
3. Reuse the jQuery selector.

### Why are the changes needed?

For better code quality.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests + manual test.

Closes #28333 from gengliangwang/removeLog.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-04-25 13:43:52 -07:00
Kent Yao 7959808e96
[SPARK-31564][TESTS] Fix flaky AllExecutionsPageSuite for checking 1970
### What changes were proposed in this pull request?

Fix flakiness by checking `1970/01/01` instead of `1970`.
The test was added by SPARK-27125 for 3.0.0.

### Why are the changes needed?

the `org.apache.spark.sql.execution.ui.AllExecutionsPageSuite.SPARK-27019:correctly display SQL page when event reordering happens` test is flaky because it only checks that the `html` content does not contain `1970`. I will add a ticket to check and fix that.
In the specific failure https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121799/testReport, it failed because the `html`

```
...
<td sorttable_customkey="1587806019707">
...

```
contained `1970` as a substring of the timestamp key `1587806019707`.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Passing Jenkins.

Closes #28344 from yaooqinn/SPARK-31564.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-25 10:27:05 -07:00
Max Gekk 7d8216a664
[SPARK-31563][SQL] Fix failure of InSet.sql for collections of Catalyst's internal types
### What changes were proposed in this pull request?
In this PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types and wrapped in `Literal` to properly form the SQL view of the input collection.
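
The gist, sketched with Catalyst's public helpers (illustrative, not the verbatim patch):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// convert each internal value (e.g. UTF8String) back to its external
// Scala type before wrapping it in a Literal for the SQL string
def setToSQL(hset: Set[Any], dt: DataType): String =
  hset.map(e => Literal(convertToScala(e, dt)).sql).mkString(", ")

setToSQL(Set(UTF8String.fromString("abc")), StringType) // "'abc'"
```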

### Why are the changes needed?
The changes fix a bug in `InSet.sql` that made a wrong assumption about the types of the collection elements. See more details in SPARK-31563.

### Does this PR introduce any user-facing change?
Highly likely, no.

### How was this patch tested?
Added a test to `ColumnExpressionSuite`

Closes #28343 from MaxGekk/fix-InSet-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-25 09:29:51 -07:00
Gengliang Wang 16b961526d
[SPARK-31560][SQL][TESTS] Add V1/V2 tests for TextSuite and WholeTextFileSuite
### What changes were proposed in this pull request?

 Add V1/V2 tests for TextSuite and WholeTextFileSuite

### Why are the changes needed?

This part has been missing since #24207. We should have these tests for test coverage.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #28335 from gengliangwang/testV2Suite.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:59:15 -07:00
Kent Yao f92652d0b5
[SPARK-31528][SQL] Remove millennium, century, decade from trunc/date_trunc functions
### What changes were proposed in this pull request?

Similar to https://jira.apache.org/jira/browse/SPARK-31507, millennium, century, and decade are not commonly supported in modern SQL platforms.

For example
Negative:
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts
https://prestodb.io/docs/current/functions/datetime.html#date_trunc
https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc
https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/

Positive:
https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

This PR removes these `fmt`s support for trunc and date_trunc functions.
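
Illustratively (assuming a running SparkSession `spark`), only the remaining units keep working; Spark's `trunc` yields NULL for an unrecognized unit:

```scala
spark.sql("SELECT trunc(date'2020-04-24', 'YEAR')").show()       // 2020-01-01
spark.sql("SELECT trunc(date'2020-04-24', 'MILLENNIUM')").show() // NULL after this change
```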

### Why are the changes needed?

Clean up uncommon datetime units for easier maintenance; we can add them back later if they prove very useful.

### Does this PR introduce any user-facing change?
No; targeting 3.0.0, as these units are newly added in 3.0.0.

### How was this patch tested?

Removed and modified the existing unit tests.

Closes #28313 from yaooqinn/SPARK-31528.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:28:41 -07:00
Kent Yao caf3ab8411
[SPARK-31552][SQL] Fix ClassCastException in ScalaReflection arrayClassFor
### What changes were proposed in this pull request?

The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases handled in `dataTypeFor` are not fully covered in `arrayClassFor`.

For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(<console>:57)
  ... 53 elided

scala>
```

In this PR, we add the missing cases to `arrayClassFor`.

### Why are the changes needed?

Bug fix, as described above.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Added a test for array encoders.

Closes #28324 from yaooqinn/SPARK-31552.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:04:26 -07:00
Kent Yao 8424f55229 [SPARK-31532][SQL] Builder should not propagate static sql configs to the existing active or default SparkSession
### What changes were proposed in this pull request?

SparkSessionBuilder should not propagate static SQL configurations to the existing active/default SparkSession.
This seems to be a long-standing bug.

```scala
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|spark.sql.warehou...|file:/Users/kenty...|
+--------------------+--------------------+

scala> spark.sql("set spark.sql.warehouse.dir=2");
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir;
  at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
  at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
  at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
  ... 47 elided

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
getClass   getOrCreate

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate
20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6403d574

scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.warehou...|  xyz|
+--------------------+-----+

scala>
```
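
A sketch of the expected behavior once the builder stops propagating static configs (illustrative, continuing the shell session above):

```scala
// getOrCreate still warns and returns the existing session, but the
// static conf of that session is no longer overwritten
val again = SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate()
again.conf.get("spark.sql.warehouse.dir") // the original value, not "xyz"
```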

### Why are the changes needed?
bugfix as shown in the previous section

### Does this PR introduce any user-facing change?

Yes. Static SQL configurations set via SparkSession.builder.config no longer propagate to the existing active or default SparkSession.

### How was this patch tested?

New unit test.

Closes #28316 from yaooqinn/SPARK-31532.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-25 08:53:00 +09:00
Jian Tang 6a576161ae [SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown
### What changes were proposed in this pull request?

This PR aims to add a benchmark suite for nested predicate pushdown with parquet file:

Performance comparison of nested predicate pushdown disabled vs. enabled, with the following query scenarios:

1. When the predicate is pushed down, the Parquet reader is able to filter out all the row groups without loading them.

2. When the predicate is pushed down, the Parquet reader only loads one of the row groups.

3. When the predicate is pushed down, the Parquet reader can't filter out any row group; this checks whether enabling nested predicate pushdown introduces too much overhead.

### Why are the changes needed?

No benchmark exists today for nested fields predicate pushdown performance evaluation.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Benchmark runs and the reported results.

Closes #28319 from JiJiTang/SPARK-31364.

Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-04-24 22:10:58 +00:00
Yuming Wang b10263b8e5 [SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators
### What changes were proposed in this pull request?

`LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when matching a text field against a number of patterns. For example:

Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');

--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);

-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```

This PR adds support for these two operators.
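
The semantics are an OR over the patterns for ANY/SOME and an AND for ALL; illustratively, in a Spark shell:

```scala
spark.sql("SELECT 'foo' LIKE ANY ('%foo%', '%bar%')").show() // true: matches '%foo%'
spark.sql("SELECT 'foo' LIKE ALL ('%foo%', '%bar%')").show() // false: no '%bar%' match
```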

More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html

### Why are the changes needed?

To smoothly migrate SQLs to Spark SQL.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Closes #27477 from wangyum/SPARK-30724.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-24 22:20:32 +09:00
yi.wu 463c54419b [SPARK-31010][SQL][DOC][FOLLOW-UP] Improve deprecated warning message for untyped scala udf
### What changes were proposed in this pull request?

Give users a friendlier warning message and migration guide for the deprecated untyped Scala udf.

### Why are the changes needed?

Users cannot distinguish the function signatures of the typed and untyped Scala udf. Instead, we should tell users directly what to do.

### Does this PR introduce any user-facing change?

No, it's newly added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #28311 from Ngone51/update_udf_doc.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-24 19:10:18 +09:00
Jungtaek Lim (HeartSaVioR) 39bc50dbf8 [SPARK-30804][SS] Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog
### What changes were proposed in this pull request?

This patch adds some log messages to log elapsed time for "compact" operation in FileStreamSourceLog and FileStreamSinkLog (added in CompactibleFileStreamLog) to help investigating the mysterious latency spike during the batch run.

### Why are the changes needed?

Tracking latency is a critical aspect of streaming queries. While the "compact" operation may add nontrivial latency (it's even synchronous, adding all of its latency to the batch run), it's not measured, so end users have to guess.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A for UT. Manual test with streaming query using file source & file sink.

> grep "for compact batch" <driver log>

```
...
20/02/20 19:27:36 WARN FileStreamSinkLog: Compacting took 24473 ms (load: 14185 ms, write: 10288 ms) for compact batch 21359
20/02/20 19:27:39 WARN FileStreamSinkLog: Loaded 1068000 entries (397985432 bytes in memory), and wrote 1068000 entries for compact batch 21359
20/02/20 19:29:52 WARN FileStreamSourceLog: Compacting took 3777 ms (load: 1524 ms, write: 2253 ms) for compact batch 21369
20/02/20 19:29:52 WARN FileStreamSourceLog: Loaded 229477 entries (68970112 bytes in memory), and wrote 229477 entries for compact batch 21369
20/02/20 19:30:17 WARN FileStreamSinkLog: Compacting took 24183 ms (load: 12992 ms, write: 11191 ms) for compact batch 21369
20/02/20 19:30:20 WARN FileStreamSinkLog: Loaded 1068500 entries (398171880 bytes in memory), and wrote 1068500 entries for compact batch 21369
...
```

![Screen Shot 2020-02-21 at 12 34 22 PM](https://user-images.githubusercontent.com/1317309/75002142-c6830100-54a6-11ea-8da6-17afb056653b.png)

These messages explain why the operation duration peaks every 10 batches, which is the compact interval. The addBatch latency increases heavily at each peak, which DOES NOT mean it takes more time to write outputs, but we have no way to know that if such messages are not present.

NOTE: The output may differ a bit from the code, as it may have changed during the review phase.

Closes #27557 from HeartSaVioR/SPARK-30804.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-24 12:34:44 +09:00
Max Gekk 26165427c7 [SPARK-31488][SQL] Support java.time.LocalDate in Parquet filter pushdown
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days.
3. Added an implicit conversion from `LocalDate` to `Expression` (`Literal`).
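
A minimal sketch of item 2, the dual-type conversion (the real code lives in `ParquetFilters` and uses Spark's `DateTimeUtils`; this only shows the idea):

```scala
import java.sql.Date
import java.time.LocalDate

// convert either date representation to days since the epoch 1970-01-01
def dateToDays(date: Any): Int = date match {
  case d: Date       => Math.toIntExact(d.toLocalDate.toEpochDay)
  case ld: LocalDate => Math.toIntExact(ld.toEpochDay)
}
```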

### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #28259 from MaxGekk/parquet-filter-java8-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-24 02:21:53 +00:00
Takeshi Yamamuro 42f496f6ac [SPARK-31526][SQL][TESTS] Add a new test suite for ExpressionInfo
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows:

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-24 11:19:20 +09:00
Kent Yao 8dc2c0247b [SPARK-31522][SQL] Hive metastore client initialization related configurations should be static
### What changes were proposed in this pull request?
The HiveClient instance is cross-session, so the following configurations, which are defined in HiveUtils and used to create it, should be considered static:

1. spark.sql.hive.metastore.version - used to determine the hive version in Spark
2. spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client
3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes -  package names of classes that are shared or separated between SparkContextLoader and hive client class loader

These are used only once, when creating the Hive metastore client. They should be static in SQLConf so they are retrieved correctly, and users should not be able to change them with the SET/RESET commands.

As for spark.sql.hive.version - the fake counterpart of spark.sql.hive.metastore.version - it is used by the JDBC/Thrift client for backward compatibility.
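
Once these configs are static, attempts to change them should fail the same way other static configs do (illustrative):

```scala
spark.sql("SET spark.sql.hive.metastore.jars=maven")
// org.apache.spark.sql.AnalysisException:
//   Cannot modify the value of a static config: spark.sql.hive.metastore.jars
```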

### Why are the changes needed?

Bug fix: these configurations should not be changeable at runtime.

### Does this PR introduce any user-facing change?

Yes, the following set of configs is no longer allowed to change:
```
Seq("spark.sql.hive.metastore.version",
      "spark.sql.hive.metastore.jars",
      "spark.sql.hive.metastore.sharedPrefixes",
      "spark.sql.hive.metastore.barrierPrefixes")
```
### How was this patch tested?

Added a unit test.

Closes #28302 from yaooqinn/SPARK-31522.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 15:07:44 +00:00
Yuanjian Li ca90e1932d [SPARK-31515][SQL] Canonicalize Cast should consider the value of needTimeZone
### What changes were proposed in this pull request?
Override the canonicalized fields with respect to the result of `needsTimeZone`.

### Why are the changes needed?
The current approach breaks the semantic equality of two cast expressions that are unrelated to datetime types. If the `timeZone` information is not needed when casting the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result.
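
Illustratively, after the fix, two casts that differ only in their time zone id should be semantically equal whenever the cast does not consult the time zone:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.LongType

// int -> long never needs a time zone, so the ids must not matter
val a = Cast(Literal(1), LongType, Some("UTC"))
val b = Cast(Literal(1), LongType, Some("America/Los_Angeles"))
assert(a.semanticEquals(b))
```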

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #28288 from xuanyuanking/SPARK-31515.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-23 14:32:10 +09:00
Takeshi Yamamuro 820733aee2 [SPARK-31476][SQL][FOLLOWUP] Add tests for extract('field', source)
### What changes were proposed in this pull request?

SPARK-31476 supported `extract('field', source)` as a side effect, so this PR intends to add some tests for that form in `SQLQueryTestSuite`.
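
Both call forms should now be accepted and equivalent (illustrative):

```scala
spark.sql("SELECT extract(YEAR FROM timestamp'2020-04-23 04:59:59')").show() // 2020
spark.sql("SELECT extract('YEAR', timestamp'2020-04-23 04:59:59')").show()   // 2020
```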

### Why are the changes needed?

For better test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28276 from maropu/SPARK-31476-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 04:59:59 +00:00
Kent Yao 3b5792114a [SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info
### What changes were proposed in this pull request?

\_FUNC\_ has been used in the note() field of `ExpressionDescription` since https://github.com/apache/spark/pull/28248, and there may be more cases later; we should replace it with the function name in the documentation.

### Why are the changes needed?

doc fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Passed Jenkins, and verified locally with `jekyll serve`.

Closes #28305 from yaooqinn/SPARK-31474-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 13:33:04 +09:00
Max Gekk e7856a7902 [MINOR][SQL] Add comments for filters values and return values of Row.get()/apply()
### What changes were proposed in this pull request?
- Document row field values of `DATE` and `TIMESTAMP` type returned by `Row.get()` and `Row.apply`.
- Refer to `Row.get()` from the description of filter values
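
Illustratively, the documented behavior is that the external type depends on the flag:

```scala
// spark.sql.datetime.java8API.enabled = false -> java.sql.Date (default)
// spark.sql.datetime.java8API.enabled = true  -> java.time.LocalDate
val row = spark.sql("SELECT date'2020-04-23' AS d").head()
row.get(0) // type varies with the flag, as the new comments explain
```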

### Why are the changes needed?
Reflect the current behaviour of Row's methods `apply()` and `get()` in comments, to inform users that the return types depend on the SQL config setting `spark.sql.datetime.java8API.enabled`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run `$ ./dev/scalastyle`

Closes #28300 from MaxGekk/doc-filter-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 04:23:33 +00:00
Gabor Somogyi c619990c1d [SPARK-31272][SQL] Support DB2 Kerberos login in JDBC connector
### What changes were proposed in this pull request?
When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in an enterprise environment where exposing simple authentication access is not an option due to IT policy issues.

In this PR I've added DB2 support (other supported databases will come in later PRs).

What this PR contains:
* Added `DB2ConnectionProvider`
* Added `DB2ConnectionProviderSuite`
* Added `DB2KrbIntegrationSuite` docker integration test
* Changed DB2 JDBC driver to use the latest (test scope only)
* Changed test table data type to a type which is supported by all the databases
* Removed double connection creation on test side
* Increased connection timeout in docker tests because the DB2 docker container takes quite some time to start

### Why are the changes needed?
Missing JDBC kerberos support.

### Does this PR introduce any user-facing change?
Yes, users are now able to connect to DB2 using Kerberos.

### How was this patch tested?
* Additional + existing unit tests
* Additional + existing integration tests
* Test on cluster manually

Closes #28215 from gaborgsomogyi/SPARK-31272.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
2020-04-22 17:10:30 -07:00
yi.wu 8fbfdb38c0 [SPARK-31495][SQL] Support formatted explain for AQE
### What changes were proposed in this pull request?

To support formatted explain for AQE.

### Why are the changes needed?

AQE does not support formatted explain yet. It's good to support it for better user experience, debugging, etc.
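
Requesting the mode goes through the existing string-based explain API (illustrative); with this PR, the nodes get real numbered entries instead of `(unknown)`:

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
val df = spark.sql("SELECT k, count(*) FROM VALUES (1), (1), (2) AS t(k) GROUP BY k")
df.explain("formatted") // prints a plan like the "After:" example below
```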

Before:
```
== Physical Plan ==
AdaptiveSparkPlan (1)
+- * HashAggregate (unknown)
   +- CustomShuffleReader (unknown)
      +- ShuffleQueryStage (unknown)
         +- Exchange (unknown)
            +- * HashAggregate (unknown)
               +- * Project (unknown)
                  +- * BroadcastHashJoin Inner BuildRight (unknown)
                     :- * LocalTableScan (unknown)
                     +- BroadcastQueryStage (unknown)
                        +- BroadcastExchange (unknown)
                           +- LocalTableScan (unknown)

(1) AdaptiveSparkPlan
Output [4]: [k#7, count(v1)#32L, sum(v1)#33L, avg(v2)#34]
Arguments: HashAggregate(keys=[k#7], functions=[count(1), sum(cast(v1#8 as bigint)), avg(cast(v2#19 as bigint))]), AdaptiveExecutionContext(org.apache.spark.sql.SparkSession104ab57b), [PlanAdaptiveSubqueries(Map())], false
```

After:
```
== Physical Plan ==
 AdaptiveSparkPlan (14)
 +- * HashAggregate (13)
    +- CustomShuffleReader (12)
       +- ShuffleQueryStage (11)
          +- Exchange (10)
             +- * HashAggregate (9)
                +- * Project (8)
                   +- * BroadcastHashJoin Inner BuildRight (7)
                      :- * Project (2)
                      :  +- * LocalTableScan (1)
                      +- BroadcastQueryStage (6)
                         +- BroadcastExchange (5)
                            +- * Project (4)
                               +- * LocalTableScan (3)

 (1) LocalTableScan [codegen id : 2]
 Output [2]: [_1#x, _2#x]
 Arguments: [_1#x, _2#x]

 (2) Project [codegen id : 2]
 Output [2]: [_1#x AS k#x, _2#x AS v1#x]
 Input [2]: [_1#x, _2#x]

 (3) LocalTableScan [codegen id : 1]
 Output [2]: [_1#x, _2#x]
 Arguments: [_1#x, _2#x]

 (4) Project [codegen id : 1]
 Output [2]: [_1#x AS k#x, _2#x AS v2#x]
 Input [2]: [_1#x, _2#x]

 (5) BroadcastExchange
 Input [2]: [k#x, v2#x]
 Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#x]

 (6) BroadcastQueryStage
 Output [2]: [k#x, v2#x]
 Arguments: 0

 (7) BroadcastHashJoin [codegen id : 2]
 Left keys [1]: [k#x]
 Right keys [1]: [k#x]
 Join condition: None

 (8) Project [codegen id : 2]
 Output [3]: [k#x, v1#x, v2#x]
 Input [4]: [k#x, v1#x, k#x, v2#x]

 (9) HashAggregate [codegen id : 2]
 Input [3]: [k#x, v1#x, v2#x]
 Keys [1]: [k#x]
 Functions [3]: [partial_count(1), partial_sum(cast(v1#x as bigint)), partial_avg(cast(v2#x as bigint))]
 Aggregate Attributes [4]: [count#xL, sum#xL, sum#x, count#xL]
 Results [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]

 (10) Exchange
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Arguments: hashpartitioning(k#x, 5), true, [id=#x]

 (11) ShuffleQueryStage
 Output [5]: [sum#xL, k#x, sum#x, count#xL, count#xL]
 Arguments: 1

 (12) CustomShuffleReader
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Arguments: coalesced

 (13) HashAggregate [codegen id : 3]
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Keys [1]: [k#x]
 Functions [3]: [count(1), sum(cast(v1#x as bigint)), avg(cast(v2#x as bigint))]
 Aggregate Attributes [3]: [count(1)#xL, sum(cast(v1#x as bigint))#xL, avg(cast(v2#x as bigint))#x]
 Results [4]: [k#x, count(1)#xL AS count(v1)#xL, sum(cast(v1#x as bigint))#xL AS sum(v1)#xL, avg(cast(v2#x as bigint))#x AS avg(v2)#x]

 (14) AdaptiveSparkPlan
 Output [4]: [k#x, count(v1)#xL, sum(v1)#xL, avg(v2)#x]
 Arguments: isFinalPlan=true
```

### Does this PR introduce any user-facing change?

No, this should be a new feature along with AQE in Spark 3.0.

### How was this patch tested?

Added a query file: `explain-aqe.sql` and a unit test.

Closes #28271 from Ngone51/support_formatted_explain_for_aqe.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 12:44:06 +00:00
Kent Yao 37d2e037ed [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function
### What changes were proposed in this pull request?

Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listed below do not support these, except PostgreSQL and Redshift.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

This PR removes support for these extract fields from the extract function for date and timestamp values.

`isoyear` is PostgreSQL specific, but `yearofweek` is more commonly used across platforms.
`isodow` is PostgreSQL specific, but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.

For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems support only one week-numbering system and use either full names or abbreviations. Things in Spark become a little complicated:
1. Because of the existence of `isodow`, we would need to add an iso-prefix to `dayofweek` to make a pair for it too: [`dayofweek`, `isodayofweek`, `dow` and `isodow`].
2. Because `iso`-prefixed systems are rare and more systems choose the `iso`-suffixed way, we may end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`].
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and has no known use cases.
4. After a discussion in the community, we agreed that an underscore before `iso` looks much better, because `isodow` is new and there is no standard for `iso`-style names; this keeps things simple and clear for end users, provided the fields are well documented too.

Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system (see the example below).
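
Illustratively (2020-04-22 is a Wednesday), the surviving fields behave as follows:

```scala
spark.sql("SELECT extract(DAYOFWEEK FROM date'2020-04-22')").show() // 4 (Sunday = 1)
spark.sql("SELECT extract(DOW FROM date'2020-04-22')").show()       // 3 (Sunday = 0)
spark.sql("SELECT extract(DOW_ISO FROM date'2020-04-22')").show()   // 3 (Monday = 1)
```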

### Why are the changes needed?

Remove some nonstandard and uncommon features; we can add them back if necessary.

### Does this PR introduce any user-facing change?

No; we should target this at 3.0.0, since these fields were added during 3.0.0.

### How was this patch tested?

Removed the now-unused tests.

Closes #28284 from yaooqinn/SPARK-31507.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:24:49 +00:00
Kent Yao 2c2062ea7c [SPARK-31498][SQL][DOCS] Dump public static sql configurations through doc generation
### What changes were proposed in this pull request?

Currently, only the non-static public SQL configurations are dumped to the public doc; we'd better also add the static public ones, as the command `SET -v` does.

This PR forces a call into StaticSQLConf so the static configs are built via `buildStaticConf`.

### Why are the changes needed?

Fix missing SQL configurations in doc

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Added a unit test and verified locally that the public static SQL configs appear in `docs/sql-config.html`.

Closes #28274 from yaooqinn/SPARK-31498.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:16:39 +00:00
herman cf6038499d
[SPARK-31511][SQL] Make BytesToBytesMap iterators thread-safe
### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, which caused issues when the map was iterated over in two threads by two different iterators (see the sketch below).
- Removes the `safeIterator()` function. This is not needed anymore.
- Improves the documentation of a couple of methods w.r.t. thread-safety.
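
A minimal sketch of the fix pattern, in Scala for brevity (the real class is Java, and the names here are simplified, not Spark's):

```scala
// each iterator owns its cursor instead of sharing one mutable
// Location field across every iterator of the map
final class SketchMap(entries: Array[Long]) {
  final class Location { var pos: Int = -1 }

  def iterator(): Iterator[Long] = new Iterator[Long] {
    private val loc = new Location // fresh per iterator, never shared
    def hasNext: Boolean = loc.pos + 1 < entries.length
    def next(): Long = { loc.pos += 1; entries(loc.pos) }
  }
}
```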

### Why are the changes needed?
It is unexpected that an iterator shares the object it returns with all other iterators. This violates the iterator contract, and it causes issues when iterators over a map are consumed in different threads.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #28286 from hvanhovell/SPARK-31511.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 18:17:19 -07:00
Wenchen Fan b209b5f406
[SPARK-31503][SQL] fix the SQL string of the TRIM functions
### What changes were proposed in this pull request?

Override the `sql` method of `StringTrim`, `StringTrimLeft`, and `StringTrimRight` to use the standard SQL syntax.
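
The standard syntax these `sql` strings should round-trip to (illustrative, in a Spark shell):

```scala
spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxabcxx')").show()     // abc
spark.sql("SELECT TRIM(LEADING 'x' FROM 'xxabcxx')").show()  // abcxx
spark.sql("SELECT TRIM(TRAILING 'x' FROM 'xxabcxx')").show() // xxabc
```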

### Why are the changes needed?

The current implementation is wrong. It gives you a SQL string that returns a different result.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

new tests

Closes #28281 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 11:22:18 -07:00