Commit graph

9083 commits

Author SHA1 Message Date
jiake 10a4eafcfe [SPARK-30812][SQL] update the skew join configs by adding the prefix "skewedJoinOptimization"
### What changes were proposed in this pull request?
This is a follow up in [PR#27563](https://github.com/apache/spark/pull/27563).
This PR adds the prefix of "skewedJoinOptimization" in the skew join related configs.

### Why are the changes needed?
address remaining address

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
only update config and no need new ut.

Closes #27630 from JkSelf/renameskewjoinconfig.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 15:55:29 +08:00
LantaoJin c0715221b2 [SPARK-30785][SQL] Create table like should keep tracksPartitionsInCatalog same with source table
### What changes were proposed in this pull request?
Table generated by `CREATE TABLE LIKE` a partitioned table is a partitioned table. But when run `ALTER TABLE ADD PARTITION`, it will throw `AnalysisException: ALTER TABLE ADD PARTITION is not allowed`. That's because the default value of `tracksPartitionsInCatalog` from `CREATE TABLE LIKE` always is false.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes #27538 from LantaoJin/SPARK-30785.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 15:05:34 +08:00
beliefer 0894dbab2c [MINOR][SQL] Improve readability for window execution
### What changes were proposed in this pull request?
I read the comments of `WindowExec` and found some comment will cause confusion and another need to improve.

### Why are the changes needed?
This PR will enhance the readability and let developer works more easy

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No need

Closes #27431 from beliefer/improve-window-readability.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 14:26:27 +08:00
Wenchen Fan 1b67d546bd revert SPARK-29663 and SPARK-29688
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26325 and https://github.com/apache/spark/pull/26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-18 21:19:57 +01:00
yi.wu 68d7edf949 [SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy
### What changes were proposed in this pull request?

Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html):

SQL:
* spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083)
* spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252)
* spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544)
* spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183)
* spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389)
* spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344)
* spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074)
* spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811)
* spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586)
* spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052)
* spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189)
* spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853)

CORE:
* spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855)
* spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235)
* spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001)
* spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371)
* spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429)

### Why are the changes needed?

To comply with the config naming policy.

### Does this PR introduce any user-facing change?

No. Configurations listed above are all newly added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27563 from Ngone51/revise_boolean_conf_name.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 20:39:50 +08:00
yi.wu 643a480b11 [SPARK-30863][SQL] Distinguish Cast and AnsiCast in toString
### What changes were proposed in this pull request?

Prefix by `ansi_`  in `toString` if it's a `AnsiCast` or ansi enabled `Cast`.

E.g. run `spark.sql("select cast('51' as int)").queryExecution.analyzed` under ansi mode.

Before this PR:
```
Project [cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

After this PR:
```
Project [ansi_cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

### Why are the changes needed?

This is useful while comparing `LogicalPlan`s literally.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27608 from Ngone51/ansi_cast_tostring.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 16:10:43 +08:00
HyukjinKwon 9618806f44 [SPARK-30847][SQL] Take productPrefix into account in MurmurHash3.productHash
### What changes were proposed in this pull request?

This PR proposes to port Scala's bugfix https://github.com/scala/scala/pull/7693 (Scala 2.13) to address https://github.com/scala/bug/issues/10495 issue.

In short, it is possible for different product instances having the same children to have the same hash. See:

```scala
scala> spark.range(1).selectExpr("id - 1").queryExecution.analyzed.semanticHash()
res0: Int = -565572825

scala> spark.range(1).selectExpr("id + 1").queryExecution.analyzed.semanticHash()
res1: Int = -565572825
```

### Why are the changes needed?

It was found during the review of https://github.com/apache/spark/pull/27565. We should better produce different hash for different objects.

### Does this PR introduce any user-facing change?

No, it's not identified. Possibly performance related issue.

### How was this patch tested?

Manually tested, and unittest was added.

Closes #27601 from HyukjinKwon/SPARK-30847.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 14:28:44 +08:00
Terry Kim 5866bc77d7 [SPARK-30814][SQL] ALTER TABLE ... ADD COLUMN position should be able to reference columns being added
### What changes were proposed in this pull request?

In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema:
```
root:
  - a: string
  - b: long
```
, the following should work:
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug described above.

### Does this PR introduce any user-facing change?

Yes, now
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
works as expected.

### How was this patch tested?

Added new tests

Closes #27584 from imback82/alter_table_pos_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 13:01:45 +08:00
Liang Zhang d8c0599e54 [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
### What changes were proposed in this pull request?
This PR added two DeveloperApis to the Dataset[T] class. Both methods are just exposing lower-level methods to the Dataset[T] class.

### Why are the changes needed?
They are useful for checking whether two dataframes are the same when implementing dataframe caching in python, and also get a unique ID. It's easier to use if we wrap the lower-level APIs.

### Does this PR introduce any user-facing change?
```
scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2")
df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int]

scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2")
df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int]

scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2")
df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int]

scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2")
df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int]

scala> df1.semanticHash
res0: Int = 594427822

scala> df2.semanticHash
res1: Int = 594427822

scala> df1.sameSemantics(df2)
res2: Boolean = true

scala> df1.sameSemantics(df3)
res3: Boolean = false

scala> df3.semanticHash
res4: Int = -1592702048

scala> df4.semanticHash
res5: Int = -1592702048

scala> df4.sameSemantics(df3)
res6: Boolean = true
```

### How was this patch tested?
Unit test in scala and doctest in python.

Note: comments are copied from the corresponding lower-level APIs.
Note: There are some issues to be fixed that would improve the hash collision rate: https://github.com/apache/spark/pull/27565#discussion_r379881028

Closes #27565 from liangz1/df-same-result.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
2020-02-18 09:22:26 +08:00
Nicholas Chammas 4ed9b88996 [SPARK-30832][DOCS] SQL function doc headers should link to anchors
### Why are the changes needed?

In most of our docs, you can click on a heading to immediately get an anchor link to that specific section of the docs. This is very handy when you are reading the docs and want to share a link to a specific part.

The SQL function docs are lacking this. This PR adds this convenience to the SQL function docs.

Here's the impact on the generated HTML.

Before this PR:

```html
<h3 id="array_join">array_join</h3>
```

After this PR:

```html
<h3 id="array_join"><a class="toclink" href="#array_join">array_join</a></h3>
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I built the docs manually and reviewed the results in my browser.

Closes #27585 from nchammas/SPARK-30832-sql-doc-headers.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-17 19:19:28 -06:00
beliefer d8d3ce5c76 [SPARK-30825][SQL][DOC] Update documents information for window function
### What changes were proposed in this pull request?
I checked the all the window function and found all of them not add parameter information and version information to the document.
This PR will make a supplement.

### Why are the changes needed?
Documentation is missing and does not meet new standards.

### Does this PR introduce any user-facing change?
Yes. User will face the information of parameters and version.

### How was this patch tested?
Exists UT

Closes #27572 from beliefer/add_since_for_window_function.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-18 09:44:34 +09:00
Ajith 657d151395 [SPARK-29174][SQL] Support LOCAL in INSERT OVERWRITE DIRECTORY to data source
### What changes were proposed in this pull request?
`INSERT OVERWRITE LOCAL DIRECTORY` is supported with ensuring the provided path is always using `file://` as scheme and removing the check which throws exception if we do insert overwrite by mentioning directory with `LOCAL` syntax

### Why are the changes needed?
without the modification in PR, ``` insert overwrite local directory <location> using ```

throws exception

```
Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 0)
```
which was introduced in https://github.com/apache/spark/pull/18975, but this restriction is not needed, hence dropping the same.
Keep behaviour consistent for local and remote file-system in  `INSERT OVERWRITE DIRECTORY`

### Does this PR introduce any user-facing change?
Yes, after this change `INSERT OVERWRITE LOCAL DIRECTORY` will not throw exception

### How was this patch tested?
Added UT

Closes #27039 from ajithme/insertoverwrite2.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-18 09:42:31 +09:00
Ajith 2854091d12 [SPARK-22590][SQL] Copy sparkContext.localproperties to child thread in BroadcastExchangeExec.executionContext
### What changes were proposed in this pull request?
In `org.apache.spark.sql.execution.exchange.BroadcastExchangeExec#relationFuture` make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the broadcast execution thread in `org.apache.spark.sql.execution.exchange.BroadcastExchangeExec#executionContext`

### Why are the changes needed?
When executing `BroadcastExchangeExec`, the relationFuture is evaluated via a separate thread. The threads inherit the `localProperties` from `sparkContext` as they are the child threads.
These threads are created in the executionContext (thread pools). Each Thread pool has a default `keepAliveSeconds` of 60 seconds for idle threads.
Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. This will cause taskset properties to be missing when properties are transferred by child thread via `sparkContext.runJob/submitJob`

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added UT

Closes #27266 from ajithme/broadcastlocalprop.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 02:26:52 +08:00
Maxim Gekk afaeb29599 [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting.

### Why are the changes needed?
Because of textual representation of timestamps/dates before 1582 year is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.

### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```

### How was this patch tested?
By running hive-thiftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 02:15:44 +08:00
yi.wu a1d536cb3e [SPARK-15616][FOLLOW-UP][SQL] Sub Optimizer should include super.postHocOptimizationBatches
### What changes were proposed in this pull request?

Let sub optimizer's  `postHocOptimizationBatches` also includes super's `postHocOptimizationBatches`.

### Why are the changes needed?

It's necessary according to the design of catalyst optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass jenkins.

Closes #27607 from Ngone51/spark_15616_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 01:52:23 +08:00
Yuanjian Li 5ffc5ff55e [SPARK-11150][SQL][FOLLOWUP] Move sql/dynamicpruning to sql/execution/dynamicpruning
### What changes were proposed in this pull request?
Follow-up work for #25600. In this PR, we move `sql/dynamicpruning` to `sql/execution/dynamicpruning`.

### Why are the changes needed?
Fix the unexpected public APIs in 3.0.0 #27560.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #27581 from xuanyuanking/SPARK-11150-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 01:44:14 +08:00
Yuming Wang 76ddb6d835 [SPARK-30755][SQL] Update migration guide and add actionable exception for HIVE-15167
### What changes were proposed in this pull request?
[HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) removed the `SerDe` interface. This may break custom `SerDe` builds for Hive 1.2. This PR update the migration guide for this change.

### Why are the changes needed?

Otherwise:
```
2020-01-27 05:11:20.446 - stderr> 20/01/27 05:11:20 INFO DAGScheduler: ResultStage 2 (main at NativeMethodAccessorImpl.java:0) failed in 1.000 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 13, 10.110.21.210, executor 1): java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.defineClass1(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
  2020-01-27 05:11:20.446 - stderr>  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
  2020-01-27 05:11:20.446 - stderr>  at java.security.AccessController.doPrivileged(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  2020-01-27 05:11:20.446 - stderr>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:405)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.Class.forName0(Native Method)
  2020-01-27 05:11:20.446 - stderr>  at java.lang.Class.forName(Class.java:348)
  2020-01-27 05:11:20.446 - stderr>  at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:76)
.....
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manual test

Closes #27492 from wangyum/SPARK-30755.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-17 09:26:56 -08:00
wangguangxin.cn 0ae3ff60c4 [SPARK-30806][SQL] Evaluate once per group in UnboundedWindowFunctionFrame
### What changes were proposed in this pull request?
We only need to do aggregate evaluation once per group in `UnboundedWindowFunctionFrame`

### Why are the changes needed?
Currently, in `UnboundedWindowFunctionFrame.write`,it re-evaluate the processor for each row in a group, which is not necessary in fact which I'll address later. It hurts performance when the evaluation is time-consuming (for example, Percentile's eval need to sort its buffer and do some calculation). In our production, there is a percentile with window operation sql,  it costs more than 10 hours in SparkSQL while 10min in Hive.

In fact, `UnboundedWindowFunctionFrame` can be treated as `SlidingWindowFunctionFrame` with `lbound = UnboundedPreceding` and `ubound = UnboundedFollowing`, just as its comments. In that case, `SlidingWindowFunctionFrame` also only do evaluation once for each group.

The performance issue can be reproduced by running the follow scripts in local spark-shell
```
spark.range(100*100).map(i => (i, "India")).toDF("uv", "country").createOrReplaceTempView("test")
sql("select uv, country, percentile(uv, 0.95) over (partition by country) as ptc95 from test").collect.foreach(println)
```
Before this patch, the sql costs **128048 ms**.
With this patch,  the sql costs **3485 ms**.

If we increase the data size to 1000*1000 for example, then spark cannot even produce result without this patch(I'v waited for several hours).

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
Existing UT

Closes #27558 from WangGuangxin/windows.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-17 18:15:54 +01:00
Yuanjian Li e4a541b278 [SPARK-30829][SQL] Define LegacyBehaviorPolicy enumeration as the common value for result change configs
### What changes were proposed in this pull request?
Define a new enumeration `LegacyBehaviorPolicy` in SQLConf, it will be used as the common value for result change configs.

### Why are the changes needed?
During API auditing for the 3.0 release, we found several new approaches that will change the results silently. For these features, we need a common three-value config.

### Does this PR introduce any user-facing change?
Yes, original config `spark.sql.legacy.ctePrecedence.enabled` change to `spark.sql.legacy.ctePrecedencePolicy`.

### How was this patch tested?
Existing UT.

Closes #27579 from xuanyuanking/SPARK-30829.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 00:52:05 +08:00
Arwin Tio 25e9156bc0 [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
### What changes were proposed in this pull request?
See JIRA: https://issues.apache.org/jira/browse/SPARK-29089
Mailing List: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html

When using DataFrameReader#csv to read many files on S3, globbing and fs.exists on DataSource#checkAndGlobPathIfNecessary becomes a bottleneck.

From the mailing list discussions, an improvement that can be made is to parallelize the blocking FS calls:

> - have SparkHadoopUtils differentiate between files returned by globStatus(), and which therefore exist, and those which it didn't glob for -it will only need to check those.
> - add parallel execution to the glob and existence checks

### Why are the changes needed?

Verifying/globbing files happens on the driver, and if this operations take a long time (for example against S3), then the entire cluster has to wait, potentially sitting idle. This change hopes to make this process faster.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

I added a test suite `DataSourceSuite` - open to suggestions for better naming.

See [here](https://github.com/apache/spark/pull/25899#issuecomment-534380034) and [here](https://github.com/apache/spark/pull/25899#issuecomment-534069194) for some measurements

Closes #25899 from cozos/master.

Lead-authored-by: Arwin Tio <Arwin.tio@adroll.com>
Co-authored-by: Arwin Tio <arwin.tio@hotmail.com>
Co-authored-by: Arwin Tio <arwin.tio@adroll.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-17 09:30:35 -06:00
Maxim Gekk 06217cfded [SPARK-30793][SQL] Fix truncations of timestamps before the epoch to minutes and seconds
### What changes were proposed in this pull request?
In the PR, I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `SECOND` and `MINUTE` levels.

### Why are the changes needed?
This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `SECOND` and `MINUTE` levels. For example, timestamps after the epoch are truncated by cutting off the rest part of the timestamp:
```sql
spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123');
2020-02-11 00:01:02
```
but seconds in the truncated timestamp before the epoch are increased by 1:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:03
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the example above outputs correct result:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:02
```

### How was this patch tested?
Added new tests to `DateFunctionsSuite`.

Closes #27543 from MaxGekk/fix-second-minute-truc.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:51:56 +08:00
Yuanjian Li ab186e3659 [SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior
### What changes were proposed in this pull request?
This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found.

### Why are the changes needed?
Prevent silent behavior changes.

### Does this PR introduce any user-facing change?
Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown.

### How was this patch tested?
Modify existing UT.

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:06:58 +08:00
Maxim Gekk 9107f77f15 [SPARK-30843][SQL] Fix getting of time components before 1582 year
### What changes were proposed in this pull request?

1. Rewrite DateTimeUtils methods `getHours()`, `getMinutes()`, `getSeconds()`, `getSecondsWithFraction()`, `getMilliseconds()` and `getMicroseconds()` using Java 8 time APIs. This will automatically switch the `Hour`, `Minute`, `Second` and `DatePart` expressions on Proleptic Gregorian calendar.
2. Remove unused methods and constant of DateTimeUtils - `to2001`, `YearZero `, `toYearZero` and `absoluteMicroSecond()`.
3. Remove unused value `timeZone` from `TimeZoneAwareExpression` since all expressions have been migrated to Java 8 time API, and legacy instance of `TimeZone` is not needed any more.
4. Change signatures of modified DateTimeUtils methods, and pass `ZoneId` instead of `TimeZone`. This will allow to avoid unnecessary conversions `TimeZone` -> `String` -> `ZoneId`.
5. Modify tests in `DateTimeUtilsSuite` and in `DateExpressionsSuite` to pass `ZoneId` instead of `TimeZone`. Correct the tests, to pass tested zone id instead of None.

### Why are the changes needed?
The changes fix the issue of wrong results returned by the `hour()`, `minute()`, `second()`, `date_part('millisecond', ...)` and `date_part('microsecond', ....)`, see example in [SPARK-30843](https://issues.apache.org/jira/browse/SPARK-30843).

### Does this PR introduce any user-facing change?
Yes. After the changes, the results of examples from SPARK-30843:
```sql
spark-sql> select hour(timestamp '0010-01-01 00:00:00');
0
spark-sql> select minute(timestamp '0010-01-01 00:00:00');
0
spark-sql> select second(timestamp '0010-01-01 00:00:00');
0
spark-sql> select date_part('milliseconds', timestamp '0010-01-01 00:00:00');
0.000
spark-sql> select date_part('microseconds', timestamp '0010-01-01 00:00:00');
0
```

### How was this patch tested?
- By existing test suites `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`.
- Add new tests to `DateExpressionsSuite` and `DateTimeUtilsSuite` for 10 year, like:
```scala
  input = date(10, 1, 1, 0, 0, 0, 0, zonePST)
  assert(getHours(input, zonePST) === 0)
```
- Re-run `DateTimeBenchmark` using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 |

Closes #27596 from MaxGekk/localtimestamp-greg-cal.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-1-30.us-west-2.compute.internal>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 13:59:21 +08:00
Wenchen Fan ab07c6300c [SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view
### What changes were proposed in this pull request?

No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption.

However, `ResolveSessionCatalog` needs to fallback to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we add a hack in `CatalogAndIdentifier`, which does not expand the given identifier with current namespace if the catalog is session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to table "default.t" if temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by
1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views.
2. update `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support.

Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix.

### How was this patch tested?

new tests

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 12:07:46 +08:00
Maxim Gekk 8b73b92aad [SPARK-30826][SQL] Respect reference case in StringStartsWith pushed down to parquet
### What changes were proposed in this pull request?
In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet.

### Why are the changes needed?
This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table:
```sql
CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')
```
created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
+---+
```

### Does this PR introduce any user-facing change?
Yes. After the changes the result is correct for the example above:
```scala
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show
+---+
|col|
+---+
| 42|
+---+
```

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #27574 from MaxGekk/parquet-StringStartsWith-case-sens.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-15 19:49:58 +08:00
DB Tsai d0f9614760 [SPARK-30289][SQL] Partitioned by Nested Column for InMemoryTable
### What changes were proposed in this pull request?
1. `InMemoryTable` was flatting the nested columns, and then the flatten columns was used to look up the indices which is not correct.

This PR implements partitioned by nested column for `InMemoryTable`.

### Why are the changes needed?

This PR implements partitioned by nested column for `InMemoryTable`, so we can test this features in DSv2

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests and new tests.

Closes #26929 from dbtsai/addTests.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-02-14 21:46:01 +00:00
Maxim Gekk 7137a6d065 [SPARK-30766][SQL] Fix the timestamp truncation to the HOUR and DAY levels
### What changes were proposed in this pull request?
In the PR, I propose to use Java 8 time API in timestamp truncations to the levels of `HOUR` and `DAY`. The problem is in the usage of `timeZone.getOffset(millis)` in days/hours truncations where the combined calendar (Julian + Gregorian) is used underneath.

### Why are the changes needed?
The change fix wrong truncations. For example, the following truncation to hours should print `0010-01-01 01:00:00` but it outputs wrong timestamp:
```scala
Seq("0010-01-01 01:02:03.123456").toDF()
    .select($"value".cast("timestamp").as("ts"))
    .select(date_trunc("HOUR", $"ts").cast("string"))
    .show(false)
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:30:17                 |
+------------------------------------+
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the result of the example above is:
```scala
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:00:00                 |
+------------------------------------+
```

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- By `DateExpressionsSuite` and `DateTimeUtilsSuite`

Closes #27512 from MaxGekk/fix-trunc-old-timestamp.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 22:16:57 +08:00
HyukjinKwon 2a270a731a [SPARK-30810][SQL] Parses and convert a CSV Dataset having different column from 'value' in csv(dataset) API
### What changes were proposed in this pull request?

This PR fixes `DataFrameReader.csv(dataset: Dataset[String])` API to take a `Dataset[String]` originated from a column name different from `value`. This is a long-standing bug started from the very first place.

`CSVUtils.filterCommentAndEmpty` assumed the `Dataset[String]` to be originated with `value` column. This PR changes to use the first column name in the schema.

### Why are the changes needed?

For  `DataFrameReader.csv(dataset: Dataset[String])` to support any `Dataset[String]` as the signature indicates.

### Does this PR introduce any user-facing change?
Yes,

```scala
val ds = spark.range(2).selectExpr("concat('a,b,', id) AS text").as[String]
spark.read.option("header", true).option("inferSchema", true).csv(ds).show()
```

Before:

```
org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input columns: [text];;
'Filter (length(trim('value, None)) > 0)
+- Project [concat(a,b,, cast(id#0L as string)) AS text#2]
   +- Range (0, 2, step=1, splits=Some(2))
```

After:

```
+---+---+---+
|  a|  b|  0|
+---+---+---+
|  a|  b|  1|
+---+---+---+
```

### How was this patch tested?

Unittest was added.

Closes #27561 from HyukjinKwon/SPARK-30810.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 18:20:18 +08:00
yi.wu 99b8136a86 [SPARK-25990][SQL] ScriptTransformation should handle different data types correctly
### What changes were proposed in this pull request?

We should convert Spark InternalRows to hive data via `HiveInspectors.wrapperFor`.

### Why are the changes needed?

We may hit below exception without this change:

```
[info]    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.1.6, executor driver): java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[info]   	at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:55)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:321)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:292)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:247)
[info]   	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:231)
[info]   	at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:55)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2(ScriptTransformationExec.scala:300)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$2$adapted(ScriptTransformationExec.scala:281)
[info]   	at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.$anonfun$run$1(ScriptTransformationExec.scala:281)
[info]   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
[info]   	at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new test. But please note that this test returns different result between Hive1.2 and Hive2.3 due to `HiveDecimal` or `SerDe` difference(don't know the root cause yet).

Closes #27556 from Ngone51/script_transform.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 16:52:28 +08:00
maryannxue 0aed77a015 [SPARK-30801][SQL] Subqueries should not be AQE-ed if main query is not
### What changes were proposed in this pull request?
This PR makes sure AQE is either enabled or disabled for the entire query, including the main query and all subqueries.
Currently there are unsupported queries by AQE, e.g., queries that contain DPP filters. We need to make sure that if the main query is unsupported, none of the sub-queries should apply AQE, otherwise it can lead to performance regressions due to missed opportunity of sub-query reuse.

### Why are the changes needed?
To get rid of potential perf regressions when AQE is turned on.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Updated DynamicPartitionPruningSuite:
1. Removed the existing workaround `withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")`
2. Added `DynamicPartitionPruningSuiteAEOn` and `DynamicPartitionPruningSuiteAEOff` to enable testing this suite with AQE on and off options
3. Added a check in `checkPartitionPruningPredicate` to verify that the subqueries are always in sync with the main query in terms of whether AQE is applied.

Closes #27554 from maryannxue/spark-30801.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 11:20:55 +08:00
Ali Afroozeh e2d3983de7 [SPARK-30798][SQL] Scope Session.active in QueryExecution
### What changes were proposed in this pull request?

This PR scopes `SparkSession.active` to prevent problems with processing queries with possibly different spark sessions (and different configs). A new method, `withActive` is introduced on `SparkSession` that restores the previous spark session after the block of code is executed.

### Why are the changes needed?
`SparkSession.active` is a thread local variable that points to the current thread's spark session. It is important to note that the `SQLConf.get` method depends on `SparkSession.active`. In the current implementation it is possible that `SparkSession.active` points to a different session which causes various problems. Most of these problems arise because part of the query processing is done using the configurations of a different session. For example, when creating a data frame using a new session, i.e., `session.sql("...")`, part of the data frame is constructed using the currently active spark session, which can be a different session from the one used later for processing the query.

### Does this PR introduce any user-facing change?
The `withActive` method is introduced on `SparkSession`.

### How was this patch tested?
Unit tests (to be added)

Closes #27387 from dbaliafroozeh/UseWithActiveSessionInQueryExecution.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-13 23:58:55 +01:00
Wenchen Fan a4ceea6868 [SPARK-30751][SQL] Combine the skewed readers into one in AQE skew join optimizations
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->
This is a followup of https://github.com/apache/spark/pull/26434

This PR use one special shuffle reader for skew join, so that we only have one join after optimization. In order to do that, this PR
1. add a very general `CustomShuffledRowRDD` which support all kind of partition arrangement.
2. move the logic of coalescing shuffle partitions to a util function, and call it during skew join optimization, to totally decouple with the `ReduceNumShufflePartitions` rule. It's too complicated to interfere skew join with `ReduceNumShufflePartitions`, as you need to consider the size of split partitions which don't respect target size already.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
The current skew join optimization has a serious performance issue: the size of the query plan depends on the number and size of skewed partitions.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
no

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
existing tests

test UI manually:
![image](https://user-images.githubusercontent.com/3182036/74357390-cfb30480-4dfa-11ea-83f6-825d1b9379ca.png)

explain output
```
AdaptiveSparkPlan(isFinalPlan=true)
+- OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$403a2ed5, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap1f
   +- *(5) SortMergeJoin(skew=true) [key1#2L], [key2#6L], Inner
      :- *(3) Sort [key1#2L ASC NULLS FIRST], false, 0
      :  +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
      :     +- ShuffleQueryStage 0
      :        +- Exchange hashpartitioning(key1#2L, 200), true, [id=#53]
      :           +- *(1) Project [(id#0L % 2) AS key1#2L]
      :              +- *(1) Filter isnotnull((id#0L % 2))
      :                 +- *(1) Range (0, 100000, step=1, splits=6)
      +- *(4) Sort [key2#6L ASC NULLS FIRST], false, 0
         +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
            +- ShuffleQueryStage 1
               +- Exchange hashpartitioning(key2#6L, 200), true, [id=#64]
                  +- *(2) Project [((id#4L % 2) + 1) AS key2#6L]
                     +- *(2) Filter isnotnull(((id#4L % 2) + 1))
                        +- *(2) Range (0, 100000, step=1, splits=6)
```

Closes #27493 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-13 20:09:24 +01:00
Yuming Wang fb0e07b08c [SPARK-29231][SQL] Constraints should be inferred from cast equality constraint
### What changes were proposed in this pull request?

This PR add support infer constraints from cast equality constraint. For example:
```scala
scala> spark.sql("create table spark_29231_1(c1 bigint, c2 bigint)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_29231_2(c1 int, c2 bigint)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#5L, c2#6L]
+- *(2) BroadcastHashJoin [c1#5L], [cast(c1#7 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#5L, c2#6L]
   :  +- *(2) Filter (isnotnull(c1#5L) AND (c1#5L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#5L,c2#6L] Batched: true, DataFilters: [isnotnull(c1#5L), (c1#5L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#209]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

After this PR:
```scala
scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#0L, c2#1L]
+- *(2) BroadcastHashJoin [c1#0L], [cast(c1#2 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#0L, c2#1L]
   :  +- *(2) Filter (isnotnull(c1#0L) AND (c1#0L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#0L,c2#1L] Batched: true, DataFilters: [isnotnull(c1#0L), (c1#0L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#99]
      +- *(1) Project [c1#2]
         +- *(1) Filter ((cast(c1#2 as bigint) = 1) AND isnotnull(c1#2))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#2] Batched: true, DataFilters: [(cast(c1#2 as bigint) = 1), isnotnull(c1#2)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27252 from wangyum/SPARK-29231.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 22:48:27 +08:00
beliefer 04604b9899 [SPARK-30758][SQL][TESTS] Improve bracketed comments tests
### What changes were proposed in this pull request?
Although Spark SQL support bracketed comments, but `SQLQueryTestSuite` can't treat bracketed comments well and lead to generated golden files can't display bracketed comments well.
This PR will improve the treatment of bracketed comments and add three test case in `PlanParserSuite`.
Spark SQL can't support nested bracketed comments and https://github.com/apache/spark/pull/27495 used to support it.

### Why are the changes needed?
Golden files can't display well.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT.

Closes #27481 from beliefer/ansi-brancket-comments.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 22:06:24 +08:00
Terry Kim a6b4b914f2 [SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax
### What changes were proposed in this pull request?

This PR proposes to support Hive-style `ALTER TABLE ... REPLACE COLUMNS ...` as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

The user now can do the following:
```SQL
CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');
```
, which drops the existing columns `col1` and `col2`, and add new columns `col2` and `col3`.

### Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style `ALTER TABLE ... CHANGE COLUMN ...`, so this new addition can be useful.

### Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

### How was this patch tested?

More tests to be added.

Closes #27482 from imback82/replace_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 20:13:36 +08:00
maryannxue 453d5261b2 [SPARK-30528][SQL] Turn off DPP subquery duplication by default
### What changes were proposed in this pull request?
This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression.
When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation being on the build side, otherwise it will either opt out or plan the filter as an un-reusable subquery duplication based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, thus adding an un-reusable subquery duplication DPP filter can sometimes cause perf regression.
This PR turns off the subquery duplication DPP filter by:
1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default.
2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible.

### Why are the changes needed?
This is to fix a potential performance regression caused by DPP.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Updated DynamicPartitionPruningSuite to test the new configuration.

Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 19:32:38 +08:00
iRakson 926e3a1efe [SPARK-30790] The dataType of map() should be map<null,null>
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes it will return map<null,null>

### Why are the changes needed?
After changes introduced due to #27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 12:23:40 +08:00
Thomas Graves 496f6ac860 [SPARK-29148][CORE] Add stage level scheduling dynamic allocation and scheduler backend changes
### What changes were proposed in this pull request?

This is another PR for stage level scheduling. In particular this adds changes to the dynamic allocation manager and the scheduler backend to be able to track what executors are needed per ResourceProfile.  Note the api is still private to Spark until the entire feature gets in, so this functionality will be there but only usable by tests for profiles other then the DefaultProfile.

The main changes here are simply tracking things on a ResourceProfile basis as well as sending the executor requests to the scheduler backend for all ResourceProfiles.

I introduce a ResourceProfileManager in this PR that will track all the actual ResourceProfile objects so that we can keep them all in a single place and just pass around and use in datastructures the resource profile id. The resource profile id can be used with the ResourceProfileManager to get the actual ResourceProfile contents.

There are various places in the code that use executor "slots" for things.  The ResourceProfile adds functionality to keep that calculation in it.   This logic is more complex then it should due to standalone mode and mesos coarse grained not setting the executor cores config. They default to all cores on the worker, so calculating slots is harder there.
This PR keeps the functionality to make the cores the limiting resource because the scheduler still uses that for "slots" for a few things.

This PR does also add the resource profile id to the Stage and stage info classes to be able to test things easier.   That full set of changes will come with the scheduler PR that will be after this one.

The PR stops at the scheduler backend pieces for the cluster manager and the real YARN support hasn't been added in this PR, that again will be in a separate PR, so this has a few of the API changes up to the cluster manager and then just uses the default profile requests to continue.

The code for the entire feature is here for reference: https://github.com/apache/spark/pull/27053/files although it needs to be upmerged again as well.

### Why are the changes needed?

Needed for stage level scheduling feature.

### Does this PR introduce any user-facing change?

No user facing api changes added here.

### How was this patch tested?

Lots of unit tests and manually testing. I tested on yarn, k8s, standalone, local modes. Ran both failure and success cases.

Closes #27313 from tgravescs/SPARK-29148.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-02-12 16:45:42 -06:00
Liang-Chi Hsieh 5b76367a9d [SPARK-30797][SQL] Set tradition user/group/other permission to ACL entries when setting up ACLs in truncate table
### What changes were proposed in this pull request?

This is a follow-up to the PR #26956. In #26956, the patch proposed to preserve path permission when truncating table. When setting up original ACLs, we need to set user/group/other permission as ACL entries too, otherwise if the path doesn't have default user/group/other ACL entries, ACL API will complain an error `Invalid ACL: the user, group and other entries are required.`.

 In short this change makes sure:

1. Permissions for user/group/other are always kept into ACLs to work with ACL API.
2. Other custom ACLs are still kept after TRUNCATE TABLE (#26956 did this).

### Why are the changes needed?

Without this fix, `TRUNCATE TABLE` will get an error when setting up ACLs if there is no default default user/group/other ACL entries.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Update unit test.

Manual test on dev Spark cluster.

Set ACLs for a table path without default user/group/other ACL entries:
```
hdfs dfs -setfacl --set 'user:liangchi:rwx,user::rwx,group::r--,other::r--' /user/hive/warehouse/test.db/test_truncate_table

hdfs dfs -getfacl /user/hive/warehouse/test.db/test_truncate_table
# file: /user/hive/warehouse/test.db/test_truncate_table
# owner: liangchi
# group: supergroup
user::rwx
user:liangchi:rwx
group::r--
mask::rwx
other::r--
```
Then run `sql("truncate table test.test_truncate_table")`, it works by normally truncating the table and preserve ACLs.

Closes #27548 from viirya/fix-truncate-table-permission.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-12 14:27:18 -08:00
Maxim Gekk aa0d13683c [SPARK-30760][SQL] Port millisToDays and daysToMillis on Java 8 time API
### What changes were proposed in this pull request?
In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using Java 8 time API.

I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private methods, and is not used anymore in Spark SQL.

### Why are the changes needed?
New implementation is based on Proleptic Gregorian calendar which has been already used by other date-time functions. This changes make `millisToDays` and `daysToMillis` consistent to rest Spark SQL API related to date & time operations.

### Does this PR introduce any user-facing change?
Yes, this might effect behavior for old dates before 1582 year.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`.

Closes #27494 from MaxGekk/millis-2-days-java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 02:31:48 +08:00
Eric Wu 5919bd3b8d [SPARK-30651][SQL] Add detailed information for Aggregate operators in EXPLAIN FORMATTED
### What changes were proposed in this pull request?
Currently `EXPLAIN FORMATTED` only report input attributes of HashAggregate/ObjectHashAggregate/SortAggregate, while `EXPLAIN EXTENDED` provides more information of Keys, Functions, etc. This PR enhanced `EXPLAIN FORMATTED` to sync with original explain behavior.

### Why are the changes needed?
The newly added `EXPLAIN FORMATTED` got less information comparing to the original `EXPLAIN EXTENDED`

### Does this PR introduce any user-facing change?
Yes, taking HashAggregate explain result as example.

**SQL**
```
EXPLAIN FORMATTED
  SELECT
    COUNT(val) + SUM(key) as TOTAL,
    COUNT(key) FILTER (WHERE val > 1)
  FROM explain_temp1;
```

**EXPLAIN EXTENDED**
```
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(val#6), sum(cast(key#5 as bigint)), count(key#5)], output=[TOTAL#62L, count(key) FILTER (WHERE (val > 1))#71L])
+- Exchange SinglePartition, true, [id=#89]
   +- HashAggregate(keys=[], functions=[partial_count(val#6), partial_sum(cast(key#5 as bigint)), partial_count(key#5) FILTER (WHERE (val#6 > 1))], output=[count#75L, sum#76L, count#77L])
      +- *(1) ColumnarToRow
         +- FileScan parquet default.explain_temp1[key#5,val#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/XXX/spark-dev/spark/spark-warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,val:int>
```

**EXPLAIN FORMATTED - BEFORE**
```
== Physical Plan ==
* HashAggregate (5)
+- Exchange (4)
   +- HashAggregate (3)
      +- * ColumnarToRow (2)
         +- Scan parquet default.explain_temp1 (1)

...
...
(5) HashAggregate [codegen id : 2]
Input: [count#91L, sum#92L, count#93L]
...
...
```

**EXPLAIN FORMATTED - AFTER**
```
== Physical Plan ==
* HashAggregate (5)
+- Exchange (4)
   +- HashAggregate (3)
      +- * ColumnarToRow (2)
         +- Scan parquet default.explain_temp1 (1)

...
...
(5) HashAggregate [codegen id : 2]
Input: [count#91L, sum#92L, count#93L]
Keys: []
Functions: [count(val#6), sum(cast(key#5 as bigint)), count(key#5)]
Results: [(count(val#6)#84L + sum(cast(key#5 as bigint))#85L) AS TOTAL#78L, count(key#5)#86L AS count(key) FILTER (WHERE (val > 1))#87L]
Output: [TOTAL#78L, count(key) FILTER (WHERE (val > 1))#87L]
...
...
```

### How was this patch tested?
Three tests added in explain.sql for HashAggregate/ObjectHashAggregate/SortAggregate.

Closes #27368 from Eric5553/ExplainFormattedAgg.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 02:00:23 +08:00
Maxim Gekk 61b1e608f0 [SPARK-30759][SQL][TESTS][FOLLOWUP] Check cache initialization in StringRegexExpression
### What changes were proposed in this pull request?
Added new test to `RegexpExpressionsSuite` which checks that `cache` of compiled pattern is set when the `right` expression (pattern in `LIKE`) is a foldable expression.

### Why are the changes needed?
To be sure that `cache` in `StringRegexExpression` is initialized for foldable patterns.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the added test in `RegexpExpressionsSuite`.

Closes #27547 from MaxGekk/regexp-cache-test.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 23:50:34 +08:00
Maxim Gekk c1986204e5 [SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters
### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see https://github.com/apache/spark/pull/26507 & https://github.com/apache/spark/pull/26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat`, and support the `lenient` mode. When the `lenient` parameter is set to `false`, the parser become much stronger in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 20:12:38 +08:00
beliefer f5026b1ba7 [SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
### What changes were proposed in this pull request?
The current implement of `regexp_extract` will throws a unprocessed exception show below:

`SELECT regexp_extract('1a 2b 14m', 'd+')`
```
java.lang.IndexOutOfBoundsException: No group 1
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
```
I think should treat this exception well.

### Why are the changes needed?
Fix a bug `java.lang.IndexOutOfBoundsException No group 1 `

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
New UT

Closes #27508 from beliefer/fix-regexp_extract-bug.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 14:49:22 +08:00
Kris Mok b4769998ef [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
### What changes were proposed in this pull request?

This PR proposes to make the `code` string interpolator treat escapes the same way as Scala's builtin `StringContext.s()` string interpolator. This will remove the need for an ugly workaround in `Like` expression's codegen.

### Why are the changes needed?

The `code()` string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin `StringContext.s()` interpolator, i.e. it should treat escapes in the code parts, and should not treat escapes in the input arguments.

For example,
```scala
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
```
We should expect the `code()` interpolator to produce the same result as the `StringContext.s()` interpolator, where only escapes in the string parts should be treated, while the args should be kept verbatim.

But in the current implementation, due to the eager folding of code parts and literal input args, the escape treatment is incorrectly done on both code parts and literal args.
That causes a problem when an arg contains escape sequences and wants to preserve that in the final produced code string. For example, in `Like` expression's codegen, there's an ugly workaround for this bug:
```scala
      // We need double escape to avoid org.codehaus.commons.compiler.CompileException.
      // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
      // '\"' will cause exception 'Line break in literal not allowed'.
      val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
        s"""\\\\\\$escapeChar"""
      } else {
        escapeChar
      }
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new unit test case in `CodeBlockSuite`.

Closes #27544 from rednaxelafx/fix-code-string-interpolator.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 15:19:16 +09:00
herman b25359cca3 [SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions
### What changes were proposed in this pull request?
This is a small follow-up for https://github.com/apache/spark/pull/27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.

### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added test to `SparkPlanSuite`.

Closes #27530 from hvanhovell/SPARK-30780.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 10:48:29 +09:00
Maxim Gekk 45db48e2d2 Revert "[SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?

In the PR, I propose to revert the commit 8aebc80e0e.

### Why are the changes needed?
See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 10:15:34 -08:00
HyukjinKwon 99bd59fe29 [SPARK-29462][SQL][DOCS] Add some more context and details in 'spark.sql.defaultUrlStreamHandlerFactory.enabled' documentation
### What changes were proposed in this pull request?

This PR adds some more information and context to `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Why are the changes needed?

It is a bit difficult to understand the documentation of `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Does this PR introduce any user-facing change?

Nope, internal doc only fix.

### How was this patch tested?

Nope. I only tested linter.

Closes #27541 from HyukjinKwon/SPARK-29462-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 09:55:02 -08:00
Maxim Gekk dc66d57e98 [SPARK-30754][SQL] Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
### What changes were proposed in this pull request?
In the case of back-to-back calculation of `floorDiv` and `floorMod` with the same arguments, the result of `foorDiv` can be reused in calculation of `floorMod`. The `floorMod` method is defined as the following in Java standard library:
```java
    public static int floorMod(int x, int y) {
        int r = x - floorDiv(x, y) * y;
        return r;
    }
```
If `floorDiv(x, y)` has been already calculated, it can be reused in `x - floorDiv(x, y) * y`.

I propose to modify 2 places in `DateTimeUtils`:
1. `microsToInstant` which is widely used in many date-time functions. `Math.floorMod(us, MICROS_PER_SECOND)` is just replaced by its definition from Java Math library.
2. `truncDate`: `Math.floorMod(oldYear, divider) == 0` is replaced by `Math.floorDiv(oldYear, divider) * divider == oldYear` where `floorDiv(...) * divider` is pre-calculated.

### Why are the changes needed?
This reduces the number of arithmetic operations, and can slightly improve performance of date-time functions.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27491 from MaxGekk/opt-microsToInstant.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-11 09:07:40 -06:00
fuwhu f1d0dce484 [MINOR][DOC] Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions
### What changes were proposed in this pull request?
Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions.

### Why are the changes needed?
To describe these two classes.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
no

Closes #27535 from fuwhu/SPARK-15616-FOLLOW-UP.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-11 22:16:44 +08:00