Commit graph

27278 commits

Author SHA1 Message Date
Gabor Somogyi 5a5af46a94 [SPARK-31575][SQL] Synchronise global JVM security configuration modification
### What changes were proposed in this pull request?
There is a race in secure JDBC connection providers. Namely multiple providers can read and/or write the exact same JVM security configuration at the same time. In this PR I've synchronised them with an object class. Since the configuration read and write takes couple of milliseconds it won't cause performance degradation.

### Why are the changes needed?
There is a race in secure JDBC connection providers.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing unit + integration tests.

Closes #28368 from gaborgsomogyi/SPARK-31575.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:10:58 -05:00
Huaxin Gao 7a670b5a0a [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
### What changes were proposed in this pull request?
Add Python version of
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```

### Why are the changes needed?
parity between scala and python

### Does this PR introduce _any_ user-facing change?
yes
new method
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```
in PySpark ANOVATest/ChisqTest/FValueTest

### How was this patch tested?
New doctest

Closes #28483 from huaxingao/flatten_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:09:00 -05:00
Max Gekk 32a5398b65 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date//timestamp values with 0.5 probability. This will trigger dictionary encoding in Parquet datasource test  HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.

### Why are the changes needed?
To extend test coverage. Currently, probability of testing of dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed in wide range, and the chance of generating the same values is pretty low. In this way, parquet datasource cannot apply dictionary encoding for such column types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.

Closes #28481 from MaxGekk/test-random-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 12:59:41 +00:00
Dongjoon Hyun b80309bdb4
[SPARK-31674][CORE][DOCS] Make Prometheus metric endpoints experimental
### What changes were proposed in this pull request?

This PR aims to new Prometheus-format metric endpoints experimental in Apache Spark 3.0.0.

### Why are the changes needed?

Although the new metrics are disabled by default, we had better make it experimental explicitly in Apache Spark 3.0.0 since the output format is still not fixed. We can finalize it in Apache Spark 3.1.0.

### Does this PR introduce _any_ user-facing change?

Only doc-change is visible to the users.

### How was this patch tested?

Manually check the code since this is a documentation and class annotation change.

Closes #28495 from dongjoon-hyun/SPARK-31674.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-10 22:32:26 -07:00
Max Gekk 5d5866be12 [SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle especially `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianMicros()`.

### Why are the changes needed?
This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala>
Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```
Expected values **must be 1001-01-01 01:02:03.123** but not 1001-01-07 00:32:20.123.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to checked reading dictionary encoded dates.

Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 04:58:08 +00:00
Max Gekk 9f768fa991 [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
### What changes were proposed in this pull request?
Shift non-existing dates in Proleptic Gregorian calendar by 1 day. The reason for that is `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some dates/timestamps don't exist in Proleptic Gregorian calendar like 1000-02-29 because 1000 is not leap year in Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite and set non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 14:22:12 -05:00
Huaxin Gao a75dc80a76 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference
### What changes were proposed in this pull request?
Remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
Please see #28414

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually build and check

Closes #28451 from huaxingao/html_cleanup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 12:57:25 -05:00
Max Gekk ce63bef1da [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method `VectorizedColumnReader` to handle especially the `DateType` when passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianDays()`.

### Why are the changes needed?
This fixes the bug of loading dates before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
Expected values **must be 1000-01-01** but not 1001-01-07.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to checked reading dictionary encoded dates.

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-10 13:31:26 +09:00
tianlzhang ecda38a7b3
[SPARK-31611][YARN] Register NettyMemoryMetrics into Node Manager's metrics system
### What changes were proposed in this pull request?

Register `NettyMemoryMetrics` into Node Manager's metrics system through `YarnShuffleServiceMetrics`.

- usedDirectMemory
- usedHeapMemory

### Why are the changes needed?

Such that `NettyMemoryMetrics` can be exposed through Node Manager's JMX.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Update UT to ensure NettyMemoryMetrics are registered into Node Manager's metrics system.

Closes #28416 from manuzhang/spark-31611.

Authored-by: tianlzhang <tianlzhang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-08 15:50:19 -07:00
wang-zhun c1801fd6da [SPARK-31235][FOLLOWUP][TESTS][TEST-HADOOP3.2] Fix test "specify a more specific type for the ap…
### What changes were proposed in this pull request?
Update the input parameters for instantiating `RMAppManager` and `ClientRMService`

### Why are the changes needed?
For hadoop3.2, if `RMAppManager` is not created correctly, the following exception will occur:
```
java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
	at org.apache.hadoop.yarn.security.YarnAuthorizationProvider.getInstance(YarnAuthorizationProvider.java:55)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.<init>(RMAppManager.java:117)
```

### How was this patch tested?
UTs

Closes #28456 from wang-zhun/Fix-SPARK-31235.

Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-08 15:41:23 -05:00
manuzhang 77c690a725 [SPARK-31658][SQL] Fix SQL UI not showing write commands of AQE plan
### What changes were proposed in this pull request?
Show write commands on SQL UI of an AQE plan

### Why are the changes needed?
Currently the leaf node of an AQE plan is always a `AdaptiveSparkPlan` which is not true when it's a child of a write command. Hence, the node of the write command as well as its metrics are not shown on the SQL UI.

#### Before

![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png)

#### After

![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add UT.

Closes #28474 from manuzhang/aqe-ui.

Lead-authored-by: manuzhang <owenzhang1990@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-08 10:24:13 -07:00
Kousuke Saruta 0fb607ef37 [SPARK-30385][WEBUI] WebUI occasionally throw IOException on stop()
### What changes were proposed in this pull request?

This PR added a workaround for the issue which occasionally happens when SparkContext#stop() is called.
I think this issue can occurs on macOS with OpenJDK / OracleJDK 1.8.
If this issue happens, following stack trace is shown.
```
20/05/03 02:17:54 WARN AbstractConnector:
java.io.IOException: No such file or directory
	at sun.nio.ch.NativeThread.signal(Native Method)
	at sun.nio.ch.ServerSocketChannelImpl.implCloseSelectableChannel(ServerSocketChannelImpl.java:292)
	at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
	at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
	at org.eclipse.jetty.server.ServerConnector.close(ServerConnector.java:368)
	at org.eclipse.jetty.server.AbstractNetworkConnector.shutdown(AbstractNetworkConnector.java:105)
	at org.eclipse.jetty.server.Server.doStop(Server.java:439)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
	at org.apache.spark.ui.ServerInfo.stop(JettyUtils.scala:501)
	at org.apache.spark.ui.WebUI.$anonfun$stop$2(WebUI.scala:173)
	at org.apache.spark.ui.WebUI.$anonfun$stop$2$adapted(WebUI.scala:173)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.ui.WebUI.stop(WebUI.scala:173)
	at org.apache.spark.ui.SparkUI.stop(SparkUI.scala:101)
	at org.apache.spark.SparkContext.$anonfun$stop$6(SparkContext.scala:1966)
	at org.apache.spark.SparkContext.$anonfun$stop$6$adapted(SparkContext.scala:1966)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.SparkContext.$anonfun$stop$5(SparkContext.scala:1966)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1966)
	at org.apache.spark.repl.Main$.$anonfun$doMain$3(Main.scala:79)
	at org.apache.spark.repl.Main$.$anonfun$doMain$3$adapted(Main.scala:79)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.repl.Main$.doMain(Main.scala:79)
	at org.apache.spark.repl.Main$.main(Main.scala:58)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

This issue happens when the Jetty's acceptor thread shrinks before the main thread sends a signal to the thread.

Jetty's acceptor thread waits for a new connection request and blocked by `accept(this.fd, newfd, isaa)` in [`sun.nio.ch.ServerSocketChannelImpl#accept`](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l241).

When `org.eclipse.jetty.server.Server.doStop` is called in the main thread, the thread reaches [this code](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l280).

The server socket descriptor will be closed by `nd.preClose` in the main thread.
Then, `accept()` in acceptor thread throws an Exception due to "Bad file descriptor" in case of macOS.
After the exception is thrown, the acceptor thread will continue to [fetch a task](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L783).
If the thread obtain the `SHRINK` task [here](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L854), the thread will be shrink.
If, the acceptor thread finishes before `NativeThread.signal` is called in the main thread, this issue happens.

I have confirmed this issue happens even `jetty-9.4.28.v20200408`.
Because the stack trace is displayed by the [logger](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java#L372), it's difficult to suppress it.
According to [this condition](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L842), shrink doesn't happen if the idle time is 0. So this PR adds a workaround that set the idle time to 0 immediately before stop.

In case of Linux, the acceptor thread is still blocked by `accept` even though `np.preClose` is called in the main thread.
The acceptor thread will return from `accept` when `NativeThread.signal` is called in the main thread.
It seems that the implementation of `accept systemcall` called in `accept` is different between Linux and macOS.
So, I believe this issue doesn't happen on Linux.

Also, the implementation of `NativeThread.signal` is a little bit changed in [OpenJDK 9](http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/7b17bff2ea36) for macOS.
So this issue doesn't happen for macOS with OpenJDK 9+.

You can reproduce this issue by following instructions using debugger.

1. Launch spark-shell in local mode with JDWP enabled.
2. Access to WebUI. This is needed to increase the number of SparkUI thread to greater than minThreads to meet the condition of shrink.
3. Enable the following breakpoints. Note that don't suspend all threads when a thread reaches one of the breakpoints. Only the threads which reach the line should be suspended.
  3.1 [long now = System.nanoTime(); at org.eclipse.jetty.util.thread.QueuedThreadPool#idleJobPoll](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L850)
  3.2 [NativeThread.signal(th); at sun.nio.ch.ServerSocketChannelImpl#implCloseSelectableChannel](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l283)
  3.3 [thread = 0; at ServerSocketChannelImpl#accept](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l247)
4. Quit spark-shell.
5.  Waiting for a thread reaching the breakpoint `3.1` and until the following condition become true (The idle time of those threads are 1min and you can confirm it using the expression evaluation feature if your debugger supports ).
`(System.nanoTime() - last) > TimeUnit.MILLISECONDS.toNanos(_idleTimeout)`
6. The acceptor thread named `SparkUI-<N>-acceptor-0` should be suspended at the breakpoint `3.3` so continue this thread. This thread will reach the breakpoint at `3.1` and continue further. Then, the acceptor thread will be shrink.
7. Continue all the threads rest.

### Why are the changes needed?

This stack trace is not brought by Spark but it confuses users.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by the reproduce procedure above and confirmed acceptor thread is no longer shrink.

Closes #28437 from sarutak/SPARK-30385.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-08 08:41:18 +00:00
zhengruifeng bb9b50c217 [SPARK-31656][ML][PYSPARK] AFT blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as epsilon
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (~10X speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28473 from zhengruifeng/blockify_aft.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 14:06:36 +08:00
Huaxin Gao 18d2ba53e4 [SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark
### What changes were proposed in this pull request?
Add ANOVASelector and FValueSelector to PySpark

### Why are the changes needed?
ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well.

### Does this PR introduce _any_ user-facing change?
Yes. Add Python version of ANOVASelector and FValueSelector

### How was this patch tested?
new doctest

Closes #28464 from huaxingao/selector_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 11:02:24 +08:00
Huaxin Gao 08335b651a [SPARK-31659][ML][DOCS] Add VarianceThresholdSelector examples and doc
### What changes were proposed in this pull request?
Add VarianceThresholdSelector examples and doc

### Why are the changes needed?
VarianceThresholdSelector is a new feature selector in 3.1.0. We need to add examples and doc

### Does this PR introduce _any_ user-facing change?
Yes.
add Scala, Python and Java examples for VarianceThresholdSelector. Also add doc

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 01 AM" src="https://user-images.githubusercontent.com/13592258/81321791-e3f84d80-9047-11ea-837b-e39c193bd437.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 44 AM" src="https://user-images.githubusercontent.com/13592258/81321806-e8246b00-9047-11ea-8f35-206e330a92ab.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 27 AM" src="https://user-images.githubusercontent.com/13592258/81321822-ea86c500-9047-11ea-8743-99adec7f502b.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 43 AM" src="https://user-images.githubusercontent.com/13592258/81321826-ec508880-9047-11ea-9e7a-22ee5e13f495.png">

### How was this patch tested?
Manually checked

Closes #28478 from huaxingao/variance_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:57:35 +08:00
zhengruifeng 97332f26bf [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as `epsilon`
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines;  (up to 6X(squaredError)~12X(huber) speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28471 from zhengruifeng/blockify_lir_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:52:01 +08:00
Dongjoon Hyun 24fac1e0c7
[SPARK-31646][FOLLOWUP][TESTS] Add clean up code and disable irrelevent conf 2020-05-07 17:50:32 -07:00
tianlzhang dad61ed465
[SPARK-31646][SHUFFLE] Remove unused registeredConnections counter from ShuffleMetrics
### What changes were proposed in this pull request?
Remove unused `registeredConnections` counter from `ExternalBlockHandler#ShuffleMetrics`

This was added by SPARK-25642 at 3.0.0
- 8dd29fe36b

### Why are the changes needed?
It's `registeredConnections` counter created in `TransportContext` that's really counting the numbers and it's misleading for people who want to add new metrics like `registeredConnections`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add UTs to ensure all expected metrics are registered for `ExternalShuffleService` and `YarnShuffleService`

Closes #28457 from manuzhang/spark-31611-pre.

Lead-authored-by: tianlzhang <tianlzhang@ebay.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-07 15:22:13 -07:00
angerszhu 0d9faf602e
[SPARK-31655][BUILD] Upgrade snappy-java to 1.1.7.5
### What changes were proposed in this pull request?

snappy-java have release v1.1.7.5, upgrade to latest version.

Fixed in v1.1.7.4
- Caching internal buffers for SnappyFramed streams #234
- Fixed the native lib for ppc64le to work with glibc 2.17 (Previously it depended on 2.22)

Fixed in v1.1.7.5
- Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4

https://github.com/xerial/snappy-java/compare/1.1.7.3...1.1.7.5

v 1.1.7.5 release note:
edc4ec28bd
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need

Closes #28472 from AngersZhuuuu/spark-31655.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-07 12:01:43 -07:00
Max Gekk 272d229005 [SPARK-31361][SQL][TESTS][FOLLOWUP] Check non-vectorized Parquet reader while date/timestamp rebasing
### What changes were proposed in this pull request?
In PR, I propose to modify two tests of `ParquetIOSuite`:
- SPARK-31159: rebasing timestamps in write
- SPARK-31159: rebasing dates in write

to check non-vectorized Parquet reader together with vectorized reader.

### Why are the changes needed?
To improve test coverage and make sure that non-vectorized reader behaves similar to the vectorized reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `PaquetIOSuite`:
```
$ ./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28466 from MaxGekk/test-novec-rebase-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-07 07:52:29 +00:00
Kent Yao b31ae7bb0b [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
### What changes were proposed in this pull request?

The RuntimeReplaceable ones are runtime replaceable, thus, their original parameters are not going to be resolved to PrettyAttribute and remain debug style string if we directly implement their `sql` methods with their parameters' `sql` methods.

This PR is raised with suggestions by maropu and cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of  the RuntimeReplaceable ones with toPettySQL

### Why are the changes needed?

Consistency of schema output between RuntimeReplaceable expressions and normal ones.

For example, `date_format` vs `to_timestamp`, before this PR, they output differently

#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")

struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>

````

### Does this PR introduce _any_ user-facing change?

Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example

### How was this patch tested?
regenerate all related tests

Closes #28420 from yaooqinn/SPARK-31615.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:40:26 +09:00
Kent Yao bd6b53cc0b [SPARK-31631][TESTS] Fix test flakiness caused by MiniKdc which throws 'address in use' BindException with retry
### What changes were proposed in this pull request?
The `Kafka*Suite`s are flaky because of the Hadoop MiniKdc issue - https://issues.apache.org/jira/browse/HADOOP-12656
> Looking at MiniKdc implementation, if port is 0, the constructor use ServerSocket to find an unused port, assign the port number to the member variable port and close the ServerSocket object; later, in initKDCServer(), instantiate a TcpTransport object and bind at that port.

> It appears that the port may be used in between, and then throw the exception.

Related test failures are suspected,  such as https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122225/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

```scala
[info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** (15 seconds, 426 milliseconds)
[info]   java.net.BindException: Address already in use
[info]   at sun.nio.ch.Net.bind0(Native Method)
[info]   at sun.nio.ch.Net.bind(Net.java:433)
[info]   at sun.nio.ch.Net.bind(Net.java:425)
[info]   at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
[info]   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
[info]   at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198)
[info]   at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422)
[info]   at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
```
After comparing the error stack trace with similar issues reported  in different projects, such as
https://issues.apache.org/jira/browse/KAFKA-3453
https://issues.apache.org/jira/browse/HBASE-14734

We can be sure that they are caused by the same problem issued in HADOOP-12656.

In the PR, We apply the approach from HBASE first before we finally drop Hadoop 2.7.x

### Why are the changes needed?

fix test flakiness

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

the test itself passing Jenkins

Closes #28442 from yaooqinn/SPARK-31631.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:37:03 +09:00
zhengruifeng 052ff49acd [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`

### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28458 from zhengruifeng/blockify_lor_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-07 10:07:24 +08:00
Liang-Chi Hsieh 9bf738724a [SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown
### What changes were proposed in this pull request?

This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document.

### Why are the changes needed?

Make developers less confusing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #28468 from viirya/SPARK-31365-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:57:08 +09:00
Max Gekk 3d38bc2605 [SPARK-31361][SQL][FOLLOWUP] Use LEGACY_PARQUET_REBASE_DATETIME_IN_READ instead of avro config in ParquetIOSuite
### What changes were proposed in this pull request?
Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ ` by `LEGACY_PARQUET_REBASE_DATETIME_IN_READ ` in `ParquetIOSuite`.

### Why are the changes needed?
Avro config is not relevant to the parquet tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetIOSuite` via
```
./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28461 from MaxGekk/fix-conf-in-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-07 09:46:42 +09:00
HyukjinKwon 5c5dd77d6a [SPARK-31647][SQL] Deprecate 'spark.sql.optimizer.metadataOnly' configuration
### What changes were proposed in this pull request?

This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release.

### Why are the changes needed?

This optimization can cause a potential correctness issue, see also SPARK-26709.
Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590.

Looks we should just better let users use `SparkSessionExtensions` instead if they must use, and remove it in Spark side.

### Does this PR introduce _any_ user-facing change?

Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning:

```scala
scala> spark.conf.unset("spark.sql.optimizer.metadataOnly")
```
```
20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
 deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
```scala
scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```
```
20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```

### How was this patch tested?

Manually tested.

Closes #28459 from HyukjinKwon/SPARK-31647.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:00:59 +09:00
Huaxin Gao 09ece50799 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark
### What changes were proposed in this pull request?
Add VarianceThresholdSelector to PySpark

### Why are the changes needed?
parity between Scala and Python

### Does this PR introduce any user-facing change?
Yes.
VarianceThresholdSelector is added to PySpark

### How was this patch tested?
new doctest

Closes #28409 from huaxingao/variance_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-06 09:11:03 -05:00
yi.wu b16ea8e1ab [SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries
### What changes were proposed in this pull request?

Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages.

### Why are the changes needed?

If there're subqueries do not belong to any query stages of the main query, the main query could get final physical plan and update UI before those subqueries finished. As a result, the UI can not reflect the change from the subqueries, e.g. new nodes generated from subqueries.

Before:

<img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png">

After:
<img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png">

### Does this PR introduce _any_ user-facing change?

No(AQE feature hasn't been released).

### How was this patch tested?

Tested manually.

Closes #28460 from Ngone51/fix_aqe_ui.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 12:52:53 +00:00
Huaxin Gao f05560bf50 [SPARK-31127][ML] Implement abstract Selector
### What changes were proposed in this pull request?
Implement abstract Selector. Put the common code among ```ANOVASelector```, ```ChiSqSelector```, ```FValueSelector``` and ```VarianceThresholdSelector``` to Selector.

### Why are the changes needed?
code reuse

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27978 from huaxingao/spark-31127.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-06 16:10:30 +08:00
Liang-Chi Hsieh 4952f1a03c [SPARK-31365][SQL] Enable nested predicate pushdown per data sources
### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added nested predicate pushdown feature that is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all or nothing config, and applies on all data sources.

In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:50:06 +00:00
Daoyuan Wang 53a9bf8fec [SPARK-31595][SQL] Spark sql should allow unescaped quote mark in quoted string
### What changes were proposed in this pull request?
`def splitSemiColon` cannot handle unescaped quote mark like "'" or '"' correctly. When there are unmatched quotes in a string, `splitSemiColon` will not drop off semicolon as expected.

### Why are the changes needed?
Some regex expression will use quote mark in string. We should process semicolon correctly.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added Unit test and also manual test.

Closes #28393 from adrian-wang/unescaped.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:34:43 +00:00
zhengruifeng ebdf41dd69 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)

### Does this PR introduce any user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28349 from zhengruifeng/blockify_svc_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-06 10:06:23 +08:00
sychen 588966d696 [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters
### What changes were proposed in this pull request?
Metadata-only queries should not include subquery in partition filters.

### Why are the changes needed?

Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`.

### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-06 10:56:19 +09:00
yi.wu 61a6ca5d3f
[SPARK-31643][TEST] Fix flaky o.a.s.scheduler.BarrierTaskContextSuite.barrier task killed, interrupt
### What changes were proposed in this pull request?

Make sure the task has nearly reached `context.barrier()` before killing.

### Why are the changes needed?

In case of the task is killed before it reaches `context.barrier()`, the task will not create the expected file.

```
Error Message
org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed.
Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed.
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1(BarrierTaskContextSuite.scala:266)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1$adapted(BarrierTaskContextSuite.scala:226)
	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:163)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.testBarrierTaskKilled(BarrierTaskContextSuite.scala:226)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$29(BarrierTaskContextSuite.scala:277)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
```

[Here's](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122273/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/barrier_task_killed__interrupt/) the full error messages.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #28454 from Ngone51/fix_kill_interrupt.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-05 12:36:42 -07:00
Steve Loughran 86c4e43525
[SPARK-31644][BUILD] Make Spark's guava version configurable from the command line
### What changes were proposed in this pull request?

This adds the maven property guava.version which can be
used to control the guava version for a build.

It does not change the current version.

### Why are the changes needed?

All future Hadoop releases are going to be built with a later guava version, including Hadoop 3.1.4. This means to run the spark tests with that release you need to update the spark guava version. This patch lets whoever builds spark do this locally.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

ran the hadoop-cloud module tests with the 3.1.4 RC0

```
mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging -Phadoop-cloud,yarn,kinesis-asl test --pl hadoop-cloud
```

observed the linkage problem

```
  java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
```

made the version configurable, retested with

```
-Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging Dguava.version=27.0-jre
```

all good.

Closes #28455 from steveloughran/SPARK-31644-guava-version.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-05 12:17:24 -07:00
Max Gekk bd26429931 [SPARK-31641][SQL] Fix days conversions by JSON legacy parser
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.

### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-11|
+----------+
```

After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### How was this patch tested?
Add a test to `JsonSuite`.

Closes #28453 from MaxGekk/json-rebase-legacy-days.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:15:31 +00:00
Max Gekk bef5828e12 [SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold
### What changes were proposed in this pull request?
Skip timestamps rebasing after a global threshold when there is no difference between Julian and Gregorian calendars. This allows to avoid checking hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.

### Why are the changes needed?
The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`.

Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             376            388          10         13.3          75.2       1.1X
Collect java.sql.Timestamp                         1878           1937          64          2.7         375.6       0.2X
```

After:
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             249            264          24         20.1          49.8       1.7X
Collect java.sql.Timestamp                         1503           1523          24          3.3         300.5       0.3X
```

Perf improvements in average of:

1. From java.sql.Timestamp is ~ 34%
2. To java.sql.Timestamps is ~16%

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.

Closes #28441 from MaxGekk/opt-rebase-common-threshold.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:11:53 +00:00
Akshat Bordia c71198ab6c [SPARK-31621][CORE] Fixing Spark Master UI Issue when application is waiting for workers to launch driver
### What changes were proposed in this pull request?
Fixing an issue where Spark Master UI Fails to load if the application is waiting for workers to launch driver.

**Root Cause:**
This is happening due to the fact that the submitted application is waiting for a worker to be free to run the driver. Due to this resource is set to null in the formatResourcesAddresses method and this is running into null pointer exception.
![image](https://user-images.githubusercontent.com/31816865/80801557-77ee9300-8bca-11ea-92b7-b8df58b68de3.png)

**Fix:**
Added a null check before forming a resource address and display "None" if the driver isn't launched yet.

### Why are the changes needed?

Spark Master UI should load as expected when applications are waiting for workers to run driver.

### Does this PR introduce _any_ user-facing change?
The worker column in Spark Master UI will show "None" if the driver hasn't been launched yet.
![image](https://user-images.githubusercontent.com/31816865/80801671-be43f200-8bca-11ea-86c3-381925f82cc7.png)

### How was this patch tested?
Tested on a local setup. Launched 2 applications and ensured that Spark Master UI loads fine.
![image](https://user-images.githubusercontent.com/31816865/80801883-5b9f2600-8bcb-11ea-8a1a-cc597aabc4c2.png)

Closes #28429 from akshatb1/MasterUIBug.

Authored-by: Akshat Bordia <akshat.bordia31@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-05 08:58:37 -05:00
wang-zhun f3891e377f [SPARK-31235][YARN] Separates different categories of applications
### What changes were proposed in this pull request?
This PR adds `spark.yarn.applicationType` to identify the application type

### Why are the changes needed?
The current application defaults to the SPARK type.
In fact, different types of applications have different characteristics and are suitable for different scenarios.For example: SPAKR-SQL, SPARK-STREAMING.
I recommend distinguishing them by the parameter `spark.yarn.applicationType` so that we can more easily manage and maintain different types of applications.

### How was this patch tested?
1.add UT
2.Tested by verifying Yarn-UI `ApplicationType` in the following cases:
- client and cluster mode

Need additional explanation:
limit cannot exceed 20 characters, can be empty or space
The reasons are as follows:
```
// org.apache.hadoop.yarn.server.resourcemanager.submitApplication.
 if (submissionContext.getApplicationType() == null) {
      submissionContext
        .setApplicationType(YarnConfiguration.DEFAULT_APPLICATION_TYPE);
} else {
      // APPLICATION_TYPE_LENGTH = 20
      if (submissionContext.getApplicationType().length() > YarnConfiguration.APPLICATION_TYPE_LENGTH) {
        submissionContext.setApplicationType(submissionContext
          .getApplicationType().substring(0,
            YarnConfiguration.APPLICATION_TYPE_LENGTH));
      }
    }
```

Closes #28009 from wang-zhun/SPARK-31235.

Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-05 08:40:57 -05:00
zhengruifeng 701deac88d [SPARK-31603][ML] AFT uses common functions in RDDLossFunction
### What changes were proposed in this pull request?
1, make AFT reuse common functions in `ml.optim`, rather than making its own impl.

### Why are the changes needed?
The logic in optimizing AFT is quite similar to other algorithms like other algs based on `RDDLossFunction`,
We should reuse the common functions.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #28404 from zhengruifeng/mv_aft_optim.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-05 08:35:20 -05:00
Dilip Biswal 5052d9557d [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table
### What changes were proposed in this pull request?
This PR is to clean up the markdown file in remaining pages in sql reference. The first one was done by gatorsmile in  [28415](https://github.com/apache/spark/pull/28415)

- Replace HTML table by MD table
- **sql-ref-ansi-compliance.md**
<img width="967" alt="Screen Shot 2020-05-01 at 4 36 35 PM" src="https://user-images.githubusercontent.com/14225158/80848981-1cbca080-8bca-11ea-8a5d-63174b31c800.png">
- **sql-ref-datatypes.md (Scala)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 37 30 PM" src="https://user-images.githubusercontent.com/14225158/80849057-6a390d80-8bca-11ea-8866-ab08bab31432.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 39 18 PM" src="https://user-images.githubusercontent.com/14225158/80849061-6c9b6780-8bca-11ea-834c-eb93d3ab47ae.png">
- **sql-ref-datatypes.md (Java)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 41 24 PM" src="https://user-images.githubusercontent.com/14225158/80849138-b3895d00-8bca-11ea-9d3b-555acad2086c.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 41 39 PM" src="https://user-images.githubusercontent.com/14225158/80849140-b6844d80-8bca-11ea-9ca9-1812b6a76c02.png">
- **sql-ref-datatypes.md (Python)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 43 36 PM" src="https://user-images.githubusercontent.com/14225158/80849202-0400ba80-8bcb-11ea-96a5-7caecbf9dbbf.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 43 54 PM" src="https://user-images.githubusercontent.com/14225158/80849205-06fbab00-8bcb-11ea-8f00-6df52b151684.png">
- **sql-ref-datatypes.md (R)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 45 16 PM" src="https://user-images.githubusercontent.com/14225158/80849288-5fcb4380-8bcb-11ea-8277-8589b5bb31bc.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 45 36 PM" src="https://user-images.githubusercontent.com/14225158/80849294-62c63400-8bcb-11ea-9438-b4f1193bc757.png">
- **sql-ref-datatypes.md (SQL)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 48 02 PM" src="https://user-images.githubusercontent.com/14225158/80849336-986b1d00-8bcb-11ea-9736-5fb40496b681.png">
- **sql-ref-syntax-qry-select-tvf.md**
<img width="967" alt="Screen Shot 2020-05-01 at 4 49 32 PM" src="https://user-images.githubusercontent.com/14225158/80849399-d10af680-8bcb-11ea-8dc2-e3e750e21a59.png">

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually using jekyll serve

Closes #28433 from dilipbiswal/sql-doc-table-cleanup.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-05 15:21:14 +09:00
turbofei 8d1f7d2a4a [SPARK-31467][SQL][TEST] Refactor the sql tests to prevent TableAlreadyExistsException
### What changes were proposed in this pull request?
If we add UT in hive/SQLQuerySuite or other sql test suites and use table named `test`.
We may meet TableAlreadyExistsException.

```
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'
```

The reason is that, there is some tests that does not clean up the tables/views.
In this PR, I add `withTempViews` for these tests.

### Why are the changes needed?
To fix the TableAlreadyExistsException issue when adding an UT, which uses table named `test` or others, in some sql test suites, such as hive/SQLQuerySuite.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existed UT.

Closes #28239 from turboFei/SPARK-31467.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-05 15:14:33 +09:00
Max Gekk 735771e7b4 [SPARK-31623][SQL][TESTS] Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write
### What changes were proposed in this pull request?
Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of INT96 and TIMESTAMP_MICROS column types. Here are benchmark results for reading timestamps after 1582 year with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS,  and rebasing on for INT96):

timestamp type | vectorized off (ns/row) | vectorized on (ns/row)
--|--|--
TIMESTAMP_MICROS| 160.1 | 50.2
INT96 | 215.6 | 117.8
TIMESTAMP_MILLIS | 159.9 | 60.6

### Why are the changes needed?
To compare default timestamp type `TIMESTAMP_MICROS` with other types in the case if an user decides to switch on them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the benchmarks via:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 05:40:15 +00:00
Dongjoon Hyun 0907f2e7b5 [SPARK-27963][FOLLOW-UP][DOCS][CORE] Remove for testing because CleanerListener is used ExecutorMonitor during dynamic allocation
### What changes were proposed in this pull request?

This PR aims to remove `for testing` from `CleanerListener` class description to promote this private class more clearly.

### Why are the changes needed?

After SPARK-27963 (Allow dynamic allocation without a shuffle service), `CleanerListener` is used in `ExecutorMonitor` during dynamic allocation. Specifically, `CleanerListener.shuffleCleaned` is used.
- https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L385-L392

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a private doc-only change.

Closes #28452 from dongjoon-hyun/SPARK-MINOR.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-05 10:07:30 +09:00
beliefer b9494206a5 [SPARK-31372][SQL][TEST][FOLLOW-UP] Improve ExpressionsSchemaSuite so that easy to track the diff
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/28194.
As discussed at https://github.com/apache/spark/pull/28194/files#r418418796.
This PR will improve `ExpressionsSchemaSuite` so that easy to track the diff.
Although `ExpressionsSchemaSuite` at line
b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L165)
just want to compare the total size between expected output size and the newest output size, the scalatest framework will output the extra information contains all the content of expected output and newest output.
This PR will try to avoid this issue.
After this PR, the exception looks like below:
```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info]   340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```

### Why are the changes needed?
Make the exception more concise and clear.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-05 10:04:16 +09:00
Max Gekk 372ccba063
[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 17:27:02 -07:00
Dongjoon Hyun e7995c2ddc
[SPARK-31633][BUILD] Upgrade SLF4J from 1.7.16 to 1.7.30
### What changes were proposed in this pull request?

This PR aims to upgrade SLF4J from 1.7.16 to 1.7.30.

### Why are the changes needed?

SLF4J 1.7.23+ is required to enable `slf4j-log4j12` with MDC feature to run under Java 9. Also, this will bring all latest bug fixes.
- http://www.slf4j.org/news.html

> When running under Java 9, log4j version 1.2.x is unable to correctly parse the "java.version" system property. Assuming an inccorect Java version, it proceeded to disable its MDC functionality. The slf4j-log4j12 module shipping in this release fixes the issue by tweaking MDC internals by reflection, allowing log4j to run under Java 9. See also SLF4J-393.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #28446 from dongjoon-hyun/SPARK-31633.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 08:14:12 -07:00
Burak Yavuz 02a319d7e1 [SPARK-31624] Fix SHOW TBLPROPERTIES for V2 tables that leverage the session catalog
## What changes were proposed in this pull request?

SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that, by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties. Henceforth, if they cannot be set or controlled by the user, then they shouldn't be displayed as such.

### Why are the changes needed?

Shows the incorrect table properties, i.e. only what exists in the Hive MetaStore for V2 tables that may have table properties outside of the MetaStore.

### Does this PR introduce _any_ user-facing change?

Fixes a bug

### How was this patch tested?

Regression test

Closes #28434 from brkyvz/ddlCommands.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-04 12:22:29 +00:00
Kazuaki Ishizaki 35fcc8d5c5 [MINOR][DOCS] Fix typo in documents
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in `project/MimaExcludes.scala`

### Why are the changes needed?
Better readability of documents

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed

Closes #28447 from kiszk/typo_20200504.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 16:53:50 +09:00
Wenchen Fan f72220b8ab [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?

Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.

### Why are the changes needed?

Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
[info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
[info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
[info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
[info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X
```

After this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
[info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
[info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
[info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
[info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X
```

Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.

Closes #28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 15:30:10 +09:00