Commit graph

26953 commits

Author SHA1 Message Date
Xingbo Jiang c2c5b2df50 [SPARK-31239][CORE][TEST] Increase await duration in WorkerDecommissionSuite.verify a task with all workers decommissioned succeeds
### What changes were proposed in this pull request?

The test case has been flaky because the execution time sometimes exceeds the await duration. Increase the await duration to avoid flakiness.

### How was this patch tested?

Tested locally and it didn't fail anymore.

Closes #28007 from jiangxb1987/DecomTest.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-25 13:43:35 +09:00
Kousuke Saruta 999c9ed10c [SPARK-31081][UI][SQL] Make display of stageId/stageAttemptId/taskId of sql metrics toggleable
### What changes were proposed in this pull request?

This is another solution for `SPARK-31081` and #27849 .
I added a checkbox which can toggle display of stageId/taskid in the SQL's DAG page.
Mainly, I implemented the toggleable texts in boxes with HTML label feature provided by `dagre-d3`.
The additional metrics are enclosed by `<span>` and control the appearance of the text.
But the exception is additional metrics in clusters.
We can use HTML label for cluster but layout will be broken so I choosed normal text label for clusters.
Due to that, this solution contains a little bit tricky code in`spark-sql-viz.js` to manipulate the metric texts and generate DOMs.

### Why are the changes needed?

It makes metrics harder to read after #26843 and user may not interest in extra info(stageId/StageAttemptId/taskId ) when they do not need debug.
#27849 control the appearance by a new configuration property but providing a checkbox is more flexible.

### Does this PR introduce any user-facing change?

Yes.
[Additional metrics shown]
![with-checked](https://user-images.githubusercontent.com/4736016/77244214-0f6cd780-6c56-11ea-9275-a30758dd5339.png)

[Additional metrics hidden]
![without-chedked](https://user-images.githubusercontent.com/4736016/77244219-14ca2200-6c56-11ea-9874-33a466085fce.png)

### How was this patch tested?

Tested manually with a simple DataFrame operation.
* The appearance of additional metrics in the boxes are controlled by the newly added checkbox.
* No error found with JS-debugger.
* Checked/not-checked state is preserved after reloading.

Closes #27927 from sarutak/SPARK-31081.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-03-24 13:37:13 -07:00
Kousuke Saruta 88864c0615 [SPARK-31161][WEBUI] Refactor the on-click timeline action in streagming-page.js
### What changes were proposed in this pull request?

Refactor `streaming-page.js` by making on-click timeline action customizable.

### Why are the changes needed?

In the current implementation, `streaming-page.js` is used from Streaming page and Structured Streaming page but the implementation of the on-click timeline action is strongly dependent on Streamng page.
Structured Streaming page doesn't define the on-click action for now but it's better to remove the dependncy for the future.

Originally, I make this change to fix `SPARK-31128` but #27883 resolved it.
So, now this is just for refactoring.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual tests with following code and confirmed there are no regression and no error in the debug console in Firefox.

For Structured Streaming:
```
spark.readStream.format("socket").options(Map("host"->"localhost", "port"->"8765")).load.writeStream.format("console").start
```
And then, visited Structured Streaming page and there were no error in the debug console when I clicked a point in the timeline.

For Spark Streaming:
```
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(1))
ssc.socketTextStream("localhost", 8765)
dstream.foreachRDD(rdd => rdd.foreach(println))
ssc.start
```
And then, visited Streaming page and confirmed scrolling down and hilighting work well and there were no error in the debug console when I clicked a point in the timeline.

Closes #27921 from sarutak/followup-SPARK-29543-fix-oncick.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-24 13:00:46 -05:00
yi.wu f6ff7d0cf8 [SPARK-30127][SQL] Support case class parameter for typed Scala UDF
### What changes were proposed in this pull request?

To support  case class parameter for typed Scala UDF, e.g.

```
case class TestData(key: Int, value: String)
val f = (d: TestData) => d.key * d.value.toInt
val myUdf = udf(f)
val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2")
checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil)
```

### Why are the changes needed?

Currently, Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if user want to apply an operation on one column, and the column is struct type. You must access data from a Row object, instead of domain object like Dataset operations. It will be great if UDF can work on types that are supported by Dataset, e.g. case class.

And here's benchmark result of using case class comparing to row:

```scala

// case class:  58ms 65ms 59ms 64ms 61ms
// row:         59ms 64ms 73ms 84ms 69ms
val f1 = (d: TestData) => s"${d.key}, ${d.value}"
val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}"
val udf1 = udf(f1)
// set spark.sql.legacy.allowUntypedScalaUDF=true
val udf2 = udf(f2, StringType)

val df = spark.range(100000).selectExpr("cast (id as int) as id")
    .select(struct('id, lit("str")).as("col"))
df.cache().collect()

// warmup to exclude some extra influence
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()

start = System.currentTimeMillis()
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

start = System.currentTimeMillis()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

```

### Does this PR introduce any user-facing change?

Yes. User now could be able to use typed Scala UDF with case class as input parameter.

### How was this patch tested?

Added unit tests.

Closes #27937 from Ngone51/udf_caseclass_support.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 23:03:57 +08:00
Maxim Gekk 1fd4607d84 [SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.

### Why are the changes needed?
The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.

### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 21:14:25 +08:00
HyukjinKwon c181c45f86 [SPARK-31231][BUILD] Explicitly setuptools version as 46.0.0 in pip package test
### What changes were proposed in this pull request?

For a bit of background,
PIP packaging test started to fail (see [this logs](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)) as of  setuptools 46.1.0 release. In https://github.com/pypa/setuptools/issues/1424, they decided to don't keep the modes in `package_data`.

In PySpark pip installation, we keep the executable scripts in `package_data` fc4e56a54c/python/setup.py (L199-L200), and expose their symbolic links as executable scripts.

So, the symbolic links (or copied scripts) executes the scripts copied from `package_data`, which doesn't have the executable permission in its mode:

```
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: Permission denied
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied
```

The current issue is being tracked at https://github.com/pypa/setuptools/issues/2041

</br>

For what this PR proposes:
It sets the upper bound in PR builder for now to unblock other PRs.  _This PR does not solve the issue yet. I will make a fix after monitoring https://github.com/pypa/setuptools/issues/2041_

### Why are the changes needed?

It currently affects users who uses the latest setuptools. So, _users seem unable to use PySpark with the latest setuptools._ See also https://github.com/pypa/setuptools/issues/2041#issuecomment-602566667

### Does this PR introduce any user-facing change?

It makes CI pass for now. No user-facing change yet.

### How was this patch tested?

Jenkins will test.

Closes #27995 from HyukjinKwon/investigate-pip-packaging.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-24 17:59:43 +09:00
HyukjinKwon bd324007d5 [SPARK-31229][SQL][TESTS] Add unit tests TypeCoercion.findTypeForComplex and Cast.canCast in null <> complex types
### What changes were proposed in this pull request?

This PR (SPARK-31229) is rather a followup of https://github.com/apache/spark/pull/27926 (SPARK-31166). It adds unittests for `TypeCoercion.findTypeForComplex` and `Cast.canCast` about struct, map and array with the respect to null types.

### Why are the changes needed?

To detect which scope was broken in the future easily.

### Does this PR introduce any user-facing change?

No, it's a test-only.

### How was this patch tested?

Unittests were added.

Closes #27990 from HyukjinKwon/SPARK-31166-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-24 14:10:59 +09:00
Wenchen Fan 1d0f54951e [SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding.

### Why are the changes needed?

avoid breaking changes

### Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

### How was this patch tested?

new tests.

Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 12:07:22 +08:00
Maxim Gekk aa3a7429f4 [SPARK-31159][SQL][FOLLOWUP] Move checking of the rebaseDateTime flag out of the loop in VectorizedColumnReader
### What changes were proposed in this pull request?
In the PR, I propose to refactor reading of timestamps of the `TIMESTAMP_MILLIS` logical type from Parquet files in `VectorizedColumnReader`, and move checking of the `rebaseDateTime` flag out of the internal loop.

### Why are the changes needed?
To avoid any additional overhead of the checking the SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` introduced by the PR https://github.com/apache/spark/pull/27915.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the test suite `ParquetIOSuite`.

Closes #27973 from MaxGekk/rebase-parquet-datetime-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-23 23:02:48 +09:00
Wenchen Fan d929c0dfe8 [SPARK-31133][SQL][DOC] fix sql ref doc for DML
### What changes were proposed in this pull request?

`INSERT OVERWRITE DIRECTORY` can only use file format (class implements `org.apache.spark.sql.execution.datasources.FileFormat`). This PR fixes it and other minor improvement.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #27891 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 22:00:50 +08:00
yi.wu 5c4d44bb83 [SPARK-31190][SQL] ScalaReflection should not erasure user defined AnyVal type
### What changes were proposed in this pull request?

Improve `ScalaReflection` to only don't erasure non user defined `AnyVal` type, but still erasure other types, e.g. `Any`. And this brings two benefits:

1. Give better encode error message for some unsupported types, e.g. `Any`

2. Won't miss the walk path for the `AnyVal` type

### Why are the changes needed?

Firstly, PR #15284 added encode(serializeFor/deserializeFor) support for value class, which extends `AnyVal`, by not erasure types. But, this also introduce a problem that when user try to encoder unsupported types, e.g. `Any`, it will fail on `java.lang.ClassNotFoundException: scala.Any` due to the reason that `scala.Any` doesn't erasure to `java.lang.Object`.

Also, in current `getClassNameFromType()`, it always erasure types which could missing walked path for user defined `AnyVal` types.

### Does this PR introduce any user-facing change?

Yes. For the test below:

```
case class Bar(i: Any)
case class Foo(i: Bar) extends AnyVal

test() {
  implicitly[ExpressionEncoder[Foo]]
}
```

Before:

```
java.lang.ClassNotFoundException: scala.Any
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
 ...
````

After:
```
java.lang.UnsupportedOperationException: No Encoder found for Any
 - field (class: "java.lang.Object", name: "i")
 - field (class: "org.apache.spark.sql.catalyst.encoders.Bar", name: "i")
 - root class: "org.apache.spark.sql.catalyst.encoders.Foo"
 at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:561)
```

### How was this patch tested?

Added unit test and test manually.

Closes #27959 from Ngone51/impr_anyval.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 16:28:34 +08:00
Maxim Gekk db6247faa8 [SPARK-31211][SQL] Fix rebasing of 29 February of Julian leap years
### What changes were proposed in this pull request?
In the PR, I propose to fix the issue of rebasing leap years in Julian calendar to Proleptic Gregorian calendar in which the years are not leap years. In the Julian calendar, every four years is a leap year, with a leap day added to the month of February. In Proleptic Gregorian calendar, every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years, if they are exactly divisible by 400. In this ways, the date **1000-02-29** exists in the Julian calendar but not in Proleptic Gregorian calendar.

I modified the `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` by passing 1 as a day number of month while forming `LocalDate` or `LocalDateTime`, and adding the number of days using the `plusDays()` method. For example, **1000-02-29** doesn't exist in Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the `LocalDate.of(1000, 2, 1)` date and add 28 days. The `plusDays(28)` method produces the next valid date after `1000-02-28` which is **1000-03-01**.

### Why are the changes needed?
Before the changes, the `java.time.DateTimeException` exception is raised while loading the date `1000-02-29` from parquet files saved by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```
The parquet files were saved via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-02-29|
+----------+
```

### Does this PR introduce any user-facing change?
Yes, after the fix:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-03-01|
+----------+
```

### How was this patch tested?
Added tests to `DateTimeUtilsSuite`.

Closes #27974 from MaxGekk/julian-date-29-feb.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 14:21:24 +08:00
LantaoJin 929b794e25
[SPARK-30494][SQL] Fix cached data leakage during replacing an existing view
### What changes were proposed in this pull request?

The cached RDD for plan "select 1" stays in memory forever until the session close. This cached data cannot be used since the view temp1 has been replaced by another plan. It's a memory leak.

We can reproduce by below commands:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create or replace temporary view temp1 as select 1")
scala> spark.sql("cache table temp1")
scala> spark.sql("create or replace temporary view temp1 as select 1, 2")
scala> spark.sql("cache table temp1")
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined)
scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined)
```

### Why are the changes needed?
Fix the memory leak, specially for long running mode.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Add an unit test.

Closes #27185 from LantaoJin/SPARK-30494.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-22 22:22:13 -07:00
beliefer a0cf972985 [SPARK-31141][DSTREAMS][DOC] Add version information to the configuration of Dstreams
### What changes were proposed in this pull request?
Add version information to the configuration of `Dstreams`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.streaming.backpressure.enabled | 1.5.0 | SPARK-9967 and SPARK-10099 | 392bd19d678567751cd3844d9d166a7491c5887e#diff-1b584c4ed88a9022abb11d594f760997 |  
spark.streaming.backpressure.initialRate | 2.0.0 | SPARK-11627 | 7218c0eba957e0a079a407b79c3a050cce9647b2#diff-c64d571ef32d2dbf76e965ecd04a9f52 |  
spark.streaming.blockInterval | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-54d85b29e4349628a0de525c119399b5 |  
spark.streaming.receiver.maxRate | 1.0.2 | SPARK-1341 | ca19cfbcd5cfac9ad731350dfeea14355aec87d6#diff-c64d571ef32d2dbf76e965ecd04a9f52 |  
spark.streaming.receiver.writeAheadLog.enable | 1.2.1 | SPARK-4482 | ce5ea0fd611ce560f6e1fac83562469bdb97091e#diff-0607b70e4e79cbbc1a128c45784cb813 |  
spark.streaming.unpersist | 0.9.0 | None | 08b9fec93d00ff0ebb49af4d9ac72d2806eded02#diff-bcf5f84f78d23ebde7d532bea756bc57 |  
spark.streaming.stopGracefullyOnShutdown | 1.4.0 | SPARK-7776 | a17a5cb302c5fa6a4d3e9e3e0fa2100c0b5436d6#diff-8a7f0e3f26c15ba484e6312c3caf033d |  
spark.streaming.kafka.maxRetries | 1.3.0 | SPARK-4964 | a119cae48030520da9f26ee9a1270bed7f33031e#diff-26cb4369f86050dc2e75cd16291b2844 |  
spark.streaming.ui.retainedBatches | 1.0.0 | SPARK-1386 | f36dc3fed0a0671b0712d664db859da28c0a98e2#diff-56b8d67d07284cfab165d5363bd3500e |
spark.streaming.driver.writeAheadLog.closeFileAfterWrite | 1.6.0 | SPARK-11324 | 4f030b9e82172659d250281782ac573cbd1438fc#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite | 1.6.0 | SPARK-11324 | 4f030b9e82172659d250281782ac573cbd1438fc#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.receiver.writeAheadLog.class | 1.4.0 | SPARK-7056 | 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.receiver.writeAheadLog.rollingIntervalSecs | 1.4.0 | SPARK-7056 | 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.receiver.writeAheadLog.maxFailures | 1.2.0 | SPARK-4028 | 234de9232bcfa212317a8073c4a82c3863b36b14#diff-8cec1a581eebcad673dc8930b1a2801c |  
spark.streaming.driver.writeAheadLog.class | 1.4.0 | SPARK-7056 | 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.driver.writeAheadLog.rollingIntervalSecs | 1.4.0 | SPARK-7056 | 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.driver.writeAheadLog.maxFailures | 1.4.0 | SPARK-7056 | 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.driver.writeAheadLog.allowBatching | 1.6.0 | SPARK-11141 | dccc4645df629f35c4788d50b2c0a6ab381db4b7#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.driver.writeAheadLog.batchingTimeout | 1.6.0 | SPARK-11141 | dccc4645df629f35c4788d50b2c0a6ab381db4b7#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d |  
spark.streaming.sessionByKey.deltaChainThreshold | 1.6.0 | SPARK-11290 | daa74be6f863061221bb0c2f94e70672e6fcbeaa#diff-e0a40541298f885606a2361ff9c5af6c |  
spark.streaming.backpressure.rateEstimator | 1.5.0 | SPARK-8977 | 819be46e5a73f2d19230354ebba30c58538590f5#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 |  
spark.streaming.backpressure.pid.proportional | 1.5.0 | SPARK-8979 | 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 |  
spark.streaming.backpressure.pid.integral | 1.5.0 | SPARK-8979 | 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 |  
spark.streaming.backpressure.pid.derived | 1.5.0 | SPARK-8979 | 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 |  
spark.streaming.backpressure.pid.minRate | 1.5.0 | SPARK-9966 | 612b4609bdd38763725ae07d77c2176aa6756e64#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 |  
spark.streaming.concurrentJobs | 0.7.0 | None | c97ebf64377e853ab7c616a103869a4417f25954#diff-839f06302b2d648a85436486fc13c85d |  
spark.streaming.internal.batchTime | 1.4.0 | SPARK-6862 | 1b7106b867bc0aa4d64b669d79b646f862acaf47#diff-25124e4f06a1da237bf486eceb1f7967 | It's not a configuration, it's a property
spark.streaming.internal.outputOpId | 1.4.0 | SPARK-6862 | 1b7106b867bc0aa4d64b669d79b646f862acaf47#diff-25124e4f06a1da237bf486eceb1f7967 | It's not a configuration, it's a property
spark.streaming.clock | 0.7.0 | None | cae894ee7aefa4cf9b1952038a48be81e1d2a856#diff-839f06302b2d648a85436486fc13c85d |  
spark.streaming.gracefulStopTimeout | 1.0.0 | SPARK-1332 | 94cbe2329021296b660d88f3e8ef3734374020d2#diff-2f8c5c038fda47b9875e10785fdd2498 |  
spark.streaming.manualClock.jump | 0.7.0 | None | fc3d0b602a08fdd182c2138506d1cd9952631f95#diff-839f06302b2d648a85436486fc13c85d |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'

### How was this patch tested?
Exists UT

Closes #27898 from beliefer/add-version-to-dstream-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-23 13:01:44 +09:00
zhengruifeng ded51b04d2 [SPARK-31138][ML][FOLLOWUP] ANOVA optimization
### What changes were proposed in this pull request?
1, remove unused var `numFeatures`;
2, remove the computation of `numSamples` and `numClasses`, since they can be directly infered by `counts: OpenHashMap[Double, Long]`

### Why are the changes needed?
remove a unnecessary job to compute `numSamples` and `numClasses`

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27979 from zhengruifeng/anova_followup.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-23 11:16:57 +08:00
beliefer ae0699d4b5 [SPARK-31002][CORE][DOC][FOLLOWUP] Add version information to the configuration of Core
### What changes were proposed in this pull request?
This PR follows up #27847, #27852 and https://github.com/apache/spark/pull/27913.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.storage.localDiskByExecutors.cacheSize | 3.0.0 | SPARK-27651 | fd2bf55abaab08798a428d4e47d4050ba2b82a95#diff-6bdad48cfc34314e89599655442ff210 |
spark.storage.memoryMapLimitForTests | 2.3.0 | SPARK-3151 | b8ffb51055108fd606b86f034747006962cd2df3#diff-abd96f2ae793cd6ea6aab5b96a3c1d7a |  
spark.barrier.sync.timeout | 2.4.0 | SPARK-24817 | 388f5a0635a2812cd71b08352e3ddc20293ec189#diff-6bdad48cfc34314e89599655442ff210 |
spark.scheduler.blacklist.unschedulableTaskSetTimeout | 2.4.1 | SPARK-22148 | 52e9711d01694158ecb3691f2ec25c0ebe4b0207#diff-6bdad48cfc34314e89599655442ff210 |  
spark.scheduler.barrier.maxConcurrentTasksCheck.interval | 2.4.0 | SPARK-24819 | bfb74394a5513134ea1da9fcf4a1783b77dd64e4#diff-6bdad48cfc34314e89599655442ff210 |  
spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures | 2.4.0 | SPARK-24819 | bfb74394a5513134ea1da9fcf4a1783b77dd64e4#diff-6bdad48cfc34314e89599655442ff210 |  
spark.unsafe.exceptionOnMemoryLeak | 1.4.0 | SPARK-7076 and SPARK-7077 and SPARK-7080 | f49284b5bf3a69ed91a5e3e6e0ed3be93a6ab9e4#diff-5a0de266c82b95adb47d9bca714e1f1b |  
spark.unsafe.sorter.spill.read.ahead.enabled | 2.3.0 | SPARK-21113 | 1e978b17d63d7ba20368057aa4e65f5ef6e87369#diff-93a086317cea72a113cf81056882c206 |  
spark.unsafe.sorter.spill.reader.buffer.size | 2.1.0 | SPARK-16862 | c1937dd19a23bd096a4707656c7ba19fb5c16966#diff-93a086317cea72a113cf81056882c206 |  
spark.plugins | 3.0.0 | SPARK-29397 | d51d228048d519a9a666f48dc532625de13e7587#diff-6bdad48cfc34314e89599655442ff210 |  
spark.cleaner.periodicGC.interval | 1.6.0 | SPARK-8414 | 72da2a21f0940b97757ace5975535e559d627688#diff-75141521b1d55bc32d72b70032ad96c0 |
spark.cleaner.referenceTracking | 1.0.0 | SPARK-1103 | 11eabbe125b2ee572fad359c33c93f5e6fdf0b2d#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.cleaner.referenceTracking.blocking | 1.0.0 | SPARK-1103 | 11eabbe125b2ee572fad359c33c93f5e6fdf0b2d#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.cleaner.referenceTracking.blocking.shuffle | 1.1.1 | SPARK-3139 | 5cf1e440137006eedd6846ac8fa57ccf9fd1958d#diff-75141521b1d55bc32d72b70032ad96c0 |  
spark.cleaner.referenceTracking.cleanCheckpoints | 1.4.0 | SPARK-2033 | 25998e4d73bcc95ac85d9af71adfdc726ec89568#diff-440e866c5df0b8386aff57f9f8bd8db1 |  
spark.executor.logs.rolling.strategy | 1.1.0 | SPARK-1940 | 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 |
spark.executor.logs.rolling.time.interval | 1.1.0 | SPARK-1940 | 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 |
spark.executor.logs.rolling.maxSize | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.executor.logs.rolling.maxRetainedFiles | 1.1.0 | SPARK-1940 | 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 |
spark.executor.logs.rolling.enableCompression | 2.0.2 | SPARK-17711 | 26e978a93f029e1a1b5c7524d0b52c8141b70997#diff-2b4575e096e4db7165e087f9429f2a02 |  
spark.master.rest.enabled | 1.3.0 | SPARK-5388 | 6ec0cdc14390d4dc45acf31040f21e1efc476fc0#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.master.rest.port | 1.3.0 | SPARK-5388 | 6ec0cdc14390d4dc45acf31040f21e1efc476fc0#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.master.ui.port | 1.1.0 | SPARK-2857 | 12f99cf5f88faf94d9dbfe85cb72d0010a3a25ac#diff-366c88f47e9b5cfa4d4305febeb8b026 |  
spark.io.compression.snappy.blockSize | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.io.compression.lz4.blockSize | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.io.compression.codec | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-df9e6118c481ceb27faa399114fac0a1 |  
spark.io.compression.zstd.bufferSize | 2.3.0 | SPARK-19112 | 444bce1c98c45147fe63e2132e9743a0c5e49598#diff-df9e6118c481ceb27faa399114fac0a1 |  
spark.io.compression.zstd.level | 2.3.0 | SPARK-19112 | 444bce1c98c45147fe63e2132e9743a0c5e49598#diff-df9e6118c481ceb27faa399114fac0a1 |  
spark.io.warning.largeFileThreshold | 3.0.0 | SPARK-28366 | 26d03b62e20d053943d03b5c5573dd349e49654c#diff-6bdad48cfc34314e89599655442ff210 |  
spark.eventLog.compression.codec | 3.0.0 | SPARK-28118 | 47f54b1ec717d0d744bf3ad46bb1ed3542b667c8#diff-6bdad48cfc34314e89599655442ff210 |  
spark.buffer.size | 0.5.0 | None | 4b1646a25f7581cecae108553da13833e842e68a#diff-eaf125f56ce786d64dcef99cf446a751 |  
spark.locality.wait.process | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 |  
spark.locality.wait.node | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 |  
spark.locality.wait.rack | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 |  
spark.reducer.maxSizeInFlight | 1.4.0 | SPARK-5932 | 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.reducer.maxReqsInFlight | 2.0.0 | SPARK-6166 | 894921d813a259f2f266fde7d86d2ecb5a0af24b#diff-eb30a71e0d04150b8e0b64929852e38b |  
spark.broadcast.compress | 0.6.0 | None | efc5423210d1aadeaea78273a4a8f10425753079#diff-76170a9c8f67b542bc58240a0a12fe08 |  
spark.broadcast.blockSize | 0.5.0 | None | b8ab7862b8bd168bca60bd930cd97c1099fbc8a8#diff-271d7958e14cdaa46cf3737cfcf51341 |  
spark.broadcast.checksum | 2.1.1 | SPARK-18188 | 06a56df226aa0c03c21f23258630d8a96385c696#diff-4f43d14923008c6650a8eb7b40c07f74 |
spark.broadcast.UDFCompressionThreshold | 3.0.0 | SPARK-28355 | 79e204770300dab4a669b9f8e2421ef905236e7b#diff-6bdad48cfc34314e89599655442ff210 |
spark.rdd.compress | 0.6.0 | None | efc5423210d1aadeaea78273a4a8f10425753079#diff-76170a9c8f67b542bc58240a0a12fe08 |  
spark.rdd.parallelListingThreshold | 2.0.0 | SPARK-9926 | 80a4bfa4d1c86398b90b26c34d8dcbc2355f5a6a#diff-eaababfc87ea4949f97860e8b89b7586 |
spark.rdd.limit.scaleUpFactor | 2.1.0 | SPARK-16984 | 806d8a8e980d8ba2f4261bceb393c40bafaa2f73#diff-1d55e54678eff2076263f2fe36150c17 |  
spark.serializer | 0.5.0 | None | fd1d255821bde844af28e897fabd59a715659038#diff-b920b65c23bf3a1b3326325b0d6a81b2 |  
spark.serializer.objectStreamReset | 1.0.0 | SPARK-942 | 40566e10aae4b21ffc71ea72702b8df118ac5c8e#diff-6a59dfc43d1b31dc1c3072ceafa829f5 |  
spark.serializer.extraDebugInfo | 1.3.0 | SPARK-5307 | 636408311deeebd77fb83d2249e0afad1a1ba149#diff-6a59dfc43d1b31dc1c3072ceafa829f5 |  
spark.jars | 0.9.0 | None | f1d206c6b4c0a5b2517b05af05fdda6049e2f7c2#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.files | 1.0.0 | None | 29ee101c73bf066bf7f4f8141c475b8d1bd3cf1c#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.submit.deployMode | 1.5.0 | SPARK-6797 | 7f487c8bde14dbdd244a3493ad11a129ef2bb327#diff-4d2ab44195558d5a9d5f15b8803ef39d |  
spark.submit.pyFiles | 1.0.1 | SPARK-1549 | d7ddb26e1fa02e773999cc4a97c48d2cd1723956#diff-4d2ab44195558d5a9d5f15b8803ef39d |
spark.scheduler.allocation.file | 0.8.1 | None | 976fe60f7609d7b905a34f18743efabd966407f0#diff-9bc0105ee454005379abed710cd20ced |  
spark.scheduler.minRegisteredResourcesRatio | 1.1.1 | SPARK-2635 | 3311da2f9efc5ff2c7d01273ac08f719b067d11d#diff-7d99a7c7a051e5e851aaaefb275a44a1 |  
spark.scheduler.maxRegisteredResourcesWaitingTime | 1.1.1 | SPARK-2635 | 3311da2f9efc5ff2c7d01273ac08f719b067d11d#diff-7d99a7c7a051e5e851aaaefb275a44a1 |  
spark.scheduler.mode | 0.8.0 | None | 98fb69822cf780160bca51abeaab7c82e49fab54#diff-cb7a25b3c9a7341c6d99bcb8e9780c92 |  
spark.scheduler.revive.interval | 0.8.1 | None | d0c9d41a061969d409715b86a91937d8de4c29f7#diff-7d99a7c7a051e5e851aaaefb275a44a1 |  
spark.speculation | 0.6.0 | None | e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-4e188f32951dc989d97fa7577858bc7c |  
spark.speculation.interval | 0.6.0 | None | e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-4e188f32951dc989d97fa7577858bc7c |  
spark.speculation.multiplier | 0.6.0 | None | e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-fff59f72dfe6ca4ccb607ad12535da07 |  
spark.speculation.quantile | 0.6.0 | None | e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-fff59f72dfe6ca4ccb607ad12535da07 |  
spark.speculation.task.duration.threshold | 3.0.0 | SPARK-29976 | ad238a2238a9d0da89be4424574436cbfaee579d#diff-6bdad48cfc34314e89599655442ff210 |
spark.yarn.stagingDir | 2.0.0 | SPARK-13063 | bc36df127d3b9f56b4edaeb5eca7697d4aef761a#diff-14b8ed2ef4e3da985300b8d796a38fa9 |  
spark.buffer.pageSize | 1.5.0 | SPARK-9411 | 1b0099fc62d02ff6216a76fbfe17a4ec5b2f3536#diff-1b22e54318c04824a6d53ed3f4d1bb35 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Exists UT

Closes #27931 from beliefer/add-version-to-core-config-part-four.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-23 11:07:43 +09:00
Kent Yao f81f11822c [SPARK-31189][R][DOCS][FOLLOWUP] Replace Datetime pattern links in R doc
### What changes were proposed in this pull request?

Use our own docs for data pattern instructions to replace java doc.

### Why are the changes needed?

fix doc

### Does this PR introduce any user-facing change?

yes. doc changed
### How was this patch tested?

pass jenkins

Closes #27975 from yaooqinn/SPARK-31189-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-22 14:22:44 +09:00
Huaxin Gao 307cfe1f8e [SPARK-31185][ML] Implement VarianceThresholdSelector
### What changes were proposed in this pull request?
Implement a Feature selector that removes all low-variance features. Features with a
variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

### Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.
scikit has implemented this selector.
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Add new test suite.

Closes #27954 from huaxingao/variance-threshold.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-22 12:44:18 +08:00
Jungtaek Lim (HeartSaVioR) f55f6b569b
[SPARK-31101][BUILD] Upgrade Janino to 3.0.16
### What changes were proposed in this pull request?

This PR(SPARK-31101) proposes to upgrade Janino to 3.0.16 which is released recently.

* Merged pull request janino-compiler/janino#114 "Grow the code for relocatables, and do fixup, and relocate".

Please see the commit log.
- https://github.com/janino-compiler/janino/commits/3.0.16

You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html / though release note for Janino 3.0.16 is actually incorrect.

### Why are the changes needed?

We got some report on failure on user's query which Janino throws error on compiling generated code. The issue is here: janino-compiler/janino#113 It contains the information of generated code, symptom (error), and analysis of the bug, so please refer the link for more details.
Janino 3.0.16 contains the PR janino-compiler/janino#114 which would enable Janino to succeed to compile user's query properly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #27932 from HeartSaVioR/SPARK-31101-janino-3.0.16.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 19:10:23 -07:00
Gabor Somogyi bf342bafa8
[SPARK-30541][TESTS] Implement KafkaDelegationTokenSuite with testRetry
### What changes were proposed in this pull request?
`KafkaDelegationTokenSuite` has been ignored because showed flaky behaviour. In this PR I've changed the approach how the test executed and turning it on again. This PR contains the following:
* The test runs in separate JVM in order to avoid modified security context
* The body of the test runs in `testRetry` which reties if failed
* Additional logs to analyse possible failures
* Enhanced clean-up code

### Why are the changes needed?
`KafkaDelegationTokenSuite ` is ignored.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Executed the test in loop 1k+ times in jenkins (locally much harder to reproduce).

Closes #27877 from gaborgsomogyi/SPARK-30541.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 18:59:29 -07:00
Prashant Sharma 3799d2b9d8
[SPARK-30715][K8S][TESTS][FOLLOWUP] Update k8s client version in IT as well
### What changes were proposed in this pull request?
This is a follow up for SPARK-30715 . Kubernetes client version in sync in integration-tests and kubernetes/core

### Why are the changes needed?
More than once, the kubernetes client version has gone out of sync between integration tests and kubernetes/core. So brought them up in sync again and added a comment to save us from future need of this additional followup.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually.

Closes #27948 from ScrapCodes/follow-up-spark-30715.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 18:26:53 -07:00
Eric Wu 3a48ea1fe0
[SPARK-31184][SQL] Support getTablesByType API of Hive Client
### What changes were proposed in this pull request?
Hive 2.3+ supports `getTablesByType` API, which will provide an efficient way to get HiveTable with specific type. Now, we have following mappings when using `HiveExternalCatalog`.
```
CatalogTableType.EXTERNAL  =>  HiveTableType.EXTERNAL_TABLE
CatalogTableType.MANAGED => HiveTableType.MANAGED_TABLE
CatalogTableType.VIEW => HiveTableType.VIRTUAL_VIEW
```
Without this API, we need to achieve the goal by `getTables` + `getTablesByName` + `filter with type`.

This PR add `getTablesByType` in `HiveShim`. For those hive versions don't support this API, `UnsupportedOperationException` will be thrown. And the upper logic should catch the exception and fallback to the filter solution mentioned above.

Since the JDK11 related fix in `Hive` is not released yet, manual tests against hive 2.3.7-SNAPSHOT is done by following the instructions of SPARK-29245.

### Why are the changes needed?
This API will provide better usability and performance if we want to get a list of hiveTables with specific type. For example `HiveTableType.VIRTUAL_VIEW` corresponding to `CatalogTableType.VIEW`.

### Does this PR introduce any user-facing change?
No, this is a support function.

### How was this patch tested?
Add tests in VersionsSuite and manually run JDK11 test with following settings:

- Hive 2.3.6 Metastore on JDK8
- Hive 2.3.7-SNAPSHOT library build from source of Hive 2.3 branch
- Spark build with Hive 2.3.7-SNAPSHOT on jdk-11.0.6

Closes #27952 from Eric5553/GetTableByType.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 17:41:23 -07:00
yan ma fae981e5f3 [SPARK-30773][ML] Support NativeBlas for level-1 routines
### What changes were proposed in this pull request?
Change BLAS for part of level-1 routines(axpy, dot, scal(double, denseVector)) from java implementation to NativeBLAS when vector size>256

### Why are the changes needed?
In current ML BLAS.scala, all level-1 routines are fixed to use java
implementation. But NativeBLAS(intel MKL, OpenBLAS) can bring up to 11X
performance improvement based on performance test which apply direct
calls against these methods. We should provide a way to allow user take
advantage of NativeBLAS for level-1 routines. Here we do it through
switching to NativeBLAS for these methods from f2jBLAS.

### Does this PR introduce any user-facing change?
 Yes, methods axpy, dot, scal in level-1 routines will switch to NativeBLAS when it has more than nativeL1Threshold(fixed value 256) elements and will fallback to f2jBLAS if native BLAS is not properly configured in system.

### How was this patch tested?
Perf test direct calls level-1 routines

Closes #27546 from yma11/SPARK-30773.

Lead-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Ma Yan <yan.ma@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:32:58 -05:00
Huaxin Gao 9a990133f6 [SPARK-31138][ML] Add ANOVA Selector for continuous features and categorical labels
### What changes were proposed in this pull request?
Add ANOVA Selector

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features.

https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.

### Does this PR introduce any user-facing change?
Yes, add a new Selector.

### How was this patch tested?
add new test suites

Closes #27895 from huaxingao/anova.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:28:00 -05:00
Kent Yao 88ae6c4481 [SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document
### What changes were proposed in this pull request?

Fix errors and missing parts for datetime pattern document
1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own.
2. Some pattern letters are missing
3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N')
4. the second fraction pattern different logic for parsing and formatting

### Why are the changes needed?

fix and improve doc
### Does this PR introduce any user-facing change?

yes, new and updated doc
### How was this patch tested?

pass Jenkins
viewed locally with `jekyll serve`
![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png)

Closes #27956 from yaooqinn/SPARK-31189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 21:59:26 +08:00
Maxim Gekk b402bc900a [SPARK-31183][SQL][FOLLOWUP] Move rebase tests to AvroSuite and check the rebase flag out of function bodies
### What changes were proposed in this pull request?
1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`.
2. Checking of the `rebaseDateTime` flag is moved out from functions bodies.

### Why are the changes needed?
1. The tests are moved because they are not directly related to logical types.
2. Checking the flag out of functions bodies should improve performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running Avro tests via the command `build/sbt avro/test`

Closes #27964 from MaxGekk/rebase-avro-datetime-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-20 19:02:54 +09:00
Maxim Gekk 6a668763b8 [SPARK-31195][SQL] Correct and reuse days rebase functions of DateTimeUtils in DaysWritable
### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.

### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

### Does this PR introduce any user-facing change?
This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`.

### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.

Closes #27962 from MaxGekk/reuse-rebase-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-20 15:57:21 +09:00
Maxim Gekk 4766a36647 [SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via **Avro** datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+
```

### How was this patch tested?
1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")

scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> :paste
// Entering paste mode (ctrl-D to finish)

  val timestampSchema = s"""
    |  {
    |    "namespace": "logical",
    |    "type": "record",
    |    "name": "test",
    |    "fields": [
    |      {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
    |    ]
    |  }
    |""".stripMargin

// Exiting paste mode, now interpreting.
scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")

```

2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. :
  - `rebasing microseconds timestamps in write`
  - `rebasing milliseconds timestamps in write`
  - `rebasing dates in write`

Closes #27953 from MaxGekk/rebase-avro-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 13:57:49 +08:00
Dongjoon Hyun f1cc86792f [SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases
### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side.

The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.

### Why are the changes needed?

This makes our test case more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 12:28:57 +08:00
Takeshi Yamamuro ca499e9409
[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution
### What changes were proposed in this pull request?

This pr fixed code to respect a database name for broadcast table hint resolution.
Currently, spark ignores a database name in multi-part names;
```
scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
```

This PR comes from https://github.com/apache/spark/pull/22198

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #27935 from maropu/SPARK-25121-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 20:11:04 -07:00
Dongjoon Hyun c6a6d5e006
Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir"
This reverts commit 5bc0d76591.

Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 16:08:51 -07:00
Wenchen Fan ac262cb272 [SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26933

Fraction string like "1.23" is definitely not a valid integral format and we should fail to do the cast under the ANSI mode.

### Why are the changes needed?

correct the ANSI cast behavior from string to integral

### Does this PR introduce any user-facing change?

Yes under ANSI mode, but ANSI mode is off by default.

### How was this patch tested?

new test

Closes #27957 from cloud-fan/ansi.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-20 00:52:09 +09:00
Kris Mok a1776288f4 [SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId
### What changes were proposed in this pull request?

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or SQL `EXPLAIN CODEGEN` statement.

The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending.

After this change, the following query:
```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```
will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)

*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```

The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.

### Why are the changes needed?

Minor improvement to aid WSCG debugging.

### Does this PR introduce any user-facing change?

No user-facing change for end-users; minor change for developers who debug WSCG generated code.

### How was this patch tested?

Manually tested the output; all other tests still pass.

Closes #27955 from rednaxelafx/codegen.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 20:53:01 +09:00
Maxim Gekk bb295d80e3 [SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.parquet.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulted local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does reverse conversion.
2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in Proleptic Gregorian calendar, interprets the resulted date as a local date in the hybrid calendar and gets the number of days since the epoch from the resulted local date. The conversion is performed via the `UTC` time zone because the conversion is independent from time zones, and `UTC` is selected to void round issues of casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` functions does revers conversion.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on.
5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`.
6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`.
7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`.

### Why are the changes needed?
- For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
- It fixes the bug of incorrect saving/loading timestamps of the `INT96` type

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 11:32:20.123456|
+--------------------------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
```

### How was this patch tested?
1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date")

scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96")
```
2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes):
```bash
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Read the saved date/timestamp by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```

Closes #27915 from MaxGekk/rebase-parquet-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-19 12:49:51 +08:00
sarthfrey-db 6fd3138e9c [SPARK-30667][CORE] Change BarrierTaskContext allGather method return type
This PR proposes that we change the return type of the `BarrierTaskContext.allGather` method to `Array[String]` instead of `ArrayBuffer[String]` since it is immutable. Based on discussion in #27640. cc zhengruifeng srowen

Closes #27951 from sarthfrey/all-gather-api.

Authored-by: sarthfrey-db <sarth.frey@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-19 12:12:39 +09:00
Burak Yavuz 4237251861 [SPARK-31178][SQL] Prevent V2 exec nodes from executing multiple times
### What changes were proposed in this pull request?

This PR prevents the execution of V2 DataSource exec nodes multiple times when `collect()` is called on them. For V1 DataSources, commands would be executed as a RunnableCommand, which would cache the result as part of the `ExecutedCommandExec` node. We extend `V2CommandExec` for all the data writing commands so that they only get executed once as well.

### Why are the changes needed?

Calling `collect()` on a SQL command that inserts data or creates a table gets executed multiple times otherwise.

### Does this PR introduce any user-facing change?

Fixes a bug

### How was this patch tested?

Unit tests

Closes #27941 from brkyvz/doubleInsert.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-03-18 18:07:24 -07:00
Wenchen Fan 8643e5d9c5 [SPARK-31171][SQL][FOLLOWUP] update document
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/27936 to update document.

### Why are the changes needed?

correct document

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27950 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:29:31 +09:00
Kent Yao 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
### What changes were proposed in this pull request?

pattern `''` means literal `'`

```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
0946a9514f missed this case and this pr add it back.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #27949 from yaooqinn/SPARK-31150-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:27:06 +09:00
Huaxin Gao d22c9f6c0d [SPARK-30933][ML][DOCS] ML, GraphX 3.0 QA: Update user guide for new features & APIs
### What changes were proposed in this pull request?
Change ml-tuning.html.

### Why are the changes needed?
Add description for ```MultilabelClassificationEvaluator``` and ```RankingEvaluator```.

### Does this PR introduce any user-facing change?
Yes

before:
![image](https://user-images.githubusercontent.com/13592258/76437013-2c5ffb80-6376-11ea-8946-f5c2e7379b7c.png)

after:
![image](https://user-images.githubusercontent.com/13592258/76437054-397cea80-6376-11ea-867f-fe8d8fa4e5b3.png)

### How was this patch tested?

Closes #27880 from huaxingao/spark-30933.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-18 13:21:24 -05:00
yi.wu 8bfaa62f2f [SPARK-31175][SQL] Avoid creating reverse comparator for each compare in InterpretedOrdering
### What changes were proposed in this pull request?

Prpend `-` to the compare result instead of creating a new reverse comparator for each compare when sorting in DESC order in InterpretedOrdering.

### Why are the changes needed?

Currently, we'll create a new reverse comparator for each compare in InterpretedOrdering, which could generate lots of small and instant object and hurt JVM when there're plenty of data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27938 from Ngone51/reverse_comparator.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 23:56:48 +08:00
Kent Yao 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern charactar
### What changes were proposed in this pull request?

The meaning of 'u' was day number of the week in SimpleDateFormat, it was changed to year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' to 'e' internally and use DateTimeFormatter to parse the pattern string. In DateTimeFormatter, the 'e' and 'c' also represents day-of-week. e.g.

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all goes to `.... eeee` silently. The users may congitive problems of their meanings, so we should mark them as illegal pattern characters to stay the same as before.

This pr move the method `convertIncompatiblePattern` from `DatetimeUtils` to `DateTimeFormatterHelper` object, since it is quite specific for `DateTimeFormatterHelper` class.
And 'e' and 'c' char checking in this method.

Besides,`convertIncompatiblePattern` has a bug that will lose the last `'` if it ends with it, this pr fixes this too. e.g.

```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S

spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?

avoid vagueness
bug fix

### Does this PR introduce any user-facing change?

no, these are not  exposed yet

### How was this patch tested?

add ut

Closes #27939 from yaooqinn/SPARK-31176.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 20:19:50 +08:00
Kent Yao f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
### What changes were proposed in this pull request?

```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>
```
We now only support datetime values as extract source for `extract` expression but it's alternative function `date_part` supports both datetime and interval.

This pr adds interval value support for `extract` expression as extract source

### Why are the changes needed?

For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.

### Does this PR introduce any user-facing change?

yes, in the `extract(abc from xyz)` expression, the `xyz` can be intervals

### How was this patch tested?

add unit tests

Closes #27876 from yaooqinn/SPARK-31119.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 12:29:39 +08:00
Qianyang Yu 6f0b0f1655
[SPARK-30954][ML][R] Make file name the same as class name
This pr solved the same issue as [pr27919](https://github.com/apache/spark/pull/27919), but this one changes the file names based on comment from previous pr.

### What changes were proposed in this pull request?

Make some of  file names the same as class name in R package.

### Why are the changes needed?

Make the file consistence

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

run `./R/run-tests.sh`

Closes #27940 from kevinyu98/spark-30954-r-v2.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 16:15:02 -07:00
manuzhang 4e4e08f372
[SPARK-31047][SQL] Improve file listing for ViewFileSystem
### What changes were proposed in this pull request?
Use `listLocatedStatus` when `lnMemoryFileIndex` is listing files from a `ViewFileSystem` which should delegate to that of `DistributedFileSystem`.

### Why are the changes needed?
When `ViewFileSystem` is used to manage several `DistributedFileSystem`, the change will improve performance of file listing, especially when there are many files.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27801 from manuzhang/spark-31047.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 14:23:28 -07:00
Holden Karau 57d27e900f
[SPARK-31125][K8S] Terminating pods have a deletion timestamp but they are not yet dead
### What changes were proposed in this pull request?

Change what we consider a deleted pod to not include "Terminating"

### Why are the changes needed?

If we get a new snapshot while a pod is in the process of being cleaned up we shouldn't delete the executor until it is fully terminated.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This should be covered by the decommissioning tests in that they currently are flaky because we sometimes delete the executor instead of allowing it to decommission all the way.

I also ran this in a loop locally ~80 times with the only failures being the PV suite because of unrelated minikube mount issues.

Closes #27905 from holdenk/SPARK-31125-Processing-state-snapshots-incorrect.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 12:04:06 -07:00
Wenchen Fan dc5ebc2d5b
[SPARK-31171][SQL] size(null) should return null under ansi mode
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?

In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:48:54 -07:00
Adam Binford 9f27a5495d
[SPARK-30860][CORE] Use FileSystem.mkdirs to avoid umask at rolling event log folder and appStatusFile creation
### What changes were proposed in this pull request?
This pull request fixes an issue with rolling event logs. The rolling event log directory is created ignoring the dfs umask setting. This allows the history server to prune old rolling logs when run as the group owner of the event log folder.

### Why are the changes needed?
For non-rolling event logs, log files are created ignoring the umask setting by calling setPermission after creating the file. The default umask of 022 currently causes rolling log directories to be created without group write permissions, preventing the history server from pruning logs of applications not run as the same user as the history server. This adds the same behavior for rolling event logs so users don't need to worry about the umask setting causing different behavior.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually. The folder is created with the correct 770 permission. The status file is still affected by the umask setting, but that doesn't stop the folder from being deleted by the history server. I'm not sure if that causes any other issues. I'm not sure how to test something involving a Hadoop setting.

Closes #27764 from Kimahriman/bug/rolling-log-permissions.

Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:20:10 -07:00
Kent Yao 5bc0d76591 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
### What changes were proposed in this pull request?

In Spark CLI, we create a hive `CliSessionState` and it does not load the `hive-site.xml`. So the configurations in `hive-site.xml` will not take effects like other spark-hive integration apps.

Also, the warehouse directory is not correctly picked. If the `default` database does not exist, the `CliSessionState` will create one during the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehousr.dir` nor the user-specified value of `hive.metastore.warehourse.dir`, but the default value of `hive.metastore.warehourse.dir `which will always be `/user/hive/warehouse`.

### Why are the changes needed?

fix bug for Spark SQL cli to pick right confs

### Does this PR introduce any user-facing change?

yes, the non-exists default database will be created in the location specified by the users via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or the default value of `spark.sql.warehouse.dir` if none of them specified.

### How was this patch tested?

add cli ut

Closes #27933 from yaooqinn/SPARK-31170.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 23:03:18 +08:00
Kent Yao 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.

### Why are the changes needed?

improve timestamp parsing and more compatible with 2.4.x

### Does this PR introduce any user-facing change?

no, the related changes are newly added
### How was this patch tested?

add uts

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 21:53:46 +08:00
Takeshi Yamamuro 124b4ce2e6
[MINOR][SQL] Update the DataFrameWriter.bucketBy comment
### What changes were proposed in this pull request?

This PR intends to update the `DataFrameWriter.bucketBy` comment for clearly describing that the bucketBy scheme follows a Spark "specific" one.

I saw the questions about the current bucketing compatibility with Hive in [SPARK-31162](https://issues.apache.org/jira/browse/SPARK-31162?focusedCommentId=17060408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17060408) and [SPARK-17495](https://issues.apache.org/jira/browse/SPARK-17495?focusedCommentId=17059847&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17059847) from users and IMHO the comment is a bit confusing to users about the compatibility

### Why are the changes needed?

To make users understood smoothly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27930 from maropu/UpdateBucketByComment.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 00:52:45 -07:00