Commit graph

25771 commits

Author SHA1 Message Date
Maxim Gekk 3206a99870 [SPARK-29651][SQL] Fix parsing of interval seconds fraction
### What changes were proposed in this pull request?
In the PR, I propose to extract parsing of the seconds interval units to the private method `parseNanos` in `IntervalUtils` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the cases:
- When the fractional part has less than 9 digits
- The seconds unit is negative

### Why are the changes needed?
The changes are needed to fix the issues:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 microseconds
```
The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds`
```sql
spark-sql> select interval '-10.123456789 seconds';
interval -9 seconds -876 milliseconds -544 microseconds
```
but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds.

### Does this PR introduce any user-facing change?
Yes. After changes:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 milliseconds 456 microseconds
spark-sql> select interval '-10.123456789 seconds';
interval -10 seconds -123 milliseconds -456 microseconds
```

### How was this patch tested?
By existing and new tests in `ExpressionParserSuite`.

Closes #26313 from MaxGekk/fix-interval-nanos-parsing.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 09:20:46 +08:00
Ryan Blue cfc80d0eb1 [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown
### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes #25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-10-30 18:07:34 -07:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang 155a67d00c [SPARK-29666][BUILD] Fix the publish release failure under dry-run mode
### What changes were proposed in this pull request?

`release-build.sh` fail to publish release under dry run mode with the following error message:
```
/opt/spark-rm/release-build.sh: line 429: pushd: spark-repo-g4MBm/org/apache/spark: No such file or directory
```

We need to at least run the `mvn clean install` command once to create the `$tmp_repo` path, but now those steps are all skipped under dry-run mode. This PR fixes the issue.

### How was this patch tested?

Tested locally.

Closes #26329 from jiangxb1987/dryrun.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 14:57:51 -07:00
Xingbo Jiang fd6cfb1be3 [SPARK-29646][BUILD] Allow pyspark version name format ${versionNumber}-preview in release script
### What changes were proposed in this pull request?

Update `release-build.sh`, to allow pyspark version name format `${versionNumber}-preview`, otherwise the release script won't generate pyspark release tarballs.

### How was this patch tested?

Tested locally.

Closes #26306 from jiangxb1987/buildPython.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 14:51:50 -07:00
Dongjoon Hyun d417113c25 [SPARK-29668][DOCS] Deprecate Python 3 prior to version 3.6
### What changes were proposed in this pull request?

This PR aims to deprecate `Python 3.4 ~ 3.5`, which is prior to version 3.6 additionally.

### Why are the changes needed?

Since `Python 3.8` is already out, we will focus on to support Python 3.6/3.7/3.8.

### Does this PR introduce any user-facing change?

Yes. It's highly recommended to use Python 3.6/3.7. We will verify Python 3.8 before Apache Spark 3.0.0 release.

### How was this patch tested?

NA (This is a doc-only change).

Closes #26326 from dongjoon-hyun/SPARK-29668.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 12:31:23 -07:00
Takeshi Yamamuro 472940b2f4 [SPARK-29120][SQL][TESTS] Port create_view.sql
### What changes were proposed in this pull request?

This PR ports create_view.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/create_view.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/create_view.out

### Why are the changes needed?

To check behaviour differences between Spark and PostgreSQL

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins. And, Comparison with PgSQL results

Closes #26290 from maropu/SPARK-29120.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 09:07:38 -07:00
Kent Yao dc987f0c8b [SPARK-29653][SQL] Fix MICROS_PER_MONTH in IntervalUtils
### What changes were proposed in this pull request?

MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY

### Why are the changes needed?

fix bug

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #26321 from yaooqinn/SPARK-29653.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 08:09:22 -07:00
Jungtaek Lim (HeartSaVioR) 44a27bdccd [SPARK-29604][SQL] Force initialize SessionState before initializing HiveClient in SparkSQLEnv
### What changes were proposed in this pull request?

This patch fixes the issue that external listeners are not initialized properly when `spark.sql.hive.metastore.jars` is set to either "maven" or custom list of jar.
("builtin" is not a case here - all jars in Spark classloader are also available in separate classloader)

The culprit is lazy initialization (lazy val or passing builder function) & thread context classloader. HiveClient leverages IsolatedClientLoader to properly load Hive and relevant libraries without issue - to not mess up with Spark classpath it uses separate classloader with leveraging thread context classloader.

But there's a messed-up case - SessionState is being initialized while HiveClient changed the thread context classloader from Spark classloader to Hive isolated one, and streaming query listeners are loaded from changed classloader while initializing SessionState.

This patch forces initializing SessionState in SparkSQLEnv to avoid such case.

### Why are the changes needed?

ClassNotFoundException could occur in spark-sql with specific configuration, as explained above.

### Does this PR introduce any user-facing change?

No, as I don't think end users assume the classloader of external listeners is only containing jars for Hive client.

### How was this patch tested?

New UT added which fails on master branch and passes with the patch.

The error message with master branch when running UT:

```
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:147)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:137)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:59)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$2(SparkSQLEnvSuite.scala:44)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.withSystemProperties(SparkSQLEnvSuite.scala:61)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$1(SparkSQLEnvSuite.scala:43)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
	at org.scalatest.Suite.run(Suite.scala:1124)
	at org.scalatest.Suite.run$(Suite.scala:1106)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1349)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1343)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1343)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1033)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1011)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1509)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1011)
	at org.scalatest.tools.Runner$.run(Runner.scala:850)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:133)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:27)
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1054)
	at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:156)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
	at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:151)
	at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:105)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:105)
	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:164)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:127)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:300)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:421)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:314)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
	... 58 more
Caused by: java.lang.ClassNotFoundException: test.custom.listener.DummyQueryExecutionListener
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:206)
	at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2746)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2744)
	at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1(QueryExecutionListener.scala:83)
	at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1$adapted(QueryExecutionListener.scala:82)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$listenerManager$2(BaseSessionStateBuilder.scala:293)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:320)
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1051)
	... 80 more
```

Closes #26258 from HeartSaVioR/SPARK-29604.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 01:06:31 -07:00
uncleGen 5f1ef544f3 [MINOR][DOCS] Use proper html tag in markdown
### What changes were proposed in this pull request?
This PR fix and use proper html tag in docs

### Why are the changes needed?

Fix documentation format error.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26302 from uncleGen/minor-doc.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 15:30:58 +09:00
DylanGuedes 1bf65d97ac [SPARK-29110][SQL][TESTS] Port window.sql (Part 4)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql#L913-L1278

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?

To ensure compatibility with PostgreSQL.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins. And, Comparison with PgSQL results.

Closes #26238 from DylanGuedes/spark-29110.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 15:22:44 +09:00
HyukjinKwon 8682bb11ae [SPARK-29627][PYTHON][SQL] Allow array_contains to take column instances
### What changes were proposed in this pull request?

This PR proposes to allow `array_contains` to take column instances.

### Why are the changes needed?

For consistent support in Scala and Python APIs. Scala allows column instances at `array_contains`

Scala:

```scala
import org.apache.spark.sql.functions._
val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data")
df.select(array_contains($"data", lit("a"))).show()
```

Python:

```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```

However, PySpark sides does not allow.

### Does this PR introduce any user-facing change?

Yes.

```python
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
```

**Before:**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```

**After:**

```
+-----------------------+
|array_contains(data, a)|
+-----------------------+
|                   true|
|                  false|
+-----------------------+
```

### How was this patch tested?

Manually tested and added a doctest.

Closes #26288 from HyukjinKwon/SPARK-29627.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 09:45:19 +09:00
Kent Yao 8e667db5d8 [SPARK-29629][SQL] Support typed integer literal expression
### What changes were proposed in this pull request?

```
postgres=# select date '2001-09-28' + integer '7';
  ?column?
------------
 2001-10-05
(1 row)postgres=# select integer '7';
 int4
------
    7
(1 row)
```
Add support for typed integer literal expression from postgreSQL.

### Why are the changes needed?

SPARK-27764 Feature Parity between PostgreSQL and Spark

### Does this PR introduce any user-facing change?

support typed integer lit in SQL

### How was this patch tested?

add uts

Closes #26291 from yaooqinn/SPARK-29629.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-30 09:40:54 +09:00
ulysses 6958d7e629 [SPARK-28746][SQL] Add partitionby hint for sql queries
## What changes were proposed in this pull request?

Now, `RepartitionByExpression` is allowed at Dataset method `Dataset.repartition()`. But in spark sql,  we do not have an equivalent functionality.
In hive, we can use `distribute by`, so it's worth to add a hint to support such function.
Similar jira [SPARK-24940](https://issues.apache.org/jira/browse/SPARK-24940)

## Why are the changes needed?

Make repartition hints consistent with repartition api .

## Does this PR introduce any user-facing change?
This pr intends to support quries below;
```
// SQL cases
 - sql("SELECT /*+ REPARTITION(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION(1, c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t")
 - sql("SELECT /*+ REPARTITION_BY_RANGE(1, c) */ * FROM t")
```

## How was this patch tested?
UT

Closes #25464 from ulysses-you/SPARK-28746.

Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses <646303253@qq.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-30 08:25:34 +09:00
Gabor Somogyi 9c817a83fc [SPARK-29637][CORE] Add description to Job SHS web API
### Why are the changes needed?
Starting from Spark 2.3, the SHS REST API endpoint `/applications/<app_id>/jobs/` is not including `description` in the JobData returned. This is not the case until Spark 2.2.

In this PR I've added the mentioned field.

### Does this PR introduce any user-facing change?
Yes.

Old API response:
```
[ {
  "jobId" : 0,
  "name" : "foreach at <console>:26",
  "submissionTime" : "2019-10-28T12:41:54.301GMT",
  "completionTime" : "2019-10-28T12:41:54.731GMT",
  "stageIds" : [ 0 ],
  "jobGroup" : "test",
  "status" : "SUCCEEDED",
  "numTasks" : 1,
  "numActiveTasks" : 0,
  "numCompletedTasks" : 1,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numKilledTasks" : 0,
  "numCompletedIndices" : 1,
  "numActiveStages" : 0,
  "numCompletedStages" : 1,
  "numSkippedStages" : 0,
  "numFailedStages" : 0,
  "killedTasksSummary" : { }
} ]
```
New API response:
```
[ {
  "jobId" : 0,
  "name" : "foreach at <console>:26",
  "description" : "job",                            <= This is the addition here
  "submissionTime" : "2019-10-28T13:37:24.107GMT",
  "completionTime" : "2019-10-28T13:37:24.613GMT",
  "stageIds" : [ 0 ],
  "jobGroup" : "test",
  "status" : "SUCCEEDED",
  "numTasks" : 1,
  "numActiveTasks" : 0,
  "numCompletedTasks" : 1,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numKilledTasks" : 0,
  "numCompletedIndices" : 1,
  "numActiveStages" : 0,
  "numCompletedStages" : 1,
  "numSkippedStages" : 0,
  "numFailedStages" : 0,
  "killedTasksSummary" : { }
} ]
```

### How was this patch tested?
Extended + existing unit tests.

Manually:
* Open spark-shell
```
scala> sc.setJobGroup("test", "job", false);
scala> val foo = sc.textFile("/user/foo.txt");
foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> foo.foreach(println);
```
* Access REST API `http://SHS-host:port/api/v1/applications/<app-id>/jobs/`

Closes #26295 from gaborgsomogyi/SPARK-29637.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-29 12:00:52 -07:00
Maxim Gekk 44c1c03924 [SPARK-29607][SQL] Move static methods from CalendarInterval to IntervalUtils
### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see https://github.com/apache/spark/pull/26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see https://github.com/apache/spark/pull/25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-30 01:15:18 +08:00
Huaxin Gao 37690dea10 [SPARK-29565][ML][PYTHON] OneHotEncoder should support single-column input/output
### What changes were proposed in this pull request?
add single-column input/ouput support in OneHotEncoder

### Why are the changes needed?
Currently, OneHotEncoder only has multi columns support.  It makes sense to support single column as well.

### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```

### How was this patch tested?
Unit test

Closes #26265 from huaxingao/spark-29565.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-28 23:20:21 -07:00
Huaxin Gao e92b75482f [SPARK-29612][SQL] ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableRecoverPartitionsStatement and make ALTER TABLE ... RECOVER PARTITIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t RECOVER PARTITIONS  // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RECOVER PARTITIONS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26269 from huaxingao/spark-29612.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 13:54:07 +08:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Jungtaek Lim (HeartSaVioR) fb80dfee70 [SPARK-28158][SQL][FOLLOWUP] HiveUserDefinedTypeSuite: don't use RandomDataGenerator to create row for UDT backed by ArrayType
### What changes were proposed in this pull request?

There're some issues observed in `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"`:

1) Neither function (TestUDF) nor test take "nullable" point column into account.
2) ExamplePointUDT. sqlType is ArrayType which doesn't provide information how many elements are expected. RandomDataGenerator may provide less elements than needed.

This patch fixes `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"` to change the type of "point" column to be non-nullable, as well as not use RandomDataGenerator to create row for UDT backed by ArrayType.

### Why are the changes needed?

CI builds are failing in high occurrences.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested by running tests locally multiple times.

Closes #26287 from HeartSaVioR/SPARK-28158-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 11:57:25 +08:00
Huaxin Gao c137acbf65 [SPARK-29566][ML] Imputer should support single-column input/output
### What changes were proposed in this pull request?
add single-column input/output support in Imputer

### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.

### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```

### How was this patch tested?
add unit tests

Closes #26247 from huaxingao/spark-29566.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-29 11:11:41 +08:00
Terry Kim 59db1f617a [SPARK-29609][SQL] DataSourceV2: Support DROP NAMESPACE
### What changes were proposed in this pull request?

This PR adds `DROP NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot drop namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
DROP NAMESPACE mycatalog.ns
SHOW NAMESPACES IN mycatalog # Will show no namespaces
```
to drop a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26262 from imback82/drop_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-28 15:00:22 -07:00
Jungtaek Lim (HeartSaVioR) 762db39c15 [SPARK-29509][SQL][SS] Deduplicate codes from Kafka data source
### What changes were proposed in this pull request?

This patch deduplicates code blocks in Kafka data source which are being repeated multiple times in a method.

### Why are the changes needed?

This change would simplify the code and open possibility to simplify future code whenever fields are added to Kafka writer schema.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26158 from HeartSaVioR/MINOR-deduplicate-kafka-source.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-10-28 11:14:18 -07:00
Liang-Chi Hsieh 2be1fe6abc [SPARK-29521][SQL] LOAD DATA INTO TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add LoadDataStatement and make LOAD DATA INTO TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
LOAD DATA INPATH 'filepath'  INTO TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

yes. When running LOAD DATA INTO TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26178 from viirya/SPARK-29521.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-29 00:41:20 +08:00
angerszhu d6e33dc377 [SPARK-29599][WEBUI] Support pagination for session table in JDBC/ODBC Tab
### What changes were proposed in this pull request?

In this PR, extend the support of pagination to session  table in `JDBC/PDBC` .

### Why are the changes needed?
Some times we may connect a lot client and there a many session info shown in session tab.
make it can be paged for better view.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manuel verify.
After pr:
<img width="1440" alt="Screen Shot 2019-10-25 at 4 19 27 PM" src="https://user-images.githubusercontent.com/46485123/67555133-50ae9900-f743-11e9-8724-9624a691f232.png">

<img width="1434" alt="Screen Shot 2019-10-25 at 4 19 38 PM" src="https://user-images.githubusercontent.com/46485123/67555165-5906d400-f743-11e9-819e-73f86a333dd3.png">

Closes #26253 from AngersZhuuuu/SPARK-29599.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-28 08:45:21 -05:00
uncleGen 0182817ea3 [SPARK-28158][SQL] Hive UDFs supports UDT type
## What changes were proposed in this pull request?

After this PR, we can create and register Hive UDFs to accept UDT type, like `VectorUDT` and `MatrixUDT`. These UDTs are widely used in Spark machine learning.

## How was this patch tested?

add new ut

Closes #24961 from uncleGen/SPARK-28158.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-28 20:50:34 +09:00
HyukjinKwon a8d5134981 Revert "[SPARK-29528][BUILD][TEST-MAVEN] Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1"
This reverts commit 5fc363b307.
2019-10-28 20:46:28 +09:00
jiake 50cf48489a [SPARK-28560][SQL][FOLLOWUP] change the local shuffle reader from leaf node to unary node
### What changes were proposed in this pull request?

### Why are the changes needed?
When make the `LocalShuffleReaderExec` to leaf node, there exists a potential issue: the leaf node will hide the running query stage and make the unfinished query stage as finished query stage when creating its parent query stage.
This PR make the leaf node to unary node.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #26250 from JkSelf/updateLeafNodeofLocalReaderToUnaryExecNode.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-28 14:23:53 +08:00
Takeshi Yamamuro e6c56937e2 [SPARK-29008][SQL][FOLLOW-UP] Checks correct param lengths of split methods
### What changes were proposed in this pull request?

This pr is to fix wrong code to check parameter lengths of split methods in `subexpressionEliminationForWholeStageCodegen`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26267 from maropu/SPARK-29008-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 21:45:56 -07:00
Dongjoon Hyun ba9d1610b6 [SPARK-29617][BUILD] Upgrade to ORC 1.5.7
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache ORC 1.5.7.

### Why are the changes needed?

This will bring the latest bug fixes. The following is the full release note.
- https://issues.apache.org/jira/projects/ORC/versions/12345702

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26276 from dongjoon-hyun/SPARK-29617.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 21:11:17 -07:00
Huaxin Gao b19fd487df [SPARK-29093][PYTHON][ML] Remove automatically generated param setters in _shared_params_code_gen.py
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py

### Why are the changes needed?
To keep parity between scala and python

### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels

### How was this patch tested?
unit tests

Closes #26232 from huaxingao/spark-29093.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-28 11:36:10 +08:00
Dongjoon Hyun a43b966f00 [SPARK-29613][BUILD][SS] Upgrade to Kafka 2.3.1
### What changes were proposed in this pull request?

This PR aims to upgrade to Kafka 2.3.1 client library for client fixes like KAFKA-8950, KAFKA-8570, and KAFKA-8635. The following is the full release note.
- https://archive.apache.org/dist/kafka/2.3.1/RELEASE_NOTES.html

### Why are the changes needed?

- [KAFKA-8950 KafkaConsumer stops fetching](https://issues.apache.org/jira/browse/KAFKA-8950)
- [KAFKA-8570 Downconversion could fail when log contains out of order message formats](https://issues.apache.org/jira/browse/KAFKA-8570)
- [KAFKA-8635 Unnecessary wait when looking up coordinator before transactional request](https://issues.apache.org/jira/browse/KAFKA-8635)

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #26271 from dongjoon-hyun/SPARK-29613.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 16:15:54 -07:00
Maxim Gekk 74514b46e5 [SPARK-29614][SQL][TEST] Fix failures of DateTimeUtilsSuite and TimestampFormatterSuite
### What changes were proposed in this pull request?
The `DateTimeUtilsSuite` and `TimestampFormatterSuite` assume constant time difference between `timestamp'yesterday'`, `timestamp'today'` and `timestamp'tomorrow'` which is wrong on daylight switching day - day length can be 23 or 25 hours. In the PR, I propose to use Java 8 time API to calculate instances of `yesterday` and `tomorrow` timestamps.

### Why are the changes needed?
The changes fix test failures and make the tests tolerant to daylight time switching.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `TimestampFormatterSuite`.

Closes #26273 from MaxGekk/midnight-tolerant.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-27 13:48:43 -07:00
shahid 077fb99a26 [SPARK-29589][WEBUI] Support pagination for sqlstats session table in JDBC/ODBC Session page
### What changes were proposed in this pull request?
In the PR https://github.com/apache/spark/pull/26215, we supported pagination for sqlstats table in JDBC/ODBC server page. In this PR, we are extending the support of pagination to sqlstats session table by making use of existing pagination classes in https://github.com/apache/spark/pull/26215.

### Why are the changes needed?
Support pagination for sqlsessionstats table in JDBC/ODBC server page in the WEBUI. It will easier for user to analyse the table and it may fix the potential issues like oom while loading the page, that may occur similar to the SQL page (refer #22645)

### Does this PR introduce any user-facing change?
There will be no change in the sqlsessionstats table in JDBC/ODBC server page execpt pagination support.

### How was this patch tested?
Manually verified.

Before:

![Screenshot 2019-10-24 at 11 32 27 PM](https://user-images.githubusercontent.com/23054875/67512507-96715000-f6b6-11e9-9f1f-ab1877eb24e6.png)

After:

![Screenshot 2019-10-24 at 10 58 53 PM](https://user-images.githubusercontent.com/23054875/67512314-295dba80-f6b6-11e9-9e3e-dd50c6e62fe9.png)

Closes #26246 from shahidki31/SPARK_29589.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-26 15:46:24 -05:00
rongma1997 2115bf6146 [SPARK-29490][SQL] Reset 'WritableColumnVector' in 'RowToColumnarExec'
### What changes were proposed in this pull request?
Reset the `WritableColumnVector` when getting "next" ColumnarBatch in `RowToColumnarExec`
### Why are the changes needed?
When converting `Iterator[InternalRow]` to `Iterator[ColumnarBatch]`, the vectors used to create a new `ColumnarBatch` should be reset in the iterator's "next()" method.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A

Closes #26137 from rongma1997/reset-WritableColumnVector.

Authored-by: rongma1997 <rong.ma@intel.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 23:11:01 -07:00
Kent Yao 9a46702791 [SPARK-29554][SQL] Add version SQL function
### What changes were proposed in this pull request?

```
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 2.113 seconds, Fetched: 1 row(s)
```

### Why are the changes needed?

From hive behavior and I guess it is useful for debugging and developing etc.

### Does this PR introduce any user-facing change?

add a misc func

### How was this patch tested?

add ut

Closes #26209 from yaooqinn/SPARK-29554.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 23:02:11 -07:00
Dongjoon Hyun 5bdc58bf8a [SPARK-27653][SQL][FOLLOWUP] Fix since version of min_by/max_by
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/24557 to fix `since` version.

### Why are the changes needed?

This is found during 3.0.0-preview preparation.
The version will be exposed to our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed at `DESC FUNCTION EXTENDED` SQL command and  SQL doc, but this is new at 3.0.0.

### How was this patch tested?

Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

    Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 21:17:17 -07:00
Gabor Somogyi 25493919f8 [SPARK-29580][TESTS] Add kerberos debug messages for Kafka secure tests
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. After had a look at the logs it just shows the following fact without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and not able to reproduce it we should add more debug information and wait for reproduction with the extended logs.

### Why are the changes needed?
Failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes #26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 14:11:35 -07:00
Dongjoon Hyun 2baf7a1d8f [SPARK-29608][BUILD] Add hadoop-3.2 profile to release build
### What changes were proposed in this pull request?

This PR aims to add `hadoop-3.2` profile to pre-built binary package releases.

### Why are the changes needed?

Since Apache Spark 3.0.0, we provides Hadoop 3.2 pre-built binary.

### Does this PR introduce any user-facing change?

No. (Although the artifacts are available, this change is for release managers).

### How was this patch tested?

Manual. Please note that `DRY_RUN=0` disables these combination.
```
$ dev/create-release/release-build.sh package
...
Packages to build: without-hadoop hadoop3.2 hadoop2.7
make_binary_release without-hadoop -Pscala-2.12 -Phadoop-provided  2.12
make_binary_release hadoop3.2 -Pscala-2.12 -Phadoop-3.2 -Phive -Phive-thriftserver  2.12
make_binary_release hadoop2.7 -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver withpip,withr 2.12
```

Closes #26260 from dongjoon-hyun/SPARK-29608.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 13:57:26 -07:00
Liang-Chi Hsieh ae5b60da32 [SPARK-29182][CORE][FOLLOWUP] Cache preferred locations of checkpointed RDD
### What changes were proposed in this pull request?

This is a followup to #25856. This fixes the document about the config value of spark.rdd.checkpoint.cachePreferredLocsExpireTime.

### Why are the changes needed?

The document is not correct. spark.rdd.checkpoint.cachePreferredLocsExpireTime can not be 0.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This is document only change.

Closes #26251 from viirya/SPARK-29182-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 11:13:06 -07:00
Liang-Chi Hsieh 68dca9a095 [SPARK-29527][SQL] SHOW CREATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add ShowCreateTableStatement and make SHOW CREATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW CREATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

yes. When running SHOW CREATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26184 from viirya/SPARK-29527.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-25 23:09:08 +08:00
Kent Yao 0cf4f07c66 [SPARK-29545][SQL] Add support for bit_xor aggregate function
### What changes were proposed in this pull request?

bit_xor(expr) - Returns the bitwise XOR of all non-null input values, or null if none

### Why are the changes needed?

As we support `bit_and`, `bit_or` now, we'd better support the related aggregate function **bit_xor** ahead of postgreSQL, because many other popular databases support it.

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbreference/bit-xor-function.html

https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_bit-or

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/BIT_XOR.htm?TocPath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____10

### Does this PR introduce any user-facing change?

add a new bit agg
### How was this patch tested?

UTs added

Closes #26205 from yaooqinn/SPARK-29545.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-25 22:19:19 +09:00
redsk 8bd8f492ea [SPARK-29500][SQL][SS] Support partition column when writing to Kafka
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-29500

`KafkaRowWriter` now supports setting the Kafka partition by reading a "partition" column in the input dataframe.

Code changes in commit nr. 1.
Test changes in commit nr. 2.
Doc changes in commit nr. 3.

tcondie dongjinleekr srowen

### Why are the changes needed?
While it is possible to configure a custom Kafka Partitioner with
`.option("kafka.partitioner.class", "my.custom.Partitioner")`, this is not enough for certain use cases. See the Jira issue.

### Does this PR introduce any user-facing change?
No, as this behaviour is optional.

### How was this patch tested?
Two new UT were added and one was updated.

Closes #26153 from redsk/feature/SPARK-29500.

Authored-by: redsk <nicola.bova@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-25 08:06:36 -05:00
Jungtaek Lim (HeartSaVioR) cfbdd9d293 [SPARK-29461][SQL] Measure the number of records being updated for JDBC writer
### What changes were proposed in this pull request?

This patch adds the functionality to measure records being written for JDBC writer. In reality, the value is meant to be a number of records being updated from queries, as per JDBC spec it will return updated count.

### Why are the changes needed?

Output metrics for JDBC writer are missing now. The value of "bytesWritten" is also missing, but we can't measure it from JDBC API.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test added.

Closes #26109 from HeartSaVioR/SPARK-29461.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-25 16:32:06 +09:00
zhengruifeng 091cbc3be0 [SPARK-9612][ML] Add instance weight support for GBTs
### What changes were proposed in this pull request?
add weight support for GBTs by sampling data before passing it to trees and then passing weights to trees

in summary:
1, add setters of `minWeightFractionPerNode` & `weightCol`
2, update input types in private methods from `RDD[LabeledPoint]` to `RDD[Instance]`:
`DecisionTreeRegressor.train`, `GradientBoostedTrees.run`, `GradientBoostedTrees.runWithValidation`, `GradientBoostedTrees.computeInitialPredictionAndError`, `GradientBoostedTrees.computeError`,
`GradientBoostedTrees.evaluateEachIteration`, `GradientBoostedTrees.boost`, `GradientBoostedTrees.updatePredictionError`
3, add new private method `GradientBoostedTrees.computeError(data, predError)` to compute average error, since original `predError.values.mean()` do not take weights into account.
4, add new tests

### Why are the changes needed?
GBTs should support sample weights like other algs

### Does this PR introduce any user-facing change?
yes, new setters are added

### How was this patch tested?
existing & added testsuites

Closes #25926 from zhengruifeng/gbt_add_weight.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-25 13:48:09 +08:00
Marcelo Vanzin 1474ed05fb [SPARK-29562][SQL] Speed up and slim down metric aggregation in SQL listener
First, a bit of background on the code being changed. The current code tracks
metric updates for each task, recording which metrics the task is monitoring
and the last update value.

Once a SQL execution finishes, then the metrics for all the stages are
aggregated, by building a list with all (metric ID, value) pairs collected
for all tasks in the stages related to the execution, then grouping by metric
ID, and then calculating the values shown in the UI.

That is full of inefficiencies:

- in normal operation, all tasks will be tracking and updating the same
  metrics. So recording the metric IDs per task is wasteful.
- tracking by task means we might be double-counting values if you have
  speculative tasks (as a comment in the code mentions).
- creating a list of (metric ID, value) is extremely inefficient, because now
  you have a huge map in memory storing boxed versions of the metric IDs and
  values.
- same thing for the aggregation part, where now a Seq is built with the values
  for each metric ID.

The end result is that for large queries, this code can become both really
slow, thus affecting the processing of events, and memory hungry.

The updated code changes the approach to the following:

- stages track metrics by their ID; this means the stage tracking code
  naturally groups values, making aggregation later simpler.
- each metric ID being tracked uses a long array matching the number of
  partitions of the stage; this means that it's cheap to update the value of
  the metric once a task ends.
- when aggregating, custom code just concatenates the arrays corresponding to
  the matching metric IDs; this is cheaper than the previous, boxing-heavy
  approach.

The end result is that the listener uses about half as much memory as before
for tracking metrics, since it doesn't need to track metric IDs per task.

I captured heap dumps with the old and the new code during metric aggregation
in the listener, for an execution with 3 stages, 100k tasks per stage, 50
metrics updated per task. The dumps contained just reachable memory - so data
kept by the listener plus the variables in the aggregateMetrics() method.

With the old code, the thread doing aggregation references >1G of memory - and
that does not include temporary data created by the "groupBy" transformation
(for which the intermediate state is not referenced in the aggregation method).
The same thread with the new code references ~250M of memory. The old code uses
about ~250M to track all the metric values for that execution, while the new
code uses about ~130M. (Note the per-thread numbers include the amount used to
track the metrics - so, e.g., in the old case, aggregation was referencing
about ~750M of temporary data.)

I'm also including a small benchmark (based on the Benchmark class) so that we
can measure how much changes to this code affect performance. The benchmark
contains some extra code to measure things the normal Benchmark class does not,
given that the code under test does not really map that well to the
expectations of that class.

Running with the old code (I removed results that don't make much
sense for this benchmark):

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU  2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage):  Best Time(ms)   Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s)                                                  2113           2118
[info] 2 stage(s)                                                  4172           4392
[info] 3 stage(s)                                                  7755           8460
[info]
[info] Stage Count    Stage Proc. Time    Aggreg. Time
[info]      1              614                1187
[info]      2              620                2480
[info]      3              718                5069
```

With the new code:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU  2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage):  Best Time(ms)   Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s)                                                   727            886
[info] 2 stage(s)                                                  1722           1983
[info] 3 stage(s)                                                  2752           3013
[info]
[info] Stage Count    Stage Proc. Time    Aggreg. Time
[info]      1              408                177
[info]      2              389                423
[info]      3              372                660

```

So the new code is faster than the old when processing task events, and about
an order of maginute faster when aggregating metrics.

Note this still leaves room for improvement; for example, using the above
measurements, 600ms is still a huge amount of time to spend in an event
handler. But I'll leave further enhancements for a separate change.

Tested with benchmarking code + existing unit tests.

Closes #26218 from vanzin/SPARK-29562.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 22:18:10 -07:00
Dongjoon Hyun 7417c3e7d5 [SPARK-29597][DOCS] Deprecate old Java 8 versions prior to 8u92
### What changes were proposed in this pull request?

This PR aims to deprecate old Java 8 versions prior to 8u92.

### Why are the changes needed?

This is a preparation to use JVM Option `ExitOnOutOfMemoryError`.
- https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html

### Does this PR introduce any user-facing change?

Yes. It's highly recommended for users to use the latest JDK versions of Java 8/11.

### How was this patch tested?

NA (This is a doc change).

Closes #26249 from dongjoon-hyun/SPARK-29597.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 20:51:31 -07:00
wenxuanguan 40df9d246e [SPARK-29227][SS] Track rule info in optimization phase
### What changes were proposed in this pull request?

Track timing info for each rule in optimization phase using `QueryPlanningTracker` in Structured Streaming

### Why are the changes needed?

In Structured Streaming we only track rule info in analysis phase, not in optimization phase.

### Does this PR introduce any user-facing change?

No

Closes #25914 from wenxuanguan/spark-29227.

Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-25 10:02:54 +09:00