Commit graph

25107 commits

Author SHA1 Message Date
Huaxin Gao b85a554487 [SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold
### What changes were proposed in this pull request?
Change "Related Statements" to bold

### Why are the changes needed?
To make doc look nice and consistent.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tested using jykyll build --serve

Before the change:
![image](https://user-images.githubusercontent.com/13592258/63965303-ae797a00-ca4d-11e9-8a85-71fbfdeaaccb.png)

After the change:
![image](https://user-images.githubusercontent.com/13592258/63965316-b76a4b80-ca4d-11e9-9a85-48d7a909f0ef.png)

Before the change:
![image](https://user-images.githubusercontent.com/13592258/63988989-7c8b0680-ca93-11e9-9352-a9ec5457b279.png)

After the change:
![image](https://user-images.githubusercontent.com/13592258/63988996-87459b80-ca93-11e9-9e51-8cb36a632436.png)

Closes #25623 from huaxingao/spark-28786-n.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-31 14:58:41 -07:00
Dilip Biswal b4d7b30aa6 [SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference
### What changes were proposed in this pull request?
Document DESCRIBE TABLE statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="1234" alt="Screen Shot 2019-08-31 at 1 53 35 PM" src="https://user-images.githubusercontent.com/14225158/64069071-f556a380-cbf6-11e9-985d-13dd37a32bbb.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 53 50 PM" src="https://user-images.githubusercontent.com/14225158/64069073-f982c100-cbf6-11e9-925b-eb2fc85c3341.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 02 PM" src="https://user-images.githubusercontent.com/14225158/64069076-0ef7eb00-cbf7-11e9-8062-9a9fb8700bb3.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 15 PM" src="https://user-images.githubusercontent.com/14225158/64069077-0f908180-cbf7-11e9-9a31-9b7f122db2d3.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 30 PM" src="https://user-images.githubusercontent.com/14225158/64069078-0f908180-cbf7-11e9-96ee-438a7b64c961.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 42 PM" src="https://user-images.githubusercontent.com/14225158/64069079-0f908180-cbf7-11e9-9bae-734a1994f936.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #25527 from dilipbiswal/ref-doc-desc-table.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-31 14:46:55 -07:00
Unknown d573e4c482 [SPARK-28542][DOCS][WEBUI] Stages Tab
### What changes were proposed in this pull request?
New documentation to explain in detail Web UI Stages page. New images are included to better explanation.
![image](https://user-images.githubusercontent.com/12819544/63807320-c05bff80-c91d-11e9-986f-e09d0b8d4bbb.png)
![image](https://user-images.githubusercontent.com/12819544/63807343-cd78ee80-c91d-11e9-9e4a-2cef3ff70577.png)
![image](https://user-images.githubusercontent.com/12819544/63807363-d9fd4700-c91d-11e9-9691-1d39b0e2c69e.png)
![image](https://user-images.githubusercontent.com/12819544/63807384-e41f4580-c91d-11e9-92bd-cb01aced3752.png)

### Does this PR introduce any user-facing change?
Only documentation

### How was this patch tested?
I have generated it using "jekyll build" to ensure that it's ok

Closes #25598 from planga82/feature/SPARK-28542_ImproveWebUIStagesPage.

Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-31 13:33:44 -05:00
Dongjoon Hyun 1f96ce5443 [SPARK-28932][BUILD] Add scala-library test dependency to network-common module for JDK11
### What changes were proposed in this pull request?

This PR adds `scala-library` test dependency to `network-common` module for JDK11.

### Why are the changes needed?

In JDK11, the following command fails due to scala library.
```
mvn clean install -pl common/network-common -DskipTests
```

**BEFORE**
```
...
error: fatal error: object scala in compiler mirror not found.
one error found
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

**AFTER**
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual. On JDK11, do the following.
```
mvn clean install -pl common/network-common -DskipTests
```

Closes #25638 from dongjoon-hyun/SPARK-28932.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-31 10:59:20 -07:00
Sean Owen d5b7eed12f [SPARK-28903][STREAMING][PYSPARK][TESTS] Fix AWS JDK version conflict that breaks Pyspark Kinesis tests
The Pyspark Kinesis tests are failing, at least in master:

```
======================================================================
ERROR: test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/streaming/tests/test_kinesis.py", line 44, in test_kinesis_stream
    kinesisTestUtils = self.ssc._jvm.org.apache.spark.streaming.kinesis.KinesisTestUtils(2)
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.apache.spark.streaming.kinesis.KinesisTestUtils.
: java.lang.NoSuchMethodError: com.amazonaws.regions.Region.getAvailableEndpoints()Ljava/util/Collection;
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1(KinesisTestUtils.scala:211)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1$adapted(KinesisTestUtils.scala:211)
	at scala.collection.Iterator.find(Iterator.scala:993)
	at scala.collection.Iterator.find$(Iterator.scala:990)
	at scala.collection.AbstractIterator.find(Iterator.scala:1429)
	at scala.collection.IterableLike.find(IterableLike.scala:81)
	at scala.collection.IterableLike.find$(IterableLike.scala:80)
	at scala.collection.AbstractIterable.find(Iterable.scala:56)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getRegionNameByEndpoint(KinesisTestUtils.scala:211)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils.<init>(KinesisTestUtils.scala:46)
...
```

The non-Python Kinesis tests are fine though. It turns out that this is because Pyspark tests use the output of the Spark assembly, and it pulls in `hadoop-cloud`, which in turn pulls in an old AWS Java SDK.

Per Steve Loughran (below), it seems like we can just resolve this by excluding the aws-java-sdk dependency. See the attached PR for some more detail about the debugging and other options.

See https://github.com/apache/spark/pull/25558#issuecomment-524042709

Closes #25559 from srowen/KinesisTest.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-31 10:29:46 -05:00
Dilip Biswal a08f33be68 [SPARK-28804][DOCS][SQL] Document DESCRIBE QUERY in SQL Reference
### What changes were proposed in this pull request?
Document DESCRIBE QUERY statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="1234" alt="Screen Shot 2019-08-29 at 5 47 51 PM" src="https://user-images.githubusercontent.com/14225158/63985609-43e43080-ca85-11e9-8a1a-c9c15d988e24.png">
<img width="1234" alt="Screen Shot 2019-08-29 at 5 48 06 PM" src="https://user-images.githubusercontent.com/14225158/63985610-46468a80-ca85-11e9-882a-7163784f72c6.png">
<img width="1234" alt="Screen Shot 2019-08-29 at 5 48 18 PM" src="https://user-images.githubusercontent.com/14225158/63985617-49da1180-ca85-11e9-9e77-a6d6c7042a85.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #25529 from dilipbiswal/ref-doc-desc-query.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-30 16:05:16 -07:00
HyukjinKwon 7cc0f0e9a7 [SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results
### What changes were proposed in this pull request?

See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/

![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
    ...
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
    </testcase>
    <system-out/>
    <system-err/>
</testsuite>
```

Root cause seems a bug in SBT - it truncates the test name based on the last dot.

https://github.com/sbt/sbt/issues/2949
https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79

I tried to find a better way but couldn't find. Therefore, this PR proposes a workaround by appending the test file name into the assert log:

```diff
  [info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
+ [info]   inner-join.sql
  [info]   Expected "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[]", but got "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[	b]" Result did not match for query #6
  [info]   SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
```

It will at least prevent us to search full logs to identify which test file is failed by clicking filed test.

Note that this PR does not fully fix the issue but only fix the logs on its failed tests.

### Why are the changes needed?
To debug Jenkins logs easier. Otherwise, we should open full logs and search which test was failed.

### Does this PR introduce any user-facing change?
It will print out the file name of failed tests in Jenkins' test reports.

### How was this patch tested?
Manually tested but Jenkins tests are required in this PR.

Now it at least shows which file it is:

![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)

Closes #25630 from HyukjinKwon/SPARK-28894-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 15:10:40 -07:00
younggyu chun 3b07a4eb28 [SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type
## What changes were proposed in this pull request?
This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases.

https://www.postgresql.org/docs/devel/datatype-boolean.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html

## How was this patch tested?
Added new tests to CastSuite.

Closes #25458 from younggyuchun/SPARK-27931.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 14:18:13 -07:00
mcheah ea90ea6ce7 [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter
## What changes were proposed in this pull request?

Use the shuffle writer APIs introduced in SPARK-28209 in the sort shuffle writer.

## How was this patch tested?

Existing unit tests were changed to use the plugin instead, and they used the local disk version to ensure that there were no regressions.

Closes #25342 from mccheah/shuffle-writer-refactor-sort-shuffle-writer.

Lead-authored-by: mcheah <mcheah@palantir.com>
Co-authored-by: mccheah <mcheah@palantir.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-30 09:43:07 -07:00
HyukjinKwon 92cabf6306 [SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.2.0 and fix build profile on AppVeyor
### What changes were proposed in this pull request?

This PR proposes to upgrade scala-maven-plugin from 3.4.4 to 4.2.0.

Upgrade to 4.1.1 was reverted due to unexpected build failure on AppVeyor.

The root cause seems to be an issue specific to AppVeyor - loading the system library 'kernel32.dll' seems being failed.

```
Suppressed: java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.platform.win32.Kernel32
        at sbt.internal.io.WinMilli$.getHandle(Milli.scala:264)
        at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:289)
        at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:260)
        at sbt.internal.io.MilliNative.getModifiedTime(Milli.scala:61)
        at sbt.internal.io.Milli$.getModifiedTime(Milli.scala:360)
        at sbt.io.IO$.$anonfun$getModifiedTimeOrZero$1(IO.scala:1373)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at sbt.internal.io.Retry$.liftedTree2$1(Retry.scala:38)
        at sbt.internal.io.Retry$.impl$1(Retry.scala:38)
        at sbt.internal.io.Retry$.apply(Retry.scala:52)
        at sbt.internal.io.Retry$.apply(Retry.scala:24)
        at sbt.io.IO$.getModifiedTimeOrZero(IO.scala:1373)
        at sbt.internal.inc.caching.ClasspathCache$.fromCacheOrHash$1(ClasspathCache.scala:44)
        at sbt.internal.inc.caching.ClasspathCache$.$anonfun$hashClasspath$1(ClasspathCache.scala:53)
        at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:659)
        at scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:53)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:67)
        at scala.collection.parallel.Task.tryLeaf(Tasks.scala:56)
        at scala.collection.parallel.Task.tryLeaf$(Tasks.scala:50)
        at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
        at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal(Tasks.scala:170)
        ... 25 more
```

By setting `-Djna.nosys=true`, it directly loads the library from the jar instead of system's.

In this way, the build seems working fine.

### Why are the changes needed?

It upgrades the plugin to fix bugs and fixes the CI build.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It was tested at https://github.com/apache/spark/pull/25497

Closes #25633 from HyukjinKwon/SPARK-28759.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 09:39:15 -07:00
Liang-Chi Hsieh 2bd02e2b41 [SPARK-28866][ML] Persist item factors RDD when checkpointing in ALS
### What changes were proposed in this pull request?

In ALS ML implementation, for non-implicit case, we checkpoint the RDD of item factors, between intervals. Before checkpointing (.checkpoint()) and materializing (.count()) RDD, this RDD was not persisted. It causes recomputation. In an experiment, there is performance difference between persisting and no persisting before checkpointing the RDD.

The performance difference is not big, but this change is not big too. The actual performance difference varies depending the interval of checkpoint, training dataset, etc.

### Why are the changes needed?

Persisting the RDD before checkpointing the RDD of item factors can avoid recomputation.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manual check RDD recomputation or not.

Taking 30% MovieLens 20M Dataset as training dataset. Setting checkpoint dir for SparkContext. Fitting an ALS model like:

```scala
val als = new ALS()
      .setMaxIter(100)
      .setCheckpointInterval(5)
      .setRegParam(0.01)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

val t0 = System.currentTimeMillis()
val model = als.fit(training)
val t1 = System.currentTimeMillis()
```

Before this patch:  65.386 s
After this patch: 61.022 s

Closes #25576 from viirya/persist-item-factors.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-30 11:37:06 -05:00
Burak Yavuz 827969399b [SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE
### What changes were proposed in this pull request?

Adds support for the V2SessionCatalog for ALTER TABLE statements.
Implementation changes are ~50 loc. The rest is just test refactoring.

### Why are the changes needed?
To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements.

### How was this patch tested?

By re-using existing tests in DataSourceV2SQLSuite.

Closes #25502 from brkyvz/alterV3.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-30 14:16:47 +08:00
Liang-Chi Hsieh 2db45cbd5a [SPARK-28920][INFRA] Set up java version for github workflow
This patch adds java version parameter to GitHub workflow conf for JDK8/11.

As we want to build JDK8/11 on GitHub workflow, we might need to add java version according current matrix.

No

See the GitHub workflow run result.

Closes #25625 from viirya/github-workflow-java.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 20:55:14 -07:00
Dongjoon Hyun 780aa71749 [SPARK-28919][INFRA] Add more profiles for JDK8/11 build test for Github workflow
### What changes were proposed in this pull request?

This PR aims to add `-Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-3.2 -Phadoop-cloud` profiles to GitHub workflow conf.

### Why are the changes needed?

Currently, we build with JDK8 and test with JDK8/11 in Jenkins.
And, we use GitHub Workflow for JDK8/JDK11 building test.
To test JDK11 fully, we need to enable `hive` and `hadoop-3.2` profiles for `Hive 2.3.6` and `Hadoop 3.2`. Also, this PR adds all resource manager modules, too.

### Does this PR introduce any user-facing change?

No. In addition, Jenkins workload will be the same because this is specific to GitHub workflow.

### How was this patch tested?

See the GitHub workflow run result.

Closes #25624 from dongjoon-hyun/SPARK-JDK11-HIVE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 19:46:21 -07:00
Gabor Somogyi d502c80404 [SPARK-28922][SS] Safe Kafka parameter redaction
### What changes were proposed in this pull request?
At the moment Kafka parameter reduction is expecting `SparkEnv`.  This must exist in normal queries but several unit tests are not providing it to make things simple. As an end-result such tests are throwing similar exception:
```
java.lang.NullPointerException
	at org.apache.spark.kafka010.KafkaRedactionUtil$.redactParams(KafkaRedactionUtil.scala:29)
	at org.apache.spark.kafka010.KafkaRedactionUtilSuite.$anonfun$new$1(KafkaRedactionUtilSuite.scala:33)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
	at org.scalatest.Suite.run(Suite.scala:1147)
	at org.scalatest.Suite.run$(Suite.scala:1129)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1346)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1340)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1031)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1010)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
	at org.scalatest.tools.Runner$.run(Runner.scala:850)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
```
These are annoying and only red herrings so I would like to make them disappear.

There are basically 2 ways to handle this situation:
* Add default value for `SparkEnv` in `KafkaReductionUtil`
* Add `SparkEnv` to all such tests => I think it would be overkill and would just increase number of lines without real value

Considering this I've chosen the first approach.

### Why are the changes needed?
Couple of tests are throwing exceptions even if no real problem.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New + additional unit tests.

Closes #25621 from gaborgsomogyi/safe-reduct.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 19:17:48 -07:00
Ryan Blue 31b59bd805 [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python if not set
### What changes were proposed in this pull request?

When starting python processes, set `OMP_NUM_THREADS` to the number of cores allocated to an executor or driver if `OMP_NUM_THREADS` is not already set. Each python process will use the same `OMP_NUM_THREADS` setting, even if workers are not shared.

This avoids creating an OpenMP thread pool for parallel processing with a number of threads equal to the number of cores on the executor and [significantly reduces memory consumption](https://github.com/numpy/numpy/issues/10455). Instead, this threadpool should use the number of cores allocated to the executor, if available. If a setting for number of cores is not available, this doesn't change any behavior. OpenMP is used by numpy and pandas.

### Why are the changes needed?

To reduce memory consumption for PySpark jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Validated this reduces python worker memory consumption by more than 1GB on our cluster.

Closes #25545 from rdblue/SPARK-28843-set-omp-num-cores.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-30 10:29:46 +09:00
Wenchen Fan f8f7c52f12 [SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core
### What changes were proposed in this pull request?

There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them.

After merging, there are 3 classes:
1. `InMemoryTable`
2. `InMemoryTableCatalog`
3. `StagingInMemoryTableCatalog`

For better maintainability, these 3 classes are put in 3 different files.

### Why are the changes needed?

reduce duplicated code

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

N/A

Closes #25610 from cloud-fan/dsv2-test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-08-29 12:56:19 -07:00
Gabor Somogyi 7d72c073dd [SPARK-28760][SS][TESTS] Add Kafka delegation token end-to-end test with mini KDC
### What changes were proposed in this pull request?
At the moment no end-to-end Kafka delegation token test exists which was mainly because of missing embedded KDC. KDC is missing in general from the testing side so I've discovered what kind of possibilities are there. The most obvious choice is the MiniKDC inside the Hadoop library where Apache Kerby runs in the background. What this PR contains:
* Added MiniKDC as test dependency from Hadoop
* Added `maven-bundle-plugin` because couple of dependencies are coming in bundle format
* Added security mode to `KafkaTestUtils`. Namely start KDC -> start Zookeeper in secure mode -> start Kafka in secure mode
* Added a roundtrip test (saves and reads back data from Kafka)

### Why are the changes needed?
No such test exists + security testing with KDC is completely missing.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
I've put the additional test into a loop and was consuming ~10 sec average.

Closes #25477 from gaborgsomogyi/SPARK-28760.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-29 11:52:35 -07:00
Dilip Biswal fb1053d14a [SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference
### What changes were proposed in this pull request?
Document SHOW DATABASES statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="1234" alt="Screen Shot 2019-08-28 at 11 43 36 PM" src="https://user-images.githubusercontent.com/14225158/63916727-dd600380-c9ed-11e9-8372-789110c9d2dc.png">
<img width="1234" alt="Screen Shot 2019-08-28 at 11 43 57 PM" src="https://user-images.githubusercontent.com/14225158/63916734-e0f38a80-c9ed-11e9-8ad4-d854febeaab8.png">
<img width="1234" alt="Screen Shot 2019-08-28 at 11 44 13 PM" src="https://user-images.githubusercontent.com/14225158/63916740-e4871180-c9ed-11e9-9cfc-199cd8a64852.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #25526 from dilipbiswal/ref-doc-show-db.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-29 09:04:27 -07:00
Huaxin Gao 3e09a0fce9 [SPARK-28786][DOC][SQL] Document INSERT statement in SQL Reference
### What changes were proposed in this pull request?
Document INSERT statement in SQL Reference

### Why are the changes needed?
To complete SQL reference.

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Manually checked newly added doc.

Here are the screen shots:

![image](https://user-images.githubusercontent.com/13592258/63490232-0a01a180-c469-11e9-82de-cfdc7c2343e7.png)

![image](https://user-images.githubusercontent.com/13592258/63903006-cce56400-c9c0-11e9-9f24-badd586227a2.png)

<img width="1100" alt="Screen Shot 2019-08-27 at 5 01 48 PM" src="https://user-images.githubusercontent.com/13592258/63816303-845c7680-c8ec-11e9-8c36-1b8e4d3e6286.png">

<img width="1100" alt="Screen Shot 2019-08-27 at 5 03 22 PM" src="https://user-images.githubusercontent.com/13592258/63816347-ac4bda00-c8ec-11e9-9470-fa99522e6f14.png">

![image](https://user-images.githubusercontent.com/13592258/63817393-fc2ca000-c8f0-11e9-9d66-dd9b22a9d900.png)

<img width="1102" alt="Screen Shot 2019-08-27 at 5 05 13 PM" src="https://user-images.githubusercontent.com/13592258/63816423-ea48fe00-c8ec-11e9-8f66-5b226a1ff693.png">

![image](https://user-images.githubusercontent.com/13592258/63903080-0e760f00-c9c1-11e9-966a-f45b0b1c1ea6.png)

<img width="1100" alt="Screen Shot 2019-08-27 at 5 07 19 PM" src="https://user-images.githubusercontent.com/13592258/63816494-37c56b00-c8ed-11e9-88e1-27a9101eb09d.png">

![image](https://user-images.githubusercontent.com/13592258/63816712-131dc300-c8ee-11e9-8ee7-d83b8ad07bf2.png)

![image](https://user-images.githubusercontent.com/13592258/63817479-5a598300-c8f1-11e9-8789-adae7df5535a.png)

![image](https://user-images.githubusercontent.com/13592258/63817900-4adb3980-c8f3-11e9-94fe-d60f7d61c4b4.png)

![image](https://user-images.githubusercontent.com/13592258/63903155-4da46000-c9c1-11e9-88dd-609d4fe685a9.png)

![image](https://user-images.githubusercontent.com/13592258/63817157-d652cb80-c8ef-11e9-944c-99391cf2fb0a.png)

![image](https://user-images.githubusercontent.com/13592258/63903259-aa077f80-c9c1-11e9-982f-b8590ce0270d.png)

![image](https://user-images.githubusercontent.com/13592258/63903270-b1c72400-c9c1-11e9-85c6-6d8e8cd7f006.png)

Closes #25525 from huaxingao/spark-28786.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-29 09:00:42 -07:00
Gengliang Wang 24655583f1 [SPARK-28495][SQL][FOLLOW-UP] Disallow conversions between timestamp and long in ASNI mode
### What changes were proposed in this pull request?

Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy.

### Why are the changes needed?

In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allows the conversion.

After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers":
> A number is assignable only to sites of numeric type.

In PostgreSQL, the conversion between timestamp and long is also disallowed.

### Does this PR introduce any user-facing change?

Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy.

### How was this patch tested?

Unit test

Closes #25615 from gengliangwang/disallowTimeStampToLong.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-29 19:59:24 +08:00
Matt Hawes 137b20b964 [SPARK-28818][SQL] Respect source column nullability in the arrays created by freqItems()
### What changes were proposed in this pull request?
This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is:
```
StructField(name + "_freqItems", ArrayType(dataType, false))
```
Becomes:
```
StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable))
```

Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). Though it's likely that there a several places where this could cause a problem.

I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations.

### Why are the changes needed?
I think it's pretty clear why this change is needed. It fixes a bug that currently prevents users from calling `df.freqItems.collect()` along with potentially causing other, as yet unknown, issues.

### Does this PR introduce any user-facing change?
Nullability of columns when calling freqItems on them is now respected after the change.

### How was this patch tested?
I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected.

Closes #25575 from MGHawes/mhawes/SPARK-28818.

Authored-by: Matt Hawes <mhawes@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-29 10:49:10 +09:00
Dilip Biswal 74527868b2 [SPARK-28789][DOCS][SQL] Document ALTER DATABASE command
### What changes were proposed in this pull request?
Document ALTER DATABSE statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After.**
<img width="1234" alt="Screen Shot 2019-08-28 at 1 51 13 PM" src="https://user-images.githubusercontent.com/14225158/63891854-fc817580-c99a-11e9-918e-6b305edf92e6.png">
<img width="1234" alt="Screen Shot 2019-08-28 at 1 51 27 PM" src="https://user-images.githubusercontent.com/14225158/63891869-0acf9180-c99b-11e9-91a4-04d870474a40.png">

### How was this patch tested?
Tested using jykyll build --serve

Closes #25523 from dilipbiswal/ref-doc-alterdb.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-28 15:30:38 -07:00
Yuming Wang 1b404b9b99 [SPARK-28890][SQL] Upgrade Hive Metastore Client to the 3.1.2 for Hive 3.1
### What changes were proposed in this pull request?

Hive 3.1.2 has been released. This PR upgrades the Hive Metastore Client to 3.1.2 for Hive 3.1.

Hive 3.1.2 release notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843

### Why are the changes needed?

This is an improvement to support a newly release 3.1.2. Otherwise, it will throws `UnsupportedOperationException` if user `set spark.sql.hive.metastore.version=3.1.2`:
```scala
Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported Hive Metastore version (3.1.2). Please set spark.sql.hive.metastore.version with a valid version.
	at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:109)
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT

Closes #25604 from wangyum/SPARK-28890.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-28 09:16:54 -07:00
zhengruifeng 3e7b0e1dd6 [SPARK-28539][WEBUI][DOC] Document Executors page
### What changes were proposed in this pull request?
1, add a basic doc for executor page
2, btw, move the version number in the document of SQL page outside

### Why are the changes needed?
Spark web UIs are being used to monitor the status and resource consumption of your Spark applications and clusters. However, we do not have the corresponding document. It is hard for end users to use and understand them.

### Does this PR introduce any user-facing change?
yes, the doc is changed

### How was this patch tested?
locally build

<img width="468" alt="图片" src="https://user-images.githubusercontent.com/7322292/63758724-d2727980-c8ee-11e9-8380-cbae51453629.png">

Closes #25596 from zhengruifeng/doc_ui_exe.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-28 08:34:24 -05:00
Gengliang Wang 9d6bec183c [SPARK-28730][SPARK-28495][SQL][FOLLOW-UP] Revise the doc of option spark.sql.storeAssignmentPolicy
### What changes were proposed in this pull request?

Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`.

### Why are the changes needed?

1. Need to point out the ANSI mode is mostly the same with PostgreSQL
2. Need to point out Legacy mode allows type coercion as long as it is valid casting
3. Better examples.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Uni test

Closes #25605 from gengliangwang/reviseDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 19:59:53 +08:00
Yuming Wang e3b32da027 [SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs
## What changes were proposed in this pull request?

This PR update `spark.sql.statistics.fallBackToHdfs`'s doc:
1. This flag is effective only if it is Hive table.
2. For non-partitioned data source table, it will be automatically recalculated if table statistics are not available
3. For partitioned data source table, It is 'spark.sql.defaultSizeInBytes' if table statistics are not available.

Related code:
- Non-partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)) -> [LogicalRelation.computeStats()](a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)) -> [HadoopFsRelation.sizeInBytes()](c0632cec04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala (L72-L75)) -> [PartitioningAwareFileIndex.sizeInBytes()](b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103))
`PartitioningAwareFileIndex.sizeInBytes()` is calculated by [`allFiles().map(_.getLen).sum`](b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)) if table statistics are not available.

- Partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)) -> [LogicalRelation.computeStats()](a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)) -> [CatalogFileIndex.sizeInBytes](5d672b7f3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (L41))
`CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](c30b5297bc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L387)) if table statistics are not available.

## How was this patch tested?

N/A

Closes #24715 from wangyum/SPARK-25474.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 19:15:26 +08:00
hemanth meka 6252c54e39 [SPARK-23519][SQL] create view should work from query with duplicate output columns
**What changes were proposed in this pull request?**

Moving the call for checkColumnNameDuplication out of generateViewProperties. This way we can choose ifcheckColumnNameDuplication will be performed on analyzed or aliased plan without having to pass an additional argument(aliasedPlan) to generateViewProperties.

Before the pr column name duplication was performed on the query output of below sql(c1, c1) and the pr makes it perform check on the user provided schema of view definition(c1, c2)

**Why are the changes needed?**

Changes are to fix SPARK-23519 bug. Below queries would cause an exception. This pr fixes them and also added a test case.

`CREATE TABLE t23519 AS SELECT 1 AS c1
CREATE VIEW v23519 (c1, c2) AS SELECT c1, c1 FROM t23519`

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
new unit test added in SQLViewSuite

Closes #25570 from hem1891/SPARK-23519.

Lead-authored-by: hemanth meka <hmeka@tibco.com>
Co-authored-by: hem1891 <hem1891@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-28 12:11:10 +08:00
HyukjinKwon 8848af2635 [SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent for Spark conf to affect other tests
### What changes were proposed in this pull request?

This PR proposes to match the test with branch-2.4. See https://github.com/apache/spark/pull/25593#discussion_r318109047

Seems using `SparkSession.builder` with Spark conf possibly affects other tests.

### Why are the changes needed?
To match with branch-2.4 and to make easier to backport.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Test was fixed.

Closes #25603 from HyukjinKwon/SPARK-28881-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-28 10:39:21 +09:00
Wenchen Fan 90b10b4f7a [HOT-FIX] fix compilation
This is caused by 2 PRs that were merged at the same time:
cb06209fc9
2b24a71fec

Closes #25597 from cloud-fan/hot-fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 23:30:44 +08:00
Gengliang Wang 2b24a71fec [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
### What changes were proposed in this pull request?
 Introduce ANSI store assignment policy for table insertion.
With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL.

### Why are the changes needed?
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.

In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is breaking change. It is possible that existing queries are broken after 3.0 releases.

Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql.

### Does this PR introduce any user-facing change?
A new optional mode for table insertion.

### How was this patch tested?
Unit test

Closes #25581 from gengliangwang/ANSImode.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 22:13:23 +08:00
wuyi 70f4bbccc5 [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
## What changes were proposed in this pull request?

Since SPARK-27371 has supported GPU-aware resource scheduling in Standalone, this PR adds resources info in Standalone UI.

## How was this patch tested?

Updated `JsonProtocolSuite` and tested manually.

Master page:

![masterpage](https://user-images.githubusercontent.com/16397174/62835958-b933c100-bc90-11e9-814f-22bae048303d.png)

Worker page

![workerpage](https://user-images.githubusercontent.com/16397174/63417947-d2790200-c434-11e9-8979-36b8f558afd3.png)

Application page

![applicationpage](https://user-images.githubusercontent.com/16397174/62835964-cbadfa80-bc90-11e9-99a2-26e05421619a.png)

Closes #25409 from Ngone51/SPARK-28414.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-08-27 08:59:29 -05:00
zhengruifeng 7fe750674e [SPARK-11215][ML][FOLLOWUP] update the examples and suites using new api
## What changes were proposed in this pull request?
since method `labels` is already deprecated, we should update the examples and suites to turn off warings when compiling spark:
```
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala:65: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn]       .setLabels(labelIndexer.labels)
[warn]                               ^
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala:68: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn]       .setLabels(labelIndexer.labels)
[warn]                               ^
```

## How was this patch tested?
existing suites

Closes #25428 from zhengruifeng/del_stringindexer_labels_usage.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-27 08:58:32 -05:00
WeichenXu 7f605f5559 [SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true
### What changes were proposed in this pull request?

Make `spark.sql.crossJoin.enabled` default value true

### Why are the changes needed?

For implicit cross join, we can set up a watchdog to cancel it if running for a long time.
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.

So that in order to address this in simpler way, we can turn off showing this cross-join error by default.

For reference, I list some cases raising mismatching error here:
Providing:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #25520 from WeichenXu123/SPARK-28621.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:53:37 +08:00
Yuming Wang e12da8b957 [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
### What changes were proposed in this pull request?

This PR makes `spark.sql.statistics.fallBackToHdfs` not support Hive partitioned tables.

### Why are the changes needed?

The current implementation is incorrect for external partitions and it is expensive to support partitioned table with external partitions.

### Does this PR introduce any user-facing change?
Yes.  But I think it will not change the join strategy because partitioned table usually very large.

### How was this patch tested?
unit test

Closes #25584 from wangyum/SPARK-28876.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:37:18 +08:00
cyq89051127 4cf81285da [SPARK-28871][MINOR][DOCS] WaterMark doc fix
### What changes were proposed in this pull request?

The code style in the 'Policy for handling multiple watermarks' in structured-streaming-programming-guide.md

### Why are the changes needed?

Making it look friendly  to user.

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?

    cd docs
    SKIP_API=1 jekyll build

Closes #25580 from cyq89051127/master.

Authored-by: cyq89051127 <chaiyq@asiainfo.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-27 08:13:39 -05:00
Yuming Wang 96179732aa [SPARK-27592][SQL][TEST][FOLLOW-UP] Test set the partitioned bucketed data source table SerDe correctly
### What changes were proposed in this pull request?
This PR add test for set the partitioned bucketed data source table SerDe correctly.

### Why are the changes needed?
Improve test.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #25591 from wangyum/SPARK-27592-f1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:10:58 +08:00
Wenchen Fan cb06209fc9 [SPARK-28747][SQL] merge the two data source v2 fallback configs
## What changes were proposed in this pull request?

Currently we have 2 configs to specify which v2 sources should fallback to v1 code path. One config for read path, and one config for write path.

However, I found it's awkward to work with these 2 configs:
1. for `CREATE TABLE USING format`, should this be read path or write path?
2. for `V2SessionCatalog.loadTable`,  we need to return `UnresolvedTable` if it's a DS v1 or we need to fallback to v1 code path. However, at that time, we don't know if the returned table will be used for read or write.

We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to support falling back to v1 for read and write path separately.

This PR proposes to merge these 2 configs into one.

## How was this patch tested?

existing tests

Closes #25465 from cloud-fan/merge-conf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 20:47:24 +08:00
hongdd c02c86e4e8 [SPARK-28691][EXAMPLES] Add Java/Scala DirectKerberizedKafkaWordCount examples
## What changes were proposed in this pull request?

Now, DirectKafkaWordCount example is not support to visit kafka using kerberos authentication. Add Java/Scala DirectKerberizedKafkaWordCount.

## How was this patch tested?
Use cmd to visit kafka using kerberos authentication.
```
$ bin/run-example --files ${path}/kafka_jaas.conf \
   --driver-java-options "-Djava.security.auth.login.config=${path}/kafka_jaas.conf" \
   --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
   streaming.DirectKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
   consumer-group topic1,topic2
```

Closes #25412 from hddong/example-streaming-support-kafka-kerberos.

Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 21:25:39 +09:00
HyukjinKwon 00cb2f99cc [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize
### What changes were proposed in this pull request?
This PR proposes to add a test case for:

```bash
./bin/pyspark --conf spark.driver.maxResultSize=1m
spark.conf.set("spark.sql.execution.arrow.enabled",True)
```

```python
spark.range(10000000).toPandas()
```

```
Empty DataFrame
Columns: [id]
Index: []
```

which can result in partial results (see https://github.com/apache/spark/pull/25593#issuecomment-525153808). This regression was found between Spark 2.3 and Spark 2.4, and accidentally fixed.

### Why are the changes needed?
To prevent the same regression in the future.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Test was added.

Closes #25594 from HyukjinKwon/SPARK-28881.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 17:30:06 +09:00
Yuming Wang ab1819d38a [SPARK-28527][SQL][TEST][FOLLOW-UP] Ignores Thrift server ThriftServerQueryTestSuite
### What changes were proposed in this pull request?

This PR ignores Thrift server `ThriftServerQueryTestSuite`.

### Why are the changes needed?

This ThriftServerQueryTestSuite test case led to frequent Jenkins build failure.

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?
N/A

Closes #25592 from wangyum/SPARK-28527-f1.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 15:41:22 +09:00
Dongjoon Hyun 7701d29af5 [SPARK-28877][PYSPARK][test-hadoop3.2][test-java11] Make jaxb-runtime compile-time dependency
### What changes were proposed in this pull request?

`mllib` has `jaxb-runtime-2.3.2` as a runtime dependency. This PR makes it as a compile-time dependency. This doesn't change our dependency manifest and `LICENSE`s. Instead, this will add the following three jars into our pre-built artifacts.

- jaxb-runtime-2.3.2.jar
- jakarta.xml.bind-api-2.3.2.jar
- istack-commons-runtime-3.0.8.jar

### Why are the changes needed?

We need to use the followings.
- JDK8: `com.sun.xml.internal.bind.v2.ContextFactory`
- JDK11: `com.sun.xml.bind.v2.ContextFactory`

`com.sun.xml.bind.v2.ContextFactory` is inside `jaxb-runtime-2.3.2`.
```
$ javap -cp jaxb-runtime-2.3.2.jar com.sun.xml.bind.v2.ContextFactory
Compiled from "ContextFactory.java"
public class com.sun.xml.bind.v2.ContextFactory {
  public static final java.lang.String USE_JAXB_PROPERTIES;
  public com.sun.xml.bind.v2.ContextFactory();
  public static javax.xml.bind.JAXBContext createContext(java.lang.Class[], java.util.Map<java.lang.String, java.lang.Object>) throws javax.xml.bind.JAXBException;
  public static com.sun.xml.bind.api.JAXBRIContext createContext(java.lang.Class[], java.util.Collection<com.sun.xml.bind.api.TypeReference>, java.util.Map<java.lang.Class, java.lang.Class>, java.lang.String, boolean, com.sun.xml.bind.v2.model.annotation.RuntimeAnnotationReader, boolean, boolean, boolean) throws javax.xml.bind.JAXBException;
  public static com.sun.xml.bind.api.JAXBRIContext createContext(java.lang.Class[], java.util.Collection<com.sun.xml.bind.api.TypeReference>, java.util.Map<java.lang.Class, java.lang.Class>, java.lang.String, boolean, com.sun.xml.bind.v2.model.annotation.RuntimeAnnotationReader, boolean, boolean, boolean, boolean) throws javax.xml.bind.JAXBException;
  public static javax.xml.bind.JAXBContext createContext(java.lang.String, java.lang.ClassLoader, java.util.Map<java.lang.String, java.lang.Object>) throws javax.xml.bind.JAXBException;
}
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with `test-java11`.

For manual testing, do the following with JDK11.
```scala
$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.3+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.3+7, mixed mode)

$ build/sbt -Pyarn -Phadoop-3.2 -Phadoop-cloud -Phive -Phive-thriftserver -Psparkr test:package

$ python/run-tests.py --python-executables python --modules pyspark-ml
...
Finished test(python): pyspark.ml.recommendation (65s)
Tests passed in 174 seconds
```

Closes #25587 from dongjoon-hyun/SPARK-28877.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-26 23:23:17 -07:00
Burak Yavuz e31aec9be4 [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
### What changes were proposed in this pull request?

This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog.

### Why are the changes needed?

This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs.

### Does this PR introduce any user-facing change?

No behavior changes.

### How was this patch tested?

Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths.

Closes #25507 from brkyvz/insertSesh.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 12:59:53 +08:00
Nikita Gorbachevsky 13b1eb65d7 [SPARK-22955][DSTREAMS] - graceful shutdown shouldn't lead to job gen…
### What changes were proposed in this pull request?
During graceful shutdown of ``StreamingContext`` ``graph.stop()`` is invoked right after stopping of ``timer`` which generates new job. Thus it's possible that the latest jobs generated by timer are still in the middle of generation but invocation of ``graph.stop()`` closes some objects required to job generation, e.g. consumer for Kafka, and generation fails. That also leads to fully waiting of ``spark.streaming.gracefulStopTimeout`` which is equal to 10 batch intervals by default. Stopping of the graph should be performed later, after ``haveAllBatchesBeenProcessed`` is completed.

### How was this patch tested?
Added test to existing test suite.

Closes #25511 from choojoyq/SPARK-22955-job-generation-error-on-graceful-stop.

Authored-by: Nikita Gorbachevsky <nikitag@playtika.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-26 21:42:20 -05:00
Jungtaek Lim (HeartSaVioR) 64032cb01f [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
### What changes were proposed in this pull request?

This patch proposes to reuse KafkaSourceInitialOffsetWriter to remove identical code in KafkaSource.

Credit to jaceklaskowski for finding this.
https://lists.apache.org/thread.html/7faa6ac29d871444eaeccefc520e3543a77f4362af4bb0f12a3f7cb2%3Cdev.spark.apache.org%3E

### Why are the changes needed?

The code is duplicated with identical code, which opens the chance to maintain the code separately and might end up with bugs not addressed one side.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UTs, as it's simple refactor.

Closes #25583 from HeartSaVioR/MINOR-SS-reuse-KafkaSourceInitialOffsetWriter.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-26 18:06:18 -07:00
Gabor Somogyi b205269ae0 [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
### What changes were proposed in this pull request?
When Task retry happens with Kafka source then it's not known whether the consumer is the issue so the old consumer removed from cache and new consumer created. The feature works fine but not covered with tests.

In this PR I've added such test for DStreams + Structured Streaming.

### Why are the changes needed?
No such tests are there.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + new unit tests.

Closes #25582 from gaborgsomogyi/SPARK-28875.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-26 13:12:14 -07:00
shane knapp 84d4f94596 [SPARK-28701][INFRA][FOLLOWUP] Fix the key error when looking in os.environ
### What changes were proposed in this pull request?

i broke run-tests.py for non-PRB builds in this PR:
https://github.com/apache/spark/pull/25423

### Why are the changes needed?

to fix what i broke

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
the build system will test this

Closes #25585 from shaneknapp/fix-run-tests.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2019-08-26 12:40:31 -07:00
Alessandro Bellina dd0725d7ea [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
## What changes were proposed in this pull request?

This fixes issues that can arise when the jars for different hadoop versions mix, and short-circuits the case where we are running with a spark that was not built for yarn 3 (resource support).

## How was this patch tested?

I tested it manually.

Closes #25403 from abellina/SPARK-28679.

Authored-by: Alessandro Bellina <abellina@nvidia.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-26 12:00:33 -07:00
mcheah 2efa6f5dd3 [SPARK-28607][CORE][SHUFFLE] Don't store partition lengths twice
## What changes were proposed in this pull request?

The shuffle writer API introduced in SPARK-28209 has a flaw that leads to a memory usage regression - we ended up tracking the partition lengths in two places. Here, we modify the API slightly to avoid redundant tracking. The implementation of the shuffle writer plugin is now responsible for tracking the lengths of partitions, and propagating this back up to the higher shuffle writer as part of the commitAllPartitions API.

## How was this patch tested?

Existing unit tests.

Closes #25341 from mccheah/dont-redundantly-store-part-lengths.

Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-26 10:39:29 -07:00
shane knapp 13fd32c9a9 [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds
## What changes were proposed in this pull request?

we need to add the ability to test PRBs against java11.

see comments here:  https://github.com/apache/spark/pull/25405

## How was this patch tested?

the build system will test this.

Closes #25423 from shaneknapp/spark-prb-java11.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-27 00:48:01 +09:00