Commit graph

25188 commits

Shixiong Zhu 84a4d3a17c
[SPARK-28976][CORE] Use KeyLock to simplify MapOutputTracker.getStatuses
### What changes were proposed in this pull request?

Use `KeyLock` added in #25612 to simplify `MapOutputTracker.getStatuses`. It also has some improvement after the refactoring:
- `InterruptedException` is no longer swallowed.
- When a shuffle block is fetched, we don't need to wake up unrelated sleeping threads.

### Why are the changes needed?

`MapOutputTracker.getStatuses` is pretty hard to maintain right now because it has a special lock mechanism that we need to pay attention to whenever updating this method. As we can use `KeyLock` to hide the complexity of locking behind a dedicated lock class, it's better to refactor it to make it easy to understand and maintain.
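A self-contained sketch of the refactored pattern, assuming a simplified `KeyLock` (the real one is introduced in #25612, further down this log) and hypothetical stand-ins for `MapOutputTracker` internals:

```scala
import java.util.concurrent.ConcurrentHashMap

// Simplified per-key lock: threads contending on different keys never block
// each other. (A production version would also evict unused lock entries.)
class KeyLock[K] {
  private val locks = new ConcurrentHashMap[K, AnyRef]()
  def withLock[T](key: K)(fn: => T): T = {
    val lock = locks.computeIfAbsent(key, _ => new Object)
    lock.synchronized(fn)
  }
}

val statuses = new ConcurrentHashMap[Int, Array[String]]()  // shuffleId -> statuses
val fetchingLock = new KeyLock[Int]
def fetchFromTracker(shuffleId: Int): Array[String] = Array(s"status-$shuffleId")

def getStatuses(shuffleId: Int): Array[String] = {
  val cached = statuses.get(shuffleId)
  if (cached != null) return cached
  fetchingLock.withLock(shuffleId) {
    // Re-check under the per-key lock: only the first thread per shuffle
    // actually fetches; threads waiting on other shuffles are unaffected.
    var result = statuses.get(shuffleId)
    if (result == null) {
      result = fetchFromTracker(shuffleId)
      statuses.put(shuffleId, result)
    }
    result
  }
}
```

Compared to the old hand-rolled mechanism, a fetch for one shuffle no longer wakes or blocks threads waiting on a different shuffle.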

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #25680 from zsxwing/getStatuses.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-09-04 23:20:27 -07:00
Huaxin Gao e4f70023ad [SPARK-28830][DOC][SQL] Document UNCACHE TABLE statement in SQL Reference
### What changes were proposed in this pull request?
Document UNCACHE TABLE statement in SQL Reference
### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce any user-facing change?
Yes.

After change:

![image](https://user-images.githubusercontent.com/13592258/64299133-e04a7f00-cf2c-11e9-8f39-9b288e46c995.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25540 from huaxingao/spark-28830.

Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 21:42:01 -07:00
Ryan Blue 5adaa2e103 [SPARK-28979][SQL] Rename UnresolvedTable to V1Table
### What changes were proposed in this pull request?

Rename `UnresolvedTable` to `V1Table` because it is not unresolved.

### Why are the changes needed?

The class name is inaccurate. This should be fixed before it is in a release.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-05 11:41:21 +08:00
Xianjin YE ca71177868 [SPARK-28907][CORE] Review invalid usage of new Configuration()
### What changes were proposed in this pull request?
Replaces some incorrect usages of `new Configuration()`, as it loads only the default configs defined in Hadoop

### Why are the changes needed?
An unexpected config could be accessed instead of the expected one; see SPARK-28203 for an example.
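An illustration of the problem pattern and the usual fix (hypothetical, not code from this PR):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Problem: loads only the Hadoop defaults found on the classpath, so any
// spark.hadoop.* settings the user configured are silently ignored.
val detached = new Configuration()

// Fix: use the Configuration Spark builds from its own settings.
val wired = spark.sparkContext.hadoopConfiguration
```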

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #25616 from advancedxy/remove_invalid_configuration.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-04 19:52:19 -05:00
Sean Owen ded23f83dd [SPARK-28921][K8S][FOLLOWUP] Also bump K8S client version in integration-tests
### What changes were proposed in this pull request?

Per https://github.com/apache/spark/pull/25640#issuecomment-527397689 also bump K8S client version in integration-tests module.

### Why are the changes needed?

Harmonize the version as intended.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #25664 from srowen/SPARK-28921.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-04 19:51:04 -05:00
maryannxue a7a3935c97 [SPARK-11150][SQL] Dynamic Partition Pruning
### What changes were proposed in this pull request?
This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach:
1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or
2. As a subquery duplicate if the estimated benefit of the saved partition-table scan is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
3. As a bypassed condition (`true`).
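A hypothetical query shape that benefits from the approach above (table and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dpp").getOrCreate()
import spark.implicits._

// A fact table partitioned by the join key, and a small dimension table.
spark.range(1000).selectExpr("id % 10 AS store_id", "id AS amount")
  .write.partitionBy("store_id").saveAsTable("sales")
Seq((0, "US"), (1, "DE")).toDF("store_id", "country").write.saveAsTable("stores")

// With DPP, the runtime result of the dimension filter prunes which `sales`
// partitions are scanned, instead of scanning all ten.
spark.sql("""
  SELECT s.* FROM sales s JOIN stores d
  ON s.store_id = d.store_id WHERE d.country = 'US'
""").explain()
```

With the feature enabled, the physical plan should show a dynamic pruning expression on the `sales` scan rather than a full scan of all partitions.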

### Why are the changes needed?
This is an important performance feature.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT
- Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature.
- Testing DPP with reused broadcast results.
- Testing the key iterators on different HashedRelation types.
- Testing the packing and unpacking of the broadcast keys in a LongType.

Closes #25600 from maryannxue/dpp.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 13:13:23 -07:00
Dilip Biswal f96486b4aa [SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference
### What changes were proposed in this pull request?
Document SHOW FUNCTIONS statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**

![image](https://user-images.githubusercontent.com/11567269/64281840-e3cc0f00-cf08-11e9-9784-f01392276130.png)

<img width="589" alt="Screen Shot 2019-09-04 at 11 41 44 AM" src="https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png">

<img width="572" alt="Screen Shot 2019-09-04 at 11 41 54 AM" src="https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25539 from dilipbiswal/ref-doc-show-functions.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 11:47:10 -07:00
Dilip Biswal b992160eae [SPARK-28811][DOCS][SQL] Document SHOW TBLPROPERTIES in SQL Reference
### What changes were proposed in this pull request?
Document SHOW TBLPROPERTIES statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
![image](https://user-images.githubusercontent.com/11567269/64281442-fdb92200-cf07-11e9-90ba-4699b6e93e23.png)
![Screen Shot 2019-09-04 at 11 32 11 AM](https://user-images.githubusercontent.com/11567269/64281484-188b9680-cf08-11e9-8e42-f130751ca495.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25571 from dilipbiswal/ref-show-tblproperties.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 11:36:45 -07:00
Jungtaek Lim (HeartSaVioR) 594c9c5a3e [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
## What changes were proposed in this pull request?

This patch does pooling for both Kafka consumers and fetched data. The overall benefits of the patch are the following:

* Both pools support eviction of idle objects, which helps close invalid idle objects whose topic or partition is no longer assigned to any task.
* It also enables applying different policies per pool, which helps optimize pooling for each one.
* We were concerned about multiple tasks pointing to the same topic partition with the same group id; the existing code can't handle this, hence excess seeks and fetches could happen. This patch properly handles that case.
* It also makes the code always safe to leverage the cache, hence no need to maintain the reuseCache parameter.

Moreover, pooling Kafka consumers is implemented based on Apache Commons Pool, which also gives a couple of benefits:

* We can get rid of synchronization of KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer.
* We can extract the feature of object pool to outside of the class, so that the behaviors of the pool can be tested easily.
* We can get various statistics for the object pool, and also be able to enable JMX for the pool.

FetchedData instances are pooled by a custom pool implementation instead of Apache Commons Pool, because they are keyed by CacheKey (first key) plus a "desired offset" (second key), and the desired offset keeps changing; I haven't found any general pool implementation supporting this.

This patch brings an additional dependency, Apache Commons Pool 2.6.0, into the `spark-sql-kafka-0-10` module.
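A minimal, self-contained sketch of keyed pooling with Apache Commons Pool 2; the consumer type and key below are stand-ins, not the actual KafkaDataConsumer classes:

```scala
import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool, GenericKeyedObjectPoolConfig}

case class CacheKey(groupId: String, topic: String, partition: Int)
class PooledConsumer(val key: CacheKey) { def close(): Unit = () }  // stand-in

class ConsumerFactory extends BaseKeyedPooledObjectFactory[CacheKey, PooledConsumer] {
  override def create(key: CacheKey): PooledConsumer = new PooledConsumer(key)
  override def wrap(c: PooledConsumer): PooledObject[PooledConsumer] = new DefaultPooledObject(c)
  override def destroyObject(key: CacheKey, p: PooledObject[PooledConsumer]): Unit = p.getObject.close()
}

val config = new GenericKeyedObjectPoolConfig[PooledConsumer]()
config.setMinEvictableIdleTimeMillis(5 * 60 * 1000L)  // evict consumers idle for 5 min
config.setTimeBetweenEvictionRunsMillis(60 * 1000L)   // run the evictor every minute
val pool = new GenericKeyedObjectPool(new ConsumerFactory, config)

val key = CacheKey("group-1", "topic-1", 0)
val consumer = pool.borrowObject(key)  // exclusive while borrowed
try { /* fetch records */ } finally pool.returnObject(key, consumer)
```

Borrowed objects are exclusive to the borrower, which is what removes the need to synchronize on `KafkaDataConsumer` while acquiring and returning consumers.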

## How was this patch tested?

Existing unit tests as well as new tests for object pool.

Also ran an experiment to demonstrate concurrent access of consumers for the same topic partition.

* Changed both sides (master and patch) to log when a Kafka consumer is created or records are fetched from Kafka.
* branches
  * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging
  * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging
* Test query (doing self-join)
  * https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2
* Ran query from spark-shell, with using `local[*]` to maximize the chance to have concurrent access
* Collected the count of Kafka consumer creations via command: `grep "creating new Kafka consumer" logfile | wc -l`
* Collected the count of fetch requests on Kafka via command: `grep "fetching data from Kafka consumer" logfile | wc -l`

Topic and data distribution is as follows:

```
truck_speed_events_stream_spark_25151_v1:0:99440
truck_speed_events_stream_spark_25151_v1:1:99489
truck_speed_events_stream_spark_25151_v1:2:397759
truck_speed_events_stream_spark_25151_v1:3:198917
truck_speed_events_stream_spark_25151_v1:4:99484
truck_speed_events_stream_spark_25151_v1:5:497320
truck_speed_events_stream_spark_25151_v1:6:99430
truck_speed_events_stream_spark_25151_v1:7:397887
truck_speed_events_stream_spark_25151_v1:8:397813
truck_speed_events_stream_spark_25151_v1:9:0
```

The experiment used only the 4 smallest partitions (0, 1, 4, 6) so the query would finish earlier.

The result of the experiment is below:

branch | create Kafka consumer | fetch request
-- | -- | --
master | 1986 | 2837
patch | 8 | 1706

Closes #22138 from HeartSaVioR/SPARK-25151.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-04 10:17:38 -07:00
Jungtaek Lim (HeartSaVioR) 712874fa09 [SPARK-28931][CORE][TESTS] Fix couple of bugs in FsHistoryProviderSuite
### What changes were proposed in this pull request?

This patch fixes the bugs in test code itself, FsHistoryProviderSuite.

1. When creating a log file via `newLogFile`, the codec is ignored, leading to a wrong file name. (No one tends to write tests for test code, and the bug doesn't affect existing tests, so it was not easy to catch.)
2. When writing events to a log file via `writeFile`, the metadata (in the case of the new format) gets written to the file regardless of its codec, and the content is overwritten by another stream, so no Spark version information is available. This affects an existing test, which had a wrong expected value to work around the bug.

This patch also removes the redundant parameter `isNewFormat` in `writeFile`, as, per a review comment, Spark no longer supports the old format.

### Why are the changes needed?

Explained above why they're bugs, though they reside only in test code. (Note that the bugs did not come from non-test code.)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified existing UTs, and also read the event log file in the console to verify that metadata is not overwritten by other content.

Closes #25629 from HeartSaVioR/MINOR-FIX-FsHistoryProviderSuite.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-04 09:41:15 -07:00
angerszhu 9f478a6832 [SPARK-28901][SQL] Show canceled SQL operations in SparkThriftServer's JDBC Tab UI
### What changes were proposed in this pull request?
Currently, the Spark Thrift Server can't cancel a running SQL job. When we query through Hue via the Spark Thrift Server, run a SQL statement, and then click the cancel button, the cancellation doesn't take effect in the backend. In the Spark JDBC UI tab the SQL's status stays COMPILED and its duration keeps increasing, which may confuse people.

![image](https://user-images.githubusercontent.com/46485123/63869830-60338f00-c9eb-11e9-8776-cee965adcb0a.png)

### Why are the changes needed?

If the displayed status doesn't reflect a SQL statement's true status, it will confuse users.

### Does this PR introduce any user-facing change?

SparkThriftServer's UI tab will show a SQL statement's status as CANCELED when we cancel it.

### How was this patch tested?
Manually tested

UI TAB Status
![image](https://user-images.githubusercontent.com/46485123/63915010-80a12f00-ca67-11e9-9342-830dfa9c719f.png)

![image](https://user-images.githubusercontent.com/46485123/63915084-a9292900-ca67-11e9-8e26-375bf8ce0963.png)

backend log
![image](https://user-images.githubusercontent.com/46485123/63914864-1092a900-ca67-11e9-93f2-08690ed9abf4.png)

Closes #25611 from AngersZhuuuu/SPARK-28901.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-04 09:20:51 -07:00
yangjie01 a07f795aea [SPARK-28577][YARN] Add off-heap memory size to the resource capability requested for each executor
## What changes were proposed in this pull request?

If MEMORY_OFFHEAP_ENABLED is true, add MEMORY_OFFHEAP_SIZE to the resources requested for each executor, to ensure the instance has enough memory to use.

This PR adds a helper method `executorOffHeapMemorySizeAsMb` in `YarnSparkHadoopUtil`; a rough sketch of the idea follows.
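A rough sketch of what such a helper might look like (an assumption for illustration, not the merged implementation):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.network.util.JavaUtils

// Returns the off-heap size in MB to add to the container request, or 0 if
// off-heap memory is disabled. The config names are Spark's public settings.
def executorOffHeapMemorySizeAsMb(conf: SparkConf): Int = {
  if (conf.getBoolean("spark.memory.offHeap.enabled", false)) {
    (JavaUtils.byteStringAsBytes(conf.get("spark.memory.offHeap.size", "0")) >> 20).toInt
  } else {
    0
  }
}
```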

## How was this patch tested?
Added 3 new tests for `YarnSparkHadoopUtil#executorOffHeapMemorySizeAsMb`

Closes #25309 from LuciferYang/spark-28577.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-09-04 09:00:12 -05:00
Sean Owen df39855db8 [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases
### What changes were proposed in this pull request?

Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.

### Why are the changes needed?

If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly.

Closes #25667 from srowen/SPARK-28963.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-04 13:11:09 +09:00
hongdd a838699ee0 [SPARK-28694][EXAMPLES] Add Java/Scala StructuredKerberizedKafkaWordCount examples
### What changes were proposed in this pull request?
Add Java/Scala StructuredKerberizedKafkaWordCount examples to test kerberized kafka.

### Why are the changes needed?
Currently, the `StructuredKafkaWordCount` example does not support accessing Kafka with Kerberos authentication.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
```
Yarn client:
   $ bin/run-example --files ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab \
     --driver-java-options "-Djava.security.auth.login.config=${path}/kafka_driver_jaas.conf" \
     --conf \
     "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
     --master yarn
     sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
     subscribe topic1,topic2
Yarn cluster:
$ bin/run-example --files \
    ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab,${krb5_path}/krb5.conf \
    --driver-java-options \
    "-Djava.security.auth.login.config=./kafka_jaas.conf \
    -Djava.security.krb5.conf=./krb5.conf" \
    --conf \
    "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \
    --master yarn --deploy-mode cluster \
    sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \
    subscribe topic1,topic2
```

Closes #25649 from hddong/Add-StructuredKerberizedKafkaWordCount-examples.

Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-04 13:06:01 +09:00
Thomas Graves 4c8f114783 [SPARK-27489][WEBUI] UI updates to show executor resource information
### What changes were proposed in this pull request?
We are adding support for other resource types to executors and Spark. We should show the resource information for each executor on the UI Executors page.
This also adds a toggle button to show the resources column. It is off by default.

![executorui1](https://user-images.githubusercontent.com/4563792/63891432-c815b580-c9aa-11e9-9f41-62975649efbc.png)

![Screenshot from 2019-08-28 14-56-26](https://user-images.githubusercontent.com/4563792/63891516-fd220800-c9aa-11e9-9fe4-89fcdca37306.png)

### Why are the changes needed?
To show users what resources the executors have, like GPUs, FPGAs, etc.

### Does this PR introduce any user-facing change?
Yes, it introduces UI and REST API changes to show the resources

### How was this patch tested?
Unit tests and manual UI tests on yarn and standalone modes.

Closes #25613 from tgravescs/SPARK-27489-gpu-ui-latest.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-09-04 09:45:44 +08:00
Huaxin Gao 56f2887dc8 [SPARK-28788][DOC][SQL] Document ANALYZE TABLE statement in SQL Reference
### What changes were proposed in this pull request?
Document ANALYZE TABLE statement in SQL Reference

### Why are the changes needed?
To complete SQL reference

### Does this PR introduce any user-facing change?
Yes

***Before***:
There was no documentation for this.

***After***:
![image](https://user-images.githubusercontent.com/13592258/64046883-f8339480-cb21-11e9-85da-6617d5c96412.png)

![image](https://user-images.githubusercontent.com/13592258/64209526-9a6eb780-ce55-11e9-9004-53c5c5d24567.png)

![image](https://user-images.githubusercontent.com/13592258/64209542-a2c6f280-ce55-11e9-8624-e7349204ec8e.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25524 from huaxingao/spark-28788.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-03 15:26:12 -07:00
Shixiong Zhu 89800931aa
[SPARK-3137][CORE] Replace the global TorrentBroadcast lock with fine grained KeyLock
### What changes were proposed in this pull request?

This PR provides a new lock mechanism, `KeyLock`, to lock with a given key. It also uses this new lock in `TorrentBroadcast` to avoid blocking tasks from fetching different broadcast values.

### Why are the changes needed?

`TorrentBroadcast.readObject` uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25612 from zsxwing/SPARK-3137.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-09-03 14:09:07 -07:00
Ryan Blue 5ea134c354 [SPARK-28628][SQL] Implement SupportsNamespaces in V2SessionCatalog
## What changes were proposed in this pull request?

This adds namespace support to V2SessionCatalog.

## How was this patch tested?

WIP: will add tests for v2 session catalog namespace methods.

Closes #25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-09-03 13:13:27 -07:00
Xiao Li 2856398de9 [SPARK-28961][HOT-FIX][BUILD] Upgrade Maven from 3.6.1 to 3.6.2
### What changes were proposed in this pull request?
This PR is to upgrade the Maven dependency from 3.6.1 to 3.6.2.

### Why are the changes needed?
All the builds are broken because 3.6.1 is not available.  http://ftp.wayne.edu/apache//maven/maven-3/

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/485/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10536/

![image](https://user-images.githubusercontent.com/11567269/64196667-36d69100-ce39-11e9-8f93-40eb333d595d.png)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25665 from gatorsmile/upgradeMVN.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-03 11:06:57 -07:00
Dilip Biswal 94e66744a7 [SPARK-28805][DOCS][SQL] Document DESCRIBE FUNCTION in SQL Reference
### What changes were proposed in this pull request?
Document DESCRIBE FUNCTION statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1234" alt="Screen Shot 2019-09-02 at 11 14 09 PM" src="https://user-images.githubusercontent.com/14225158/64148193-85534380-cdd7-11e9-9c07-5956b5e8276e.png">
<img width="1234" alt="Screen Shot 2019-09-02 at 11 14 29 PM" src="https://user-images.githubusercontent.com/14225158/64148201-8a17f780-cdd7-11e9-93d8-10ad9932977c.png">
<img width="1234" alt="Screen Shot 2019-09-02 at 11 14 42 PM" src="https://user-images.githubusercontent.com/14225158/64148208-8dab7e80-cdd7-11e9-97c5-3a4ce12cac7a.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25530 from dilipbiswal/ref-doc-desc-function.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-03 09:45:58 -07:00
Dilip Biswal 92ae271081 [SPARK-28806][DOCS][SQL] Document SHOW COLUMNS in SQL Reference
### What changes were proposed in this pull request?
Document SHOW COLUMNS statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1234" alt="Screen Shot 2019-09-02 at 11 07 48 PM" src="https://user-images.githubusercontent.com/14225158/64148033-0fe77300-cdd7-11e9-93ee-e5951c7ed33c.png">
<img width="1234" alt="Screen Shot 2019-09-02 at 11 08 08 PM" src="https://user-images.githubusercontent.com/14225158/64148039-137afa00-cdd7-11e9-8bec-634ea9d2594c.png">
<img width="1234" alt="Screen Shot 2019-09-02 at 11 11 45 PM" src="https://user-images.githubusercontent.com/14225158/64148046-17a71780-cdd7-11e9-91c3-95a9c97e7a77.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25531 from dilipbiswal/ref-doc-show-columns.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-03 09:39:26 -07:00
HyukjinKwon 5cf2602ccb [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows
### What changes were proposed in this pull request?

This PR adds three more pieces of information:

- Mentions that `bash` in `PATH` is required to build
- Specifies supported JDK and Maven versions
- Explicitly mentions that building on Windows is not officially supported

### Why are the changes needed?

To enable SparkR developers on Windows to work, and to describe what is needed for the AppVeyor build.

### Does this PR introduce any user-facing change?

No. It just adds some information in `R/WINDOWS.md`

### How was this patch tested?

This is already being tested in AppVeyor. Also, I tested it myself (long ago, though).

Closes #25647 from HyukjinKwon/SPARK-28946.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-03 15:08:18 +09:00
Xianjin YE d5688dc732 [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
## What changes were proposed in this pull request?
Datasource tables have supported partitioning for a long time. This commit adds the ability to translate `InsertIntoTable(HiveTableRelation)` into a datasource table insertion.

## How was this patch tested?
Existing tests with some modification

Closes #25306 from advancedxy/SPARK-28573.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-03 13:40:06 +08:00
Andy Grove 35d4edffa2 [SPARK-28921][BUILD][K8S] Upgrade kubernetes client to 4.4.2
### What changes were proposed in this pull request?

Upgrade kubernetes client from 4.1.2 to 4.4.2

### Why are the changes needed?

To fix a compatibility issue with EKS, since Amazon rolled out security patches over the past week: 1.15.3, 1.14.6, 1.13.10, 1.12.10, and 1.11.10.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins and manually test on EKS.

Closes #25640 from andygrove/SPARK-28921.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-02 16:50:58 -07:00
Dongjoon Hyun 560df0ea8e [SPARK-28951][INFRA] Add release announce template
### What changes were proposed in this pull request?

This PR adds a release announce template.

### Why are the changes needed?

- We want to use a formal template, including HTTPS, for future releases.
- Future release managers won't need to search the mailing list to find this form.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #25656 from dongjoon-hyun/SPARK-28951.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-02 14:55:05 -07:00
sandeep katta e1946a598b [SPARK-28705][SQL][TEST] Drop tables after being used in AnalysisExternalCatalogSuite
## What changes were proposed in this pull request?

Drop the table after the test `query builtin functions don't call the external catalog` has executed

This is required for [SPARK-25464](https://github.com/apache/spark/pull/22466)

## How was this patch tested?

existing UT

Closes #25427 from sandeep-katta/cleanuptable.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-02 20:32:32 +09:00
HyukjinKwon bd3915e356 Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"
This reverts commit 3821d75b83.
2019-09-02 12:47:14 +09:00
Liang-Chi Hsieh 19f882ce1b [SPARK-28933][ML] Reduce unnecessary shuffle in ALS when initializing factors
### What changes were proposed in this pull request?

When initializing factors in ALS, we should use `mapPartitions` instead of the current `map`, so we can preserve the existing partitioning of the `InBlock` RDD. The `InBlock` RDD is already partitioned by src block id, and we don't change the partitioning when initializing factors.
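An illustrative sketch of the difference on a toy RDD (not the actual ALS code):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.util.Random

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("als-init"))
val inBlocks = sc.parallelize(0 until 100).map(i => (i, i)).partitionBy(new HashPartitioner(8))

// map() discards the partitioner even though the keys are unchanged ...
val viaMap = inBlocks.map { case (srcBlockId, v) => (srcBlockId, v + 1) }
assert(viaMap.partitioner.isEmpty)

// ... while mapPartitions(..., preservesPartitioning = true) keeps it, so a
// later join/cogroup on the same key needs no shuffle.
val factors = inBlocks.mapPartitions({ iter =>
  val rng = new Random(42)
  iter.map { case (srcBlockId, _) => (srcBlockId, Array.fill(4)(rng.nextFloat())) }
}, preservesPartitioning = true)
assert(factors.partitioner == inBlocks.partitioner)
```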

### Why are the changes needed?

This patch reduces unnecessary shuffles after initializing factors.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

It should not change existing tests, and it should pass the added test that verifies the shuffle dependency of the factor RDDs.

Closes #25639 from viirya/fix-als-partition.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-09-01 19:49:50 -07:00
Huaxin Gao 585954dbed [SPARK-28790][DOC][SQL] Document CACHE TABLE statement in SQL Reference
### What changes were proposed in this pull request?
Document CACHE TABLE statement in SQL Reference

### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce any user-facing change?
Yes.

Here is the screen shot:

![image](https://user-images.githubusercontent.com/13592258/64072307-26f45c80-cc41-11e9-8ab3-dc56fe8ff45f.png)

![image](https://user-images.githubusercontent.com/13592258/64072309-2cea3d80-cc41-11e9-9a4d-8cb9eb63569f.png)

### How was this patch tested?
Tested using jekyll build --serve

Closes #25532 from huaxingao/spark-28790.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-01 17:08:09 -07:00
Sean Owen eb037a8180 [SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations
### What changes were proposed in this pull request?

The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However, there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations, and remove the `:: Experimental ::` scaladoc tag too, and likewise for Python and R.

The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)

It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.

### Why are the changes needed?

Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25558 from srowen/SPARK-28855.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-01 10:15:00 -05:00
Ryan Blue 3821d75b83 [SPARK-28612][SQL] Add DataFrameWriterV2 API
## What changes were proposed in this pull request?

This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:

* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.

Here are a few example uses of the new API:

```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```

## How was this patch tested?

Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.

Closes #25354 from rdblue/SPARK-28612-add-data-frame-writer-v2.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-08-31 21:28:20 -07:00
Huaxin Gao b85a554487 [SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold
### What changes were proposed in this pull request?
Change "Related Statements" to bold

### Why are the changes needed?
To make doc look nice and consistent.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tested using jekyll build --serve

Before the change:
![image](https://user-images.githubusercontent.com/13592258/63965303-ae797a00-ca4d-11e9-8a85-71fbfdeaaccb.png)

After the change:
![image](https://user-images.githubusercontent.com/13592258/63965316-b76a4b80-ca4d-11e9-9a85-48d7a909f0ef.png)

Before the change:
![image](https://user-images.githubusercontent.com/13592258/63988989-7c8b0680-ca93-11e9-9352-a9ec5457b279.png)

After the change:
![image](https://user-images.githubusercontent.com/13592258/63988996-87459b80-ca93-11e9-9e51-8cb36a632436.png)

Closes #25623 from huaxingao/spark-28786-n.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-31 14:58:41 -07:00
Dilip Biswal b4d7b30aa6 [SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference
### What changes were proposed in this pull request?
Document DESCRIBE TABLE statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1234" alt="Screen Shot 2019-08-31 at 1 53 35 PM" src="https://user-images.githubusercontent.com/14225158/64069071-f556a380-cbf6-11e9-985d-13dd37a32bbb.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 53 50 PM" src="https://user-images.githubusercontent.com/14225158/64069073-f982c100-cbf6-11e9-925b-eb2fc85c3341.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 02 PM" src="https://user-images.githubusercontent.com/14225158/64069076-0ef7eb00-cbf7-11e9-8062-9a9fb8700bb3.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 15 PM" src="https://user-images.githubusercontent.com/14225158/64069077-0f908180-cbf7-11e9-9a31-9b7f122db2d3.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 30 PM" src="https://user-images.githubusercontent.com/14225158/64069078-0f908180-cbf7-11e9-96ee-438a7b64c961.png">
<img width="1234" alt="Screen Shot 2019-08-31 at 1 54 42 PM" src="https://user-images.githubusercontent.com/14225158/64069079-0f908180-cbf7-11e9-9bae-734a1994f936.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25527 from dilipbiswal/ref-doc-desc-table.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-31 14:46:55 -07:00
Unknown d573e4c482 [SPARK-28542][DOCS][WEBUI] Stages Tab
### What changes were proposed in this pull request?
New documentation explaining the Web UI Stages page in detail. New images are included for better explanation.
![image](https://user-images.githubusercontent.com/12819544/63807320-c05bff80-c91d-11e9-986f-e09d0b8d4bbb.png)
![image](https://user-images.githubusercontent.com/12819544/63807343-cd78ee80-c91d-11e9-9e4a-2cef3ff70577.png)
![image](https://user-images.githubusercontent.com/12819544/63807363-d9fd4700-c91d-11e9-9691-1d39b0e2c69e.png)
![image](https://user-images.githubusercontent.com/12819544/63807384-e41f4580-c91d-11e9-92bd-cb01aced3752.png)

### Does this PR introduce any user-facing change?
Only documentation

### How was this patch tested?
I have generated it using "jekyll build" to ensure that it's ok

Closes #25598 from planga82/feature/SPARK-28542_ImproveWebUIStagesPage.

Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-31 13:33:44 -05:00
Dongjoon Hyun 1f96ce5443 [SPARK-28932][BUILD] Add scala-library test dependency to network-common module for JDK11
### What changes were proposed in this pull request?

This PR adds `scala-library` test dependency to `network-common` module for JDK11.

### Why are the changes needed?

In JDK11, the following command fails due to a missing Scala library.
```
mvn clean install -pl common/network-common -DskipTests
```

**BEFORE**
```
...
error: fatal error: object scala in compiler mirror not found.
one error found
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

**AFTER**
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual. On JDK11, do the following.
```
mvn clean install -pl common/network-common -DskipTests
```

Closes #25638 from dongjoon-hyun/SPARK-28932.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-31 10:59:20 -07:00
Sean Owen d5b7eed12f [SPARK-28903][STREAMING][PYSPARK][TESTS] Fix AWS JDK version conflict that breaks Pyspark Kinesis tests
The Pyspark Kinesis tests are failing, at least in master:

```
======================================================================
ERROR: test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/streaming/tests/test_kinesis.py", line 44, in test_kinesis_stream
    kinesisTestUtils = self.ssc._jvm.org.apache.spark.streaming.kinesis.KinesisTestUtils(2)
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.apache.spark.streaming.kinesis.KinesisTestUtils.
: java.lang.NoSuchMethodError: com.amazonaws.regions.Region.getAvailableEndpoints()Ljava/util/Collection;
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1(KinesisTestUtils.scala:211)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.$anonfun$getRegionNameByEndpoint$1$adapted(KinesisTestUtils.scala:211)
	at scala.collection.Iterator.find(Iterator.scala:993)
	at scala.collection.Iterator.find$(Iterator.scala:990)
	at scala.collection.AbstractIterator.find(Iterator.scala:1429)
	at scala.collection.IterableLike.find(IterableLike.scala:81)
	at scala.collection.IterableLike.find$(IterableLike.scala:80)
	at scala.collection.AbstractIterable.find(Iterable.scala:56)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils$.getRegionNameByEndpoint(KinesisTestUtils.scala:211)
	at org.apache.spark.streaming.kinesis.KinesisTestUtils.<init>(KinesisTestUtils.scala:46)
...
```

The non-Python Kinesis tests are fine though. It turns out that this is because Pyspark tests use the output of the Spark assembly, and it pulls in `hadoop-cloud`, which in turn pulls in an old AWS Java SDK.

Per Steve Loughran (below), it seems like we can just resolve this by excluding the aws-java-sdk dependency. See the attached PR for some more detail about the debugging and other options.

See https://github.com/apache/spark/pull/25558#issuecomment-524042709

Closes #25559 from srowen/KinesisTest.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-31 10:29:46 -05:00
Dilip Biswal a08f33be68 [SPARK-28804][DOCS][SQL] Document DESCRIBE QUERY in SQL Reference
### What changes were proposed in this pull request?
Document DESCRIBE QUERY statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1234" alt="Screen Shot 2019-08-29 at 5 47 51 PM" src="https://user-images.githubusercontent.com/14225158/63985609-43e43080-ca85-11e9-8a1a-c9c15d988e24.png">
<img width="1234" alt="Screen Shot 2019-08-29 at 5 48 06 PM" src="https://user-images.githubusercontent.com/14225158/63985610-46468a80-ca85-11e9-882a-7163784f72c6.png">
<img width="1234" alt="Screen Shot 2019-08-29 at 5 48 18 PM" src="https://user-images.githubusercontent.com/14225158/63985617-49da1180-ca85-11e9-9e77-a6d6c7042a85.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25529 from dilipbiswal/ref-doc-desc-query.

Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-30 16:05:16 -07:00
HyukjinKwon 7cc0f0e9a7 [SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results
### What changes were proposed in this pull request?

See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/

![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
    ...
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
    </testcase>
    <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
    </testcase>
    <system-out/>
    <system-err/>
</testsuite>
```

The root cause seems to be a bug in SBT - it truncates the test name based on the last dot.

https://github.com/sbt/sbt/issues/2949
https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79

I tried to find a better way but couldn't find one. Therefore, this PR proposes a workaround by appending the test file name to the assert log:

```diff
  [info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
+ [info]   inner-join.sql
  [info]   Expected "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[]", but got "1	a
  [info]   1	a
  [info]   1	b
  [info]   1[	b]" Result did not match for query #6
  [info]   SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
  [info]   org.scalatest.exceptions.TestFailedException:
  [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
```

It will at least save us from searching the full logs to identify which test file failed when clicking a failed test.

Note that this PR does not fully fix the issue; it only fixes the logs of its failed tests.

### Why are the changes needed?
To make Jenkins logs easier to debug. Otherwise, we have to open the full logs and search for which test failed.

### Does this PR introduce any user-facing change?
It will print out the file name of failed tests in Jenkins' test reports.

### How was this patch tested?
Manually tested but Jenkins tests are required in this PR.

Now it at least shows which file it is:

![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)

Closes #25630 from HyukjinKwon/SPARK-28894-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 15:10:40 -07:00
younggyu chun 3b07a4eb28 [SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type
## What changes were proposed in this pull request?
This PR aims to accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type, and to ignore surrounding whitespace. Please see the following links for the string representations other databases use for the boolean type.

https://www.postgresql.org/docs/devel/datatype-boolean.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
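A hypothetical illustration of the accepted forms, assuming PostgreSQL-like semantics per the linked docs (the exact rules live in `Cast` and are exercised in `CastSuite`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bool-cast").getOrCreate()

spark.sql("SELECT CAST(' yes ' AS BOOLEAN)").show()  // true:  input is trimmed, "yes" accepted
spark.sql("SELECT CAST('t' AS BOOLEAN)").show()      // true:  unique prefix of "true"
spark.sql("SELECT CAST('0' AS BOOLEAN)").show()      // false: "0" accepted
```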

## How was this patch tested?
Added new tests to CastSuite.

Closes #25458 from younggyuchun/SPARK-27931.

Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 14:18:13 -07:00
mcheah ea90ea6ce7 [SPARK-28571][CORE][SHUFFLE] Use the shuffle writer plugin for the SortShuffleWriter
## What changes were proposed in this pull request?

Use the shuffle writer APIs introduced in SPARK-28209 in the sort shuffle writer.
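For context, a rough sketch of the writer-plugin shape from SPARK-28209 that this change adopts; the trait and method names below are approximations of the API, not verbatim:

```scala
import java.io.OutputStream

// Approximate shape of the SPARK-28209 writer plugin (names are assumptions):
trait ShufflePartitionWriter {
  def openStream(): OutputStream  // sink for one reduce partition's bytes
}
trait ShuffleMapOutputWriter {
  def getPartitionWriter(reducePartitionId: Int): ShufflePartitionWriter
  def commitAllPartitions(): Unit  // finalize the map task's output
}

// SortShuffleWriter then streams each sorted partition through the plugin
// rather than writing local files directly, so alternative shuffle storage
// backends can be swapped in without changing the writer logic.
```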

## How was this patch tested?

Existing unit tests were changed to use the plugin instead, and they used the local disk version to ensure that there were no regressions.

Closes #25342 from mccheah/shuffle-writer-refactor-sort-shuffle-writer.

Lead-authored-by: mcheah <mcheah@palantir.com>
Co-authored-by: mccheah <mcheah@palantir.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-30 09:43:07 -07:00
HyukjinKwon 92cabf6306 [SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.2.0 and fix build profile on AppVeyor
### What changes were proposed in this pull request?

This PR proposes to upgrade scala-maven-plugin from 3.4.4 to 4.2.0.

Upgrade to 4.1.1 was reverted due to unexpected build failure on AppVeyor.

The root cause seems to be an issue specific to AppVeyor - loading the system library 'kernel32.dll' seems to fail.

```
Suppressed: java.lang.NoClassDefFoundError: Could not initialize class com.sun.jna.platform.win32.Kernel32
        at sbt.internal.io.WinMilli$.getHandle(Milli.scala:264)
        at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:289)
        at sbt.internal.io.WinMilli$.getModifiedTimeNative(Milli.scala:260)
        at sbt.internal.io.MilliNative.getModifiedTime(Milli.scala:61)
        at sbt.internal.io.Milli$.getModifiedTime(Milli.scala:360)
        at sbt.io.IO$.$anonfun$getModifiedTimeOrZero$1(IO.scala:1373)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at sbt.internal.io.Retry$.liftedTree2$1(Retry.scala:38)
        at sbt.internal.io.Retry$.impl$1(Retry.scala:38)
        at sbt.internal.io.Retry$.apply(Retry.scala:52)
        at sbt.internal.io.Retry$.apply(Retry.scala:24)
        at sbt.io.IO$.getModifiedTimeOrZero(IO.scala:1373)
        at sbt.internal.inc.caching.ClasspathCache$.fromCacheOrHash$1(ClasspathCache.scala:44)
        at sbt.internal.inc.caching.ClasspathCache$.$anonfun$hashClasspath$1(ClasspathCache.scala:53)
        at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:659)
        at scala.collection.parallel.Task.$anonfun$tryLeaf$1(Tasks.scala:53)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.control.Breaks$$anon$1.catchBreak(Breaks.scala:67)
        at scala.collection.parallel.Task.tryLeaf(Tasks.scala:56)
        at scala.collection.parallel.Task.tryLeaf$(Tasks.scala:50)
        at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
        at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal(Tasks.scala:170)
        ... 25 more
```

By setting `-Djna.nosys=true`, it loads the library directly from the jar instead of from the system.

In this way, the build works fine.

### Why are the changes needed?

It upgrades the plugin to fix bugs and fixes the CI build.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

It was tested at https://github.com/apache/spark/pull/25497

Closes #25633 from HyukjinKwon/SPARK-28759.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-30 09:39:15 -07:00
Liang-Chi Hsieh 2bd02e2b41 [SPARK-28866][ML] Persist item factors RDD when checkpointing in ALS
### What changes were proposed in this pull request?

In the ALS ML implementation, for the non-implicit case, we checkpoint the RDD of item factors at intervals. Before being checkpointed (.checkpoint()) and materialized (.count()), this RDD was not persisted, which causes recomputation. In an experiment, there is a performance difference between persisting and not persisting before checkpointing the RDD.

The performance difference is not big, but neither is this change. The actual performance difference varies depending on the checkpoint interval, the training dataset, etc.

### Why are the changes needed?

Persisting the RDD before checkpointing the RDD of item factors avoids recomputation.
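A sketch of the persist-before-checkpoint pattern on a toy RDD (not the ALS internals):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("persist-ckpt"))
sc.setCheckpointDir("/tmp/spark-ckpt")
val itemFactors = sc.parallelize(1 to 1000).map(i => (i, Array.fill(8)(i.toFloat)))

itemFactors.persist(StorageLevel.MEMORY_AND_DISK)  // cache first ...
itemFactors.checkpoint()                           // ... so the checkpoint job
itemFactors.count()                                // reads the cache instead of
                                                   // recomputing the lineage
```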

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually checked whether the RDD is recomputed or not.

Took 30% of the MovieLens 20M dataset as the training dataset, set a checkpoint dir for SparkContext, and fit an ALS model like:

```scala
val als = new ALS()
      .setMaxIter(100)
      .setCheckpointInterval(5)
      .setRegParam(0.01)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

val t0 = System.currentTimeMillis()
val model = als.fit(training)
val t1 = System.currentTimeMillis()
```

Before this patch:  65.386 s
After this patch: 61.022 s

Closes #25576 from viirya/persist-item-factors.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-30 11:37:06 -05:00
Burak Yavuz 827969399b [SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE
### What changes were proposed in this pull request?

Adds support for the V2SessionCatalog for ALTER TABLE statements.
Implementation changes are ~50 loc. The rest is just test refactoring.

### Why are the changes needed?
To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements.

### How was this patch tested?

By re-using existing tests in DataSourceV2SQLSuite.

Closes #25502 from brkyvz/alterV3.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-30 14:16:47 +08:00
Liang-Chi Hsieh 2db45cbd5a [SPARK-28920][INFRA] Set up java version for github workflow
### What changes were proposed in this pull request?

This patch adds a java version parameter to the GitHub workflow conf for JDK8/11.

### Why are the changes needed?

As we want to build with JDK8/11 on the GitHub workflow, we need to set the java version according to the current matrix.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

See the GitHub workflow run result.

Closes #25625 from viirya/github-workflow-java.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 20:55:14 -07:00
Dongjoon Hyun 780aa71749 [SPARK-28919][INFRA] Add more profiles for JDK8/11 build test for Github workflow
### What changes were proposed in this pull request?

This PR aims to add `-Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-3.2 -Phadoop-cloud` profiles to GitHub workflow conf.

### Why are the changes needed?

Currently, we build with JDK8 and test with JDK8/11 in Jenkins.
And, we use GitHub Workflow for JDK8/JDK11 building test.
To test JDK11 fully, we need to enable the `hive` and `hadoop-3.2` profiles for `Hive 2.3.6` and `Hadoop 3.2`. This PR also adds all the resource manager modules.

### Does this PR introduce any user-facing change?

No. In addition, Jenkins workload will be the same because this is specific to GitHub workflow.

### How was this patch tested?

See the GitHub workflow run result.

Closes #25624 from dongjoon-hyun/SPARK-JDK11-HIVE.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 19:46:21 -07:00
Gabor Somogyi d502c80404 [SPARK-28922][SS] Safe Kafka parameter redaction
### What changes were proposed in this pull request?
At the moment Kafka parameter redaction expects `SparkEnv`. This exists in normal queries, but several unit tests do not provide it, to keep things simple. As a result, such tests throw an exception like:
```
java.lang.NullPointerException
	at org.apache.spark.kafka010.KafkaRedactionUtil$.redactParams(KafkaRedactionUtil.scala:29)
	at org.apache.spark.kafka010.KafkaRedactionUtilSuite.$anonfun$new$1(KafkaRedactionUtilSuite.scala:33)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
	at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
	at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
	at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
	at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
	at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
	at org.scalatest.Suite.run(Suite.scala:1147)
	at org.scalatest.Suite.run$(Suite.scala:1129)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
	at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
	at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
	at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1346)
	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1340)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1031)
	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1010)
	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
	at org.scalatest.tools.Runner$.run(Runner.scala:850)
	at org.scalatest.tools.Runner.run(Runner.scala)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131)
	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
```
These are annoying red herrings, so I would like to make them disappear.

There are basically 2 ways to handle this situation:
* Add a default value for `SparkEnv` in `KafkaRedactionUtil`
* Add `SparkEnv` to all such tests => I think this would be overkill and would just increase the number of lines without real value

Considering this, I've chosen the first approach.
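A hypothetical sketch of that first approach (the actual `KafkaRedactionUtil` change may differ in details): fall back to a safe default when no `SparkEnv` is available:

```scala
import java.util.regex.Pattern
import org.apache.spark.SparkEnv

// spark.redaction.regex is Spark's public redaction setting; the literal
// fallback below is an assumed default for illustration only.
def redactionPattern(): Pattern = {
  val pattern = Option(SparkEnv.get)
    .map(_.conf.get("spark.redaction.regex", "(?i)secret|password"))
    .getOrElse("(?i)secret|password")  // used by tests that never set up SparkEnv
  Pattern.compile(pattern)
}
```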

### Why are the changes needed?
A couple of tests throw exceptions even though there is no real problem.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New + additional unit tests.

Closes #25621 from gaborgsomogyi/safe-reduct.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-29 19:17:48 -07:00
Ryan Blue 31b59bd805 [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python if not set
### What changes were proposed in this pull request?

When starting python processes, set `OMP_NUM_THREADS` to the number of cores allocated to an executor or driver if `OMP_NUM_THREADS` is not already set. Each python process will use the same `OMP_NUM_THREADS` setting, even if workers are not shared.

This avoids creating an OpenMP thread pool for parallel processing with a number of threads equal to the number of cores on the executor and [significantly reduces memory consumption](https://github.com/numpy/numpy/issues/10455). Instead, this threadpool should use the number of cores allocated to the executor, if available. If a setting for number of cores is not available, this doesn't change any behavior. OpenMP is used by numpy and pandas.
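A hedged sketch of the idea; the map standing in for the worker environment is hypothetical, as the real change lives where Spark builds the Python worker's process environment:

```scala
import org.apache.spark.SparkConf
import scala.collection.mutable

val conf = new SparkConf().set("spark.executor.cores", "4")
val workerEnv = mutable.Map[String, String]()  // env passed to the python worker

// Only provide a default; a user-set OMP_NUM_THREADS always wins.
if (!workerEnv.contains("OMP_NUM_THREADS")) {
  conf.getOption("spark.executor.cores")
    .foreach(cores => workerEnv("OMP_NUM_THREADS") = cores)
}
```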

### Why are the changes needed?

To reduce memory consumption for PySpark jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Validated this reduces python worker memory consumption by more than 1GB on our cluster.

Closes #25545 from rdblue/SPARK-28843-set-omp-num-cores.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-30 10:29:46 +09:00
Wenchen Fan f8f7c52f12 [SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core
### What changes were proposed in this pull request?

There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them.

After merging, there are 3 classes:
1. `InMemoryTable`
2. `InMemoryTableCatalog`
3. `StagingInMemoryTableCatalog`

For better maintainability, these 3 classes are put in 3 different files.

### Why are the changes needed?

reduce duplicated code

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #25610 from cloud-fan/dsv2-test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-08-29 12:56:19 -07:00
Gabor Somogyi 7d72c073dd [SPARK-28760][SS][TESTS] Add Kafka delegation token end-to-end test with mini KDC
### What changes were proposed in this pull request?
At the moment no end-to-end Kafka delegation token test exists, mainly because an embedded KDC was missing. KDC is missing from the testing side in general, so I explored what possibilities there are. The most obvious choice is the MiniKDC inside the Hadoop library, where Apache Kerby runs in the background. What this PR contains (see the sketch after this list):
* Added MiniKDC as a test dependency from Hadoop
* Added `maven-bundle-plugin` because a couple of dependencies come in bundle format
* Added security mode to `KafkaTestUtils`. Namely start KDC -> start Zookeeper in secure mode -> start Kafka in secure mode
* Added a roundtrip test (saves and reads back data from Kafka)
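A minimal sketch of standing up Hadoop's MiniKdc in a test, assuming the standard `hadoop-minikdc` API (paths and principals are illustrative):

```scala
import java.io.File
import org.apache.hadoop.minikdc.MiniKdc

val workDir = new File("/tmp/minikdc-test")
workDir.mkdirs()
val kdc = new MiniKdc(MiniKdc.createConf(), workDir)
kdc.start()

// Principals + keytab that Zookeeper/Kafka are then configured to use.
val keytab = new File(workDir, "kafka.keytab")
kdc.createPrincipal(keytab, "kafka/localhost", "zookeeper/localhost")

// ... start Zookeeper and Kafka in secure mode against this KDC, run the
// roundtrip test, then tear down:
kdc.stop()
```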

### Why are the changes needed?
No such test exists + security testing with KDC is completely missing.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
I put the additional test in a loop; it took ~10 seconds on average.

Closes #25477 from gaborgsomogyi/SPARK-28760.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-08-29 11:52:35 -07:00
Dilip Biswal fb1053d14a [SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference
### What changes were proposed in this pull request?
Document SHOW DATABASES statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1234" alt="Screen Shot 2019-08-28 at 11 43 36 PM" src="https://user-images.githubusercontent.com/14225158/63916727-dd600380-c9ed-11e9-8372-789110c9d2dc.png">
<img width="1234" alt="Screen Shot 2019-08-28 at 11 43 57 PM" src="https://user-images.githubusercontent.com/14225158/63916734-e0f38a80-c9ed-11e9-8ad4-d854febeaab8.png">
<img width="1234" alt="Screen Shot 2019-08-28 at 11 44 13 PM" src="https://user-images.githubusercontent.com/14225158/63916740-e4871180-c9ed-11e9-9cfc-199cd8a64852.png">

### How was this patch tested?
Tested using jekyll build --serve

Closes #25526 from dilipbiswal/ref-doc-show-db.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-08-29 09:04:27 -07:00