Commit graph

25516 commits

Author SHA1 Message Date
shahid 76d4bebb54 [SPARK-29559][WEBUI] Support pagination for JDBC/ODBC Server page
### What changes were proposed in this pull request?
Supports pagination for the SQL Statistics table in the JDBC/ODBC tab, using the existing Spark pagination framework.

### Why are the changes needed?
It will be easier for users to analyse the table, and it may fix potential issues such as OOM while loading the page, similar to what could occur on the SQL page (refer to https://github.com/apache/spark/pull/22645).

### Does this PR introduce any user-facing change?
There will be no change in the `SQLStatistics` table in the JDBC/ODBC server page except for pagination support.

### How was this patch tested?
Manually verified.

Before PR:
![Screenshot 2019-10-22 at 11 37 29 PM](https://user-images.githubusercontent.com/23054875/67316080-73636680-f525-11e9-91bc-ff7e06e3736d.png)

After PR:

![Screenshot 2019-10-22 at 10 33 00 PM](https://user-images.githubusercontent.com/23054875/67316092-778f8400-f525-11e9-93f8-1e2815abd66f.png)

Closes #26215 from shahidki31/jdbcPagination.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:29:05 -05:00
Pavithra Ramachandran 1ec1b2bd17 [SPARK-28791][DOC] Documentation for Alter table Command
### What changes were proposed in this pull request?
Document ALTER TABLE statement in SQL Reference Guide.

### Why are the changes needed?
Adding documentation for SQL reference.

### Does this PR introduce any user-facing change?
Yes.

Before:
There was no documentation for this.

After:
![1](https://user-images.githubusercontent.com/51401130/65674372-1087c800-e06a-11e9-9155-ac70b419b069.png)
![2](https://user-images.githubusercontent.com/51401130/65674384-14b3e580-e06a-11e9-9c57-bca566dfdbc2.png)
![3](https://user-images.githubusercontent.com/51401130/65674391-18e00300-e06a-11e9-950a-6cc948dedd7d.png)
![4](https://user-images.githubusercontent.com/51401130/65674397-1bdaf380-e06a-11e9-87b0-b1523a745f83.png)
![5](https://user-images.githubusercontent.com/51401130/65674406-209fa780-e06a-11e9-8440-7e8105a77117.png)
![6](https://user-images.githubusercontent.com/51401130/65674417-23020180-e06a-11e9-8fff-30511836bb08.png)

### How was this patch tested?
Used jekyll build and serve to verify.
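
For quick reference, a few of the `ALTER TABLE` forms covered by the new page, run through `spark.sql` (a minimal sketch; the table and property names are hypothetical and the generated reference page is authoritative):

```scala
import org.apache.spark.sql.SparkSession

object AlterTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("alter-table-sketch").getOrCreate()
    // A throwaway table to run the statements against (hypothetical name).
    spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT) USING parquet")
    spark.sql("ALTER TABLE demo ADD COLUMNS (name STRING)")              // add a column
    spark.sql("ALTER TABLE demo SET TBLPROPERTIES ('owner' = 'spark')")  // set a table property
    spark.sql("ALTER TABLE demo RENAME TO demo_renamed")                 // rename the table
    spark.stop()
  }
}
```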

Closes #25590 from PavithraRamachandran/alter_doc.

Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:19:03 -05:00
angerszhu 67cf0433ee [SPARK-29145][SQL] Support sub-queries in join conditions
### What changes were proposed in this pull request?
Support IN/EXISTS subqueries in Spark SQL JOIN conditions.

### Why are the changes needed?
To let SQL users write IN/EXISTS subqueries in JOIN conditions.

### Does this PR introduce any user-facing change?

This PR enables users to use a subquery in a `JOIN`'s ON condition. For example, suppose we have created three tables:
```
CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);
```
We can now run a query like:
```
SELECT A.id  from  A JOIN B ON A.id = B.id and A.id IN (select C.id from C)
```

### How was this patch tested?
Added UT.

Closes #25854 from AngersZhuuuu/SPARK-29145.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-24 21:55:03 +09:00
Pavithra Ramachandran 1296bbb8ac [SPARK-29504][WEBUI] Toggle full job description on click
### What changes were proposed in this pull request?
On clicking a job description on the jobs page, the description was not shown fully.
This adds a handler for the click event on the description so the full text can be toggled.

### Why are the changes needed?
When a job has a long description, it cannot be seen fully in the UI.
The feature was added in https://github.com/apache/spark/pull/24145
but was lost after https://github.com/apache/spark/pull/25374

Before change:
![Screenshot from 2019-10-23 11-23-00](https://user-images.githubusercontent.com/51401130/67361914-827b0080-f587-11e9-9181-d49a6a836046.png)
After change: on double-clicking the description
![Screenshot from 2019-10-23 11-20-02](https://user-images.githubusercontent.com/51401130/67361936-932b7680-f587-11e9-9e59-d290abed4b70.png)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?
Manually tested.

Closes #26222 from PavithraRamachandran/jobs_description_tooltip.

Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-10-24 11:14:31 +02:00
Yuanjian Li 9e77d48315 [SPARK-21492][SQL][FOLLOW UP] Reimplement UnsafeExternalRowSorter in database style iterator
### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.

### Why are the changes needed?
During the work in #26164, after introducing a var `isReleased` in `hasNext`, it is possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is to use a database-style iterator: `advanceNext` and `getRow`.
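
For illustration, a minimal sketch of the database-style iterator pattern referred to above (hypothetical simplified types; Spark's internal `RowIterator` is the real counterpart):

```scala
// Hypothetical, simplified version of the advanceNext()/getRow pattern.
trait RowLikeIterator[T] {
  // Moves to the next row and reports whether one exists; the check and the
  // move happen in a single call, unlike hasNext/next.
  def advanceNext(): Boolean
  // Returns the current row; only valid after advanceNext() returned true.
  def getRow: T
}

// Wrapping a plain Scala iterator in the database-style interface.
class WrappedRowIterator[T](underlying: Iterator[T]) extends RowLikeIterator[T] {
  private var current: T = _
  override def advanceNext(): Boolean =
    if (underlying.hasNext) { current = underlying.next(); true } else false
  override def getRow: T = current
}
```

Because the state transition happens inside `advanceNext`, there is no window between the emptiness check and the fetch in which underlying resources can be released.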

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26229 from xuanyuanking/SPARK-21492-follow-up.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:43:13 +08:00
Liang-Chi Hsieh 177bf672e4 [SPARK-29522][SQL] CACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26179 from viirya/SPARK-29522.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:00:21 +08:00
07ARB 55ced9c148 [SPARK-29571][SQL][TESTS][FOLLOWUP] Fix UT in AllExecutionsPageSuite
### What changes were proposed in this pull request?

This is a follow-up of #24052 to correct assert condition.

### Why are the changes needed?
 To test the IllegalArgumentException condition.

### Does this PR introduce any user-facing change?
 No.

### How was this patch tested?

Manual test (found this issue while fixing SPARK-29453).

Closes #26234 from 07ARB/SPARK-29571.

Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-24 15:57:16 +09:00
DB Tsai fd899d6331 [SPARK-29576][CORE] Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
### What changes were proposed in this pull request?
Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing/decompressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.
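
For illustration, a minimal sketch of the buffering idea (using the zstd-jni stream directly; Spark's `CompressionCodec` performs the equivalent wrapping internally, so this is only to show why buffering matters):

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream}
import com.github.luben.zstd.ZstdOutputStream

object BufferedZstdSketch {
  def main(args: Array[String]): Unit = {
    val sink = new ByteArrayOutputStream()
    // The BufferedOutputStream batches many small writes into larger native
    // (JNI) calls, which is the overhead this change avoids.
    val out = new BufferedOutputStream(new ZstdOutputStream(sink), 32 * 1024)
    (1 to 1000).foreach(i => out.write(i % 256))  // many tiny writes
    out.close()
    println(s"compressed size: ${sink.size()} bytes")
  }
}
```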

### Why are the changes needed?
Faster performance.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #26235 from dbtsai/optimizeDeser.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 18:17:37 -07:00
Dongjoon Hyun 7ecf968527 [SPARK-29567][TESTS] Update JDBC Integration Test Docker Images
### What changes were proposed in this pull request?

This PR updates JDBC Integration Test DBMS Docker Images.

| DBMS | Docker Image Tag | Release |
| ------ | ------------------ | ------ |
| MySQL | mysql:5.7.28 | Oct 13, 2019 |
| PostgreSQL | postgres:12.0-alpine | Oct 3, 2019 |

* For `MySQL`, `SET GLOBAL sql_mode = ''` is added to disable all strict modes, because `test("Basic write test")` creates a table like the following and the latest MySQL rejects `0000-00-00 00:00:00` as a TIMESTAMP value, causing the test case to fail.
```
mysql> desc datescopy;
+-------+-----------+------+-----+---------------------+-----------------------------+
| Field | Type      | Null | Key | Default             | Extra                       |
+-------+-----------+------+-----+---------------------+-----------------------------+
| d     | date      | YES  |     | NULL                |                             |
| t     | timestamp | NO   |     | CURRENT_TIMESTAMP   | on update CURRENT_TIMESTAMP |
| dt    | timestamp | NO   |     | 0000-00-00 00:00:00 |                             |
| ts    | timestamp | NO   |     | 0000-00-00 00:00:00 |                             |
| yr    | date      | YES  |     | NULL                |                             |
+-------+-----------+------+-----+---------------------+-----------------------------+
```
* For `PostgreSQL`, I chose the smallest image among the `12` releases. It reduces the image size a lot, `312MB` -> `72.8MB`, which is good for the CI testing environment.
```
$ docker images | grep postgres
postgres      12.0-alpine          5b681acb1cfc        2 days ago          72.8MB
postgres      11.4                 53912975086f        3 months ago        312MB
```

Note that
- For `MsSqlServer`, we are using `2017-GA-ubuntu` and the next version `2019-CTP3.2-ubuntu` is still `Community Technology Preview` status.
- For `DB2` and `Oracle`, the official images are not available.

### Why are the changes needed?

This is to make sure we are testing with the latest DBMS images while preparing `3.0.0`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Since this is the integration test, we need to run this manually.
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #26224 from dongjoon-hyun/SPARK-29567.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 16:44:10 -07:00
Dongjoon Hyun b91356e4c2 [SPARK-29533][SQL][TESTS][FOLLOWUP] Regenerate the result on EC2
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/26189 to regenerate the result on EC2.

### Why are the changes needed?

This will be used for the other PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26233 from dongjoon-hyun/SPARK-29533.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-10-23 21:41:05 +00:00
Luca Canali 5867707835 [SPARK-29557][BUILD] Update dropwizard/codahale metrics library to 3.2.6
### What changes were proposed in this pull request?
This proposes to update the dropwizard/codahale metrics library version used by Spark to `3.2.6` which is the last version supporting Ganglia.

### Why are the changes needed?
Spark is currently using Dropwizard metrics version 3.1.5, a version that is no longer actively developed nor maintained, according to the project's GitHub repo README.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests + manual tests on a YARN cluster.

Closes #26212 from LucaCanali/updateDropwizardVersion.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 10:45:11 -07:00
jiake 7e8e4c0a14 [SPARK-29552][SQL] Execute the "OptimizeLocalShuffleReader" rule when creating new query stage and then can optimize the shuffle reader to local shuffle reader as much as possible
### What changes were proposed in this pull request?
`OptimizeLocalShuffleReader` rule is very conservative and gives up optimization as long as there are extra shuffles introduced. It's very likely that most of the added local shuffle readers are fine and only one introduces extra shuffle.

However, it's very hard to make `OptimizeLocalShuffleReader` optimal; a simple workaround is to run this rule again right before executing a query stage.

### Why are the changes needed?
Optimize more shuffle readers into local shuffle readers.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #26207 from JkSelf/resolve-multi-joins-issue.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 01:18:07 +08:00
Jungtaek Lim (HeartSaVioR) bfbf2821f3 [SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe
### What changes were proposed in this pull request?

There's a case where MapObjects has a lambda function which creates a nested struct - unsafe data inside a safe data struct. In this case, MapObjects doesn't copy the row returned from the lambda function (as the outermost data type is a safe data struct), so the nested unsafe data is not copied.

The culprit is that `UnsafeProjection.toUnsafeExprs` converts `CreateNamedStruct` to `CreateNamedStructUnsafe` (this is the only place where `CreateNamedStructUnsafe` is used), which causes safe and unsafe data to be mixed up temporarily. This may not be needed at all, at least logically, as these evaluations are finally assembled into an `UnsafeRow`.

> Before the patch

```
/* 105 */   private ArrayData MapObjects_0(InternalRow i) {
/* 106 */     boolean isNull_1 = i.isNullAt(0);
/* 107 */     ArrayData value_1 = isNull_1 ?
/* 108 */     null : (i.getArray(0));
/* 109 */     ArrayData value_0 = null;
/* 110 */
/* 111 */     if (!isNull_1) {
/* 112 */
/* 113 */       int dataLength_0 = value_1.numElements();
/* 114 */
/* 115 */       ArrayData[] convertedArray_0 = null;
/* 116 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 117 */
/* 118 */
/* 119 */       int loopIndex_0 = 0;
/* 120 */
/* 121 */       while (loopIndex_0 < dataLength_0) {
/* 122 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 123 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 124 */
/* 125 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 126 */           -1, 1L, " createArray failed.");
/* 127 */
/* 128 */         mutableStateArray_0[0].reset();
/* 129 */
/* 130 */
/* 131 */         mutableStateArray_0[0].zeroOutNullBytes();
/* 132 */
/* 133 */
/* 134 */         if (isNull_MapObject_lambda_variable_1) {
/* 135 */           mutableStateArray_0[0].setNullAt(0);
/* 136 */         } else {
/* 137 */           mutableStateArray_0[0].write(0, value_MapObject_lambda_variable_1);
/* 138 */         }
/* 139 */         arrayData_0.update(0, (mutableStateArray_0[0].getRow()));
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 144 */         }
/* 145 */
/* 146 */         loopIndex_0 += 1;
/* 147 */       }
/* 148 */
/* 149 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 150 */     }
/* 151 */     globalIsNull_0 = isNull_1;
/* 152 */     return value_0;
/* 153 */   }
```

> After the patch

```
/* 104 */   private ArrayData MapObjects_0(InternalRow i) {
/* 105 */     boolean isNull_1 = i.isNullAt(0);
/* 106 */     ArrayData value_1 = isNull_1 ?
/* 107 */     null : (i.getArray(0));
/* 108 */     ArrayData value_0 = null;
/* 109 */
/* 110 */     if (!isNull_1) {
/* 111 */
/* 112 */       int dataLength_0 = value_1.numElements();
/* 113 */
/* 114 */       ArrayData[] convertedArray_0 = null;
/* 115 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 116 */
/* 117 */
/* 118 */       int loopIndex_0 = 0;
/* 119 */
/* 120 */       while (loopIndex_0 < dataLength_0) {
/* 121 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 122 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 123 */
/* 124 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 125 */           -1, 1L, " createArray failed.");
/* 126 */
/* 127 */         Object[] values_0 = new Object[1];
/* 128 */
/* 129 */
/* 130 */         if (isNull_MapObject_lambda_variable_1) {
/* 131 */           values_0[0] = null;
/* 132 */         } else {
/* 133 */           values_0[0] = value_MapObject_lambda_variable_1;
/* 134 */         }
/* 135 */
/* 136 */         final InternalRow value_3 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(values_0);
/* 137 */         values_0 = null;
/* 138 */         arrayData_0.update(0, value_3);
/* 139 */         if (false) {
/* 140 */           convertedArray_0[loopIndex_0] = null;
/* 141 */         } else {
/* 142 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 143 */         }
/* 144 */
/* 145 */         loopIndex_0 += 1;
/* 146 */       }
/* 147 */
/* 148 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 149 */     }
/* 150 */     globalIsNull_0 = isNull_1;
/* 151 */     return value_0;
/* 152 */   }
```

### Why are the changes needed?

This patch fixes the bug described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

A UT was added which fails on the master branch and passes with this PR.

Closes #26173 from HeartSaVioR/SPARK-29503.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 00:41:48 +08:00
Terry Kim 53a5f17803 [SPARK-29513][SQL] REFRESH TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
REFRESH TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26183 from imback82/refresh_table.

Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-23 08:26:47 -07:00
HyukjinKwon df00b5c17d [SPARK-29569][BUILD][DOCS] Copy and paste minified jquery instead when post-processing badges in JavaDoc
### What changes were proposed in this pull request?

This PR fixes our documentation build to copy minified jquery file instead.

The original file `jquery.js` seems to be missing as of the Scala 2.12 upgrade. Scala 2.12 seems to have started using the minified `jquery.min.js` instead.

Since we dropped Scala 2.11, we won't have to take care about legacy `jquery.js` anymore.

Note that there seem to be multiple oddities in the current ScalaDoc (e.g., some pages are weird or start from `scala.collection.*`, some pages are missing, some docs are truncated, and some badges look missing). These need a separate double check and investigation.

This PR targets to make the documentation generation pass in order to unblock Spark 3.0 preview.

### Why are the changes needed?

To fix our official documentation build so that it can run.

### Does this PR introduce any user-facing change?

It enables building the documentation in our official way.

**Before:**

```
Making directory api/scala
cp -r ../target/scala-2.12/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Updating JavaDoc files for badge post-processing
Copying jquery.js from Scala API to Java API for page post-processing of badges
jekyll 3.8.6 | Error:  No such file or directory  rb_sysopen - ./api/scala/lib/jquery.js
```

**After:**

```
Making directory api/scala
cp -r ../target/scala-2.12/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Updating JavaDoc files for badge post-processing
Copying jquery.min.js from Scala API to Java API for page post-processing of badges
Copying api_javadocs.js to Java API for page post-processing of badges
Appending content of api-javadocs.css to JavaDoc stylesheet.css for badge styles
...
```

### How was this patch tested?

Manually tested via:

```
SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Closes #26228 from HyukjinKwon/SPARK-29569.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-23 15:23:25 +02:00
Xianyang Liu 0a7095156b [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier
### What changes were proposed in this pull request?

Add mapPartitionsWithIndex for RDDBarrier.

### Why are the changes needed?

There is only one method in `RDDBarrier`. We often use the partition index as a label for the current partition, but with `mapPartitions` we have to get the index from the `TaskContext`, which is not convenient (see the usage sketch below).
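
A minimal usage sketch of the new method (assuming it mirrors `RDD.mapPartitionsWithIndex`; run with enough local cores for the barrier stage):

```scala
import org.apache.spark.sql.SparkSession

object BarrierIndexSketch {
  def main(args: Array[String]): Unit = {
    // 4 local cores for 4 partitions, since barrier tasks launch together.
    val spark = SparkSession.builder().master("local[4]").appName("barrier-index-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
    val tagged = rdd.barrier().mapPartitionsWithIndex { (index, iter) =>
      // The partition index is passed in directly instead of being read from
      // TaskContext inside mapPartitions.
      iter.map(value => (index, value))
    }
    tagged.collect().take(8).foreach(println)
    spark.stop()
  }
}
```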
### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New UT.

Closes #26148 from ConeyLiu/barrier-index.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-23 13:46:09 +02:00
turbofei 70dd9c0cab [SPARK-29542][SQL][DOC] Make the descriptions of spark.sql.files.* be clearly
### What changes were proposed in this pull request?
As described in [SPARK-29542](https://issues.apache.org/jira/browse/SPARK-29542), the descriptions of `spark.sql.files.*` are confusing.
In this PR, I make their descriptions clear.

### Why are the changes needed?
It makes the descriptions of `spark.sql.files.*` clear.
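
For reference, a small sketch of setting two of these options on a session (the values shown are illustrative; these options apply only to file-based data sources such as Parquet, ORC, JSON and CSV):

```scala
import org.apache.spark.sql.SparkSession

object FilesConfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("files-conf-sketch")
      // Maximum number of bytes packed into a single partition when reading files.
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      // Estimated cost, in bytes, of opening a file; used when packing files into partitions.
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
      .getOrCreate()
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))
    spark.stop()
  }
}
```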
### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26200 from turboFei/SPARK-29542-partition-maxSize.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-23 20:31:06 +09:00
Burak Yavuz cbe6eadc0c [SPARK-29352][SQL][SS] Track active streaming queries in the SparkSession.sharedState
### What changes were proposed in this pull request?

This moves the tracking of active queries from per-SparkSession state to the SparkSession's SharedState, for better safety in isolated SparkSession environments.

### Why are the changes needed?

We have checks to prevent restarting the same stream on the same Spark session, but we can make that better in multi-tenant environments by putting that state in the SharedState instead of the SessionState. This allows a more comprehensive check for multi-tenant clusters.
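
A minimal sketch of why the SharedState is the right place for this (assuming the `sharedState`/`sessionState` accessors, which are unstable developer APIs, remain visible): sessions cloned via `newSession()` get their own SessionState but share one SharedState, so bookkeeping placed there is seen by all of them.

```scala
import org.apache.spark.sql.SparkSession

object SharedStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("shared-state-sketch").getOrCreate()
    val isolated = spark.newSession()  // separate SessionState, same SparkContext
    // Per-session state differs, but the shared state is one object, so a
    // registry of active streaming queries kept there covers both sessions.
    assert(spark.sessionState ne isolated.sessionState)
    assert(spark.sharedState eq isolated.sharedState)
    spark.stop()
  }
}
```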

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added tests to StreamingQueryManagerSuite

Closes #26018 from brkyvz/sharedStreamingQueryManager.

Lead-authored-by: Burak Yavuz <burak@databricks.com>
Co-authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-10-23 10:56:19 +02:00
Dongjoon Hyun 8c3469009c [SPARK-29546][TESTS] Recover jersey-guava test dependency in docker-integration-tests
### What changes were proposed in this pull request?

Currently, `docker-integration-tests` is broken in both JDK8/11.
This PR aims to recover JDBC integration test for JDK8/11.

### Why are the changes needed?

While SPARK-28737 upgraded `Jersey` to 2.29 for JDK11, `docker-integration-tests` is broken because `com.spotify.docker-client` still depends on `jersey-guava`. The latest `com.spotify.docker-client` also has this problem.

- https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2
  -> https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-client/2.19
    -> https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-common/2.19
      -> https://mvnrepository.com/artifact/org.glassfish.jersey.bundles.repackaged/jersey-guava/2.19

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual because this is an integration test suite.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)

$ build/mvn install -DskipTests

$ build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

```
$ java -version
openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.5+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.5+10, mixed mode)

$ build/mvn install -DskipTests

$ build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

**BEFORE**
```
*** RUN ABORTED ***
  com.spotify.docker.client.exceptions.DockerException: java.util.concurrent.ExecutionException: javax.ws.rs.ProcessingException: java.lang.NoClassDefFoundError: jersey/repackaged/com/google/common/util/concurrent/MoreExecutors
  at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1607)
  at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1538)
  at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:387)
  at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:81)
```

**AFTER**
```
Run completed in 47 seconds, 999 milliseconds.
Total number of tests run: 30
Suites: completed 6, aborted 0
Tests: succeeded 30, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
```

Closes #26203 from dongjoon-hyun/SPARK-29546.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-23 00:14:43 -07:00
Terry Kim c128ac564d [SPARK-29511][SQL] DataSourceV2: Support CREATE NAMESPACE
### What changes were proposed in this pull request?

This PR adds `CREATE NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot explicitly create namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26166 from imback82/create_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-23 12:17:20 +08:00
DylanGuedes e6749092f7 [SPARK-29107][SQL][TESTS] Port window.sql (Part 1)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 1~319

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?

To ensure compatibility with PostgreSQL.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins, and comparison with PgSQL results.

Closes #26119 from DylanGuedes/spark-29107.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-23 10:24:38 +09:00
Dongjoon Hyun f23c5d7f67 [SPARK-29560][BUILD] Add typesafe bintray repo for sbt-mima-plugin
### What changes were proposed in this pull request?

This adds the `typesafe` bintray repo for `sbt-mima-plugin`.

### Why are the changes needed?

Since Oct 21, the following plugin causes [Jenkins failures](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/611/console) due to the missing jar.

- `branch-2.4`: `sbt-mima-plugin:0.1.17` is missing.
- `master`: `sbt-mima-plugin:0.3.0` is missing.

These versions of `sbt-mima-plugin` seem to have been removed from the old repo.

```
$ rm -rf ~/.ivy2/

$ build/sbt scalastyle test:scalastyle
...
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: com.typesafe#sbt-mima-plugin;0.1.17: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Check the `GitHub Action` linter result; this PR should pass. Or check manually.
(Note that the Jenkins PR builder hasn't failed so far due to the local cache.)

Closes #26217 from dongjoon-hyun/SPARK-29560.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-22 16:30:29 -07:00
Huaxin Gao 3bf5355e24 [SPARK-29539][SQL] SHOW PARTITIONS should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26198 from huaxingao/spark-29539.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-22 14:47:17 -07:00
Sean Owen 80094688fd [SPARK-29556][CORE] Avoid putting request path in error response in ErrorServlet
### What changes were proposed in this pull request?

Don't include `$path` from user query in the error response.

### Why are the changes needed?

The path could contain input that is then rendered as HTML in the error response. It's not clear whether it's exploitable, but better safe than sorry as the path info really isn't that important in this context.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26211 from srowen/SPARK-29556.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-22 14:14:59 -07:00
chenjuanni 2036a8cca7 [SPARK-29488][WEBUI] In Web UI, stage page has js error when sort table
### What changes were proposed in this pull request?
In the Web UI, the stage page has a JS error when sorting a table.
https://issues.apache.org/jira/browse/SPARK-29488

### Why are the changes needed?
In the Web UI, following the steps below produces the JS error "Uncaught TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of type 'Node'.".
 1) Click "Summary Metrics..." 's tablehead "Min"
 2) Click "Aggregated Metrics by Executor" 's tablehead "Task Time"
 3) Click "Summary Metrics..." 's tablehead "Min"(the same as step 1.)

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
In the Web UI, following the steps below, no error occurs.
 1) Click "Summary Metrics..." 's tablehead "Min"
 2) Click "Aggregated Metrics by Executor" 's tablehead "Task Time"
 3) Click "Summary Metrics..." 's tablehead "Min"(the same as step 1.)
![image](https://user-images.githubusercontent.com/7802338/66899878-464b1b80-f02e-11e9-9660-6cdaab283491.png)

Closes #26136 from cjn082030/SPARK-1.

Authored-by: chenjuanni <chenjuanni@inspur.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-22 08:58:12 -05:00
Dilip Biswal c1c64851ed [SPARK-28793][DOC][SQL] Document CREATE FUNCTION in SQL Reference
### What changes were proposed in this pull request?
Document CREATE FUNCTION statement in SQL Reference Guide.

### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing that issue.
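
For quick reference, the general shape of the statement being documented (a sketch; the class and jar names are hypothetical placeholders, so the jar must exist for the statement to succeed):

```scala
import org.apache.spark.sql.SparkSession

object CreateFunctionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("create-function-sketch").getOrCreate()
    // Register a temporary function backed by a UDF class shipped in a jar
    // (hypothetical class and path).
    spark.sql(
      """CREATE TEMPORARY FUNCTION simple_udf AS 'com.example.SimpleUdf'
        |USING JAR '/tmp/SimpleUdf.jar'""".stripMargin)
    spark.sql("SELECT simple_udf(col1) FROM VALUES (1), (2) AS t(col1)").show()
    spark.stop()
  }
}
```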

### Does this PR introduce any user-facing change?
Yes.

**Before:**
There was no documentation for this.

**After:**
<img width="1260" alt="Screen Shot 2019-09-22 at 3 01 52 PM" src="https://user-images.githubusercontent.com/14225158/65395036-5bdc6680-dd4a-11e9-9873-0a1da88706a8.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 02 11 PM" src="https://user-images.githubusercontent.com/14225158/65395037-5bdc6680-dd4a-11e9-964f-c02d23803b68.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 02 39 PM" src="https://user-images.githubusercontent.com/14225158/65395038-5bdc6680-dd4a-11e9-831b-6ba1d041893d.png">
<img width="1260" alt="Screen Shot 2019-09-22 at 3 04 04 PM" src="https://user-images.githubusercontent.com/14225158/65395040-5bdc6680-dd4a-11e9-8226-250f77dfeaf3.png">

### How was this patch tested?
Tested using jekyll build --serve.

Closes #25894 from dilipbiswal/sql-ref-create-function.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-22 08:56:44 -05:00
Huaxin Gao 877993847c [SPARK-28787][DOC][SQL] Document LOAD DATA statement in SQL Reference
### What changes were proposed in this pull request?
Document LOAD DATA statement in SQL Reference

### Why are the changes needed?
To complete the SQL Reference
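
For quick reference, the general shape of the statement being documented (a sketch; the path and table names are hypothetical placeholders, and LOAD DATA targets Hive-format tables, hence `enableHiveSupport()`):

```scala
import org.apache.spark.sql.SparkSession

object LoadDataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("load-data-sketch")
      .enableHiveSupport()  // LOAD DATA works against Hive-format tables
      .getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS students (name STRING, id INT) USING hive")
    // Load a local file into the table, replacing existing contents.
    spark.sql("LOAD DATA LOCAL INPATH '/tmp/students.txt' OVERWRITE INTO TABLE students")
    spark.stop()
  }
}
```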

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Tested using jekyll build --serve.

Here are the screenshots:

![image](https://user-images.githubusercontent.com/13592258/64073167-e7cd0800-cc4e-11e9-9fcc-92fe4cb5a942.png)

![image](https://user-images.githubusercontent.com/13592258/64073169-ee5b7f80-cc4e-11e9-9a36-cc023bcd32b1.png)

![image](https://user-images.githubusercontent.com/13592258/64073170-f4516080-cc4e-11e9-9101-2609a01fe6fe.png)

Closes #25522 from huaxingao/spark-28787.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-22 08:55:37 -05:00
Liang-Chi Hsieh b4844eea1f [SPARK-29517][SQL] TRUNCATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
TRUNCATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26174 from viirya/SPARK-29517.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 19:17:28 +08:00
Yuanjian Li bb49c80c89 [SPARK-21492][SQL] Fix memory leak in SortMergeJoin
### What changes were proposed in this pull request?
We shall have a new mechanism by which downstream operators may notify their parents that they may release the output data stream. In this PR, we implement the mechanism as below (a simplified sketch follows the list):
- Add a function named `cleanupResources` to SparkPlan, which by default calls the children's `cleanupResources`; an operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, like SortExec in this PR.
- Add support on the trigger side, in this PR SortMergeJoinExec, which calls `cleanupResources` to do the cleanup job for all of its upstream (children) operators.
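
A simplified sketch of this pattern with hypothetical plan-node types (Spark's real `SparkPlan`, `SortExec` and `SortMergeJoinExec` are internal classes, so this only illustrates the call structure):

```scala
// Hypothetical, simplified plan nodes illustrating the cleanup propagation.
abstract class PlanNode {
  def children: Seq[PlanNode] = Seq.empty
  // Default: just forward the cleanup request to all children.
  def cleanupResources(): Unit = children.foreach(_.cleanupResources())
}

class SortNode(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)
  override def cleanupResources(): Unit = {
    println("releasing the external sorter's buffers")  // operator-specific cleanup
    super.cleanupResources()                            // then clean up the subtree
  }
}

class SortMergeJoinNode(left: PlanNode, right: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(left, right)
  // The join is the trigger side: once its inputs are exhausted it asks its
  // upstream (children) operators to release their resources early.
  def onInputExhausted(): Unit = cleanupResources()
}
```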

### Why are the changes needed?
Fixes the SortMergeJoin memory leak and implements a general framework for SparkPlan resource cleanup.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT: Added a new test suite, JoinWithResourceCleanSuite, to check both the standard and code-generation scenarios.

Integration test: tested with driver/executor default memory set to 1g, local mode with 10 threads. The test below (thanks to taosaildrone for providing it [here](https://github.com/apache/spark/pull/23762#issuecomment-463303175)) passes with this PR.

```
from pyspark.sql.functions import rand, col

spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true")

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```

Closes #26164 from xuanyuanking/SPARK-21492.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 19:08:09 +08:00
Yuming Wang 3163b6b43b [SPARK-29516][SQL][TEST] Test ThriftServerQueryTestSuite asynchronously
### What changes were proposed in this pull request?
This PR tests `ThriftServerQueryTestSuite` in an asynchronous way.

### Why are the changes needed?
The default value of `spark.sql.hive.thriftServer.async` is `true`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
```
build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite" -Phive-thriftserver
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver
```

Closes #26172 from wangyum/SPARK-29516.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-22 03:20:49 -07:00
Huaxin Gao 868d851dac [SPARK-29232][ML] Update the parameter maps of the DecisionTreeRegression/Classification Models
### What changes were proposed in this pull request?
The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` only contain the default parameter values, so we need to update the parameter maps for these trees.
The same issue exists in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```.

### Why are the changes needed?
Users want to access each individual tree and build the forest back up from the trees. This doesn't work because the trees don't have the correct parameter values.
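
A small sketch of the user scenario, checking the per-tree parameter maps after fitting (the training data here is a toy placeholder):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object TreeParamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tree-param-sketch").getOrCreate()
    import spark.implicits._

    // Toy training data.
    val data = Seq((1.0, Vectors.dense(0.0)), (2.0, Vectors.dense(1.0)), (3.0, Vectors.dense(2.0)))
      .toDF("label", "features")

    val model = new RandomForestRegressor().setNumTrees(2).setMaxDepth(3).fit(data)

    // With this change, each individual tree carries the ensemble's parameter
    // values (e.g. maxDepth = 3) instead of only the defaults.
    model.trees.foreach(tree => println(tree.extractParamMap()))
    spark.stop()
  }
}
```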

### Does this PR introduce any user-facing change?
Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values.

### How was this patch tested?
Add tests

Closes #26154 from huaxingao/spark-29232.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-22 17:49:44 +08:00
HyukjinKwon 811d563fbf [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8
### What changes were proposed in this pull request?

Inline cloudpickle in PySpark to cloudpickle 1.1.1. See https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py

https://github.com/cloudpipe/cloudpickle/pull/269 was added for Python 3.8 support (fixed as of 1.1.0). Using 1.2.2 seems to break PyPy 2 due to cloudpipe/cloudpickle#278, so this PR currently uses 1.1.1.

Once we drop Python 2, we can switch to the highest version.

### Why are the changes needed?

Positional-only arguments were newly introduced in Python 3.8 (see https://docs.python.org/3/whatsnew/3.8.html#positional-only-parameters).

Particularly the newly added argument to `types.CodeType` was the problem (https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-python-api):

> `types.CodeType` has a new parameter in the second position of the constructor (posonlyargcount) to support positional-only arguments defined in **PEP 570**. The first argument (argcount) now represents the total number of positional arguments (including positional-only arguments). The new `replace()` method of `types.CodeType` can be used to make the code future-proof.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested. Note that the optional dependency PyArrow does not appear to support Python 3.8 yet; therefore, it was not tested. See "Details" below.

<details>
<p>

```bash
cd python
./run-tests --python-executables=python3.8
```

```
Running PySpark tests. Output is in /Users/hyukjin.kwon/workspace/forked/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.8']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Starting test(python3.8): pyspark.ml.tests.test_algorithms
Starting test(python3.8): pyspark.ml.tests.test_feature
Starting test(python3.8): pyspark.ml.tests.test_base
Starting test(python3.8): pyspark.ml.tests.test_evaluation
Finished test(python3.8): pyspark.ml.tests.test_base (12s)
Starting test(python3.8): pyspark.ml.tests.test_image
Finished test(python3.8): pyspark.ml.tests.test_evaluation (14s)
Starting test(python3.8): pyspark.ml.tests.test_linalg
Finished test(python3.8): pyspark.ml.tests.test_feature (23s)
Starting test(python3.8): pyspark.ml.tests.test_param
Finished test(python3.8): pyspark.ml.tests.test_image (22s)
Starting test(python3.8): pyspark.ml.tests.test_persistence
Finished test(python3.8): pyspark.ml.tests.test_param (25s)
Starting test(python3.8): pyspark.ml.tests.test_pipeline
Finished test(python3.8): pyspark.ml.tests.test_linalg (37s)
Starting test(python3.8): pyspark.ml.tests.test_stat
Finished test(python3.8): pyspark.ml.tests.test_pipeline (7s)
Starting test(python3.8): pyspark.ml.tests.test_training_summary
Finished test(python3.8): pyspark.ml.tests.test_stat (21s)
Starting test(python3.8): pyspark.ml.tests.test_tuning
Finished test(python3.8): pyspark.ml.tests.test_persistence (45s)
Starting test(python3.8): pyspark.ml.tests.test_wrapper
Finished test(python3.8): pyspark.ml.tests.test_algorithms (83s)
Starting test(python3.8): pyspark.mllib.tests.test_algorithms
Finished test(python3.8): pyspark.ml.tests.test_training_summary (32s)
Starting test(python3.8): pyspark.mllib.tests.test_feature
Finished test(python3.8): pyspark.ml.tests.test_wrapper (20s)
Starting test(python3.8): pyspark.mllib.tests.test_linalg
Finished test(python3.8): pyspark.mllib.tests.test_feature (32s)
Starting test(python3.8): pyspark.mllib.tests.test_stat
Finished test(python3.8): pyspark.mllib.tests.test_algorithms (70s)
Starting test(python3.8): pyspark.mllib.tests.test_streaming_algorithms
Finished test(python3.8): pyspark.mllib.tests.test_stat (37s)
Starting test(python3.8): pyspark.mllib.tests.test_util
Finished test(python3.8): pyspark.mllib.tests.test_linalg (70s)
Starting test(python3.8): pyspark.sql.tests.test_arrow
Finished test(python3.8): pyspark.sql.tests.test_arrow (1s) ... 53 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_catalog
Finished test(python3.8): pyspark.mllib.tests.test_util (15s)
Starting test(python3.8): pyspark.sql.tests.test_column
Finished test(python3.8): pyspark.sql.tests.test_catalog (24s)
Starting test(python3.8): pyspark.sql.tests.test_conf
Finished test(python3.8): pyspark.sql.tests.test_column (21s)
Starting test(python3.8): pyspark.sql.tests.test_context
Finished test(python3.8): pyspark.ml.tests.test_tuning (125s)
Starting test(python3.8): pyspark.sql.tests.test_dataframe
Finished test(python3.8): pyspark.sql.tests.test_conf (9s)
Starting test(python3.8): pyspark.sql.tests.test_datasources
Finished test(python3.8): pyspark.sql.tests.test_context (29s)
Starting test(python3.8): pyspark.sql.tests.test_functions
Finished test(python3.8): pyspark.sql.tests.test_datasources (32s)
Starting test(python3.8): pyspark.sql.tests.test_group
Finished test(python3.8): pyspark.sql.tests.test_dataframe (39s) ... 3 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf (1s) ... 6 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map (0s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg (1s) ... 15 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map (1s) ... 20 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar (1s) ... 49 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_window
Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_window (1s) ... 14 tests were skipped
Starting test(python3.8): pyspark.sql.tests.test_readwriter
Finished test(python3.8): pyspark.sql.tests.test_functions (29s)
Starting test(python3.8): pyspark.sql.tests.test_serde
Finished test(python3.8): pyspark.sql.tests.test_group (20s)
Starting test(python3.8): pyspark.sql.tests.test_session
Finished test(python3.8): pyspark.mllib.tests.test_streaming_algorithms (126s)
Starting test(python3.8): pyspark.sql.tests.test_streaming
Finished test(python3.8): pyspark.sql.tests.test_serde (25s)
Starting test(python3.8): pyspark.sql.tests.test_types
Finished test(python3.8): pyspark.sql.tests.test_readwriter (38s)
Starting test(python3.8): pyspark.sql.tests.test_udf
Finished test(python3.8): pyspark.sql.tests.test_session (32s)
Starting test(python3.8): pyspark.sql.tests.test_utils
Finished test(python3.8): pyspark.sql.tests.test_utils (17s)
Starting test(python3.8): pyspark.streaming.tests.test_context
Finished test(python3.8): pyspark.sql.tests.test_types (45s)
Starting test(python3.8): pyspark.streaming.tests.test_dstream
Finished test(python3.8): pyspark.sql.tests.test_udf (44s)
Starting test(python3.8): pyspark.streaming.tests.test_kinesis
Finished test(python3.8): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
Starting test(python3.8): pyspark.streaming.tests.test_listener
Finished test(python3.8): pyspark.streaming.tests.test_context (28s)
Starting test(python3.8): pyspark.tests.test_appsubmit
Finished test(python3.8): pyspark.sql.tests.test_streaming (60s)
Starting test(python3.8): pyspark.tests.test_broadcast
Finished test(python3.8): pyspark.streaming.tests.test_listener (11s)
Starting test(python3.8): pyspark.tests.test_conf
Finished test(python3.8): pyspark.tests.test_conf (17s)
Starting test(python3.8): pyspark.tests.test_context
Finished test(python3.8): pyspark.tests.test_broadcast (39s)
Starting test(python3.8): pyspark.tests.test_daemon
Finished test(python3.8): pyspark.tests.test_daemon (5s)
Starting test(python3.8): pyspark.tests.test_join
Finished test(python3.8): pyspark.tests.test_context (31s)
Starting test(python3.8): pyspark.tests.test_profiler
Finished test(python3.8): pyspark.tests.test_join (9s)
Starting test(python3.8): pyspark.tests.test_rdd
Finished test(python3.8): pyspark.tests.test_profiler (12s)
Starting test(python3.8): pyspark.tests.test_readwrite
Finished test(python3.8): pyspark.tests.test_readwrite (23s) ... 3 tests were skipped
Starting test(python3.8): pyspark.tests.test_serializers
Finished test(python3.8): pyspark.tests.test_appsubmit (94s)
Starting test(python3.8): pyspark.tests.test_shuffle
Finished test(python3.8): pyspark.streaming.tests.test_dstream (110s)
Starting test(python3.8): pyspark.tests.test_taskcontext
Finished test(python3.8): pyspark.tests.test_rdd (42s)
Starting test(python3.8): pyspark.tests.test_util
Finished test(python3.8): pyspark.tests.test_serializers (11s)
Starting test(python3.8): pyspark.tests.test_worker
Finished test(python3.8): pyspark.tests.test_shuffle (12s)
Starting test(python3.8): pyspark.accumulators
Finished test(python3.8): pyspark.tests.test_util (7s)
Starting test(python3.8): pyspark.broadcast
Finished test(python3.8): pyspark.accumulators (8s)
Starting test(python3.8): pyspark.conf
Finished test(python3.8): pyspark.broadcast (8s)
Starting test(python3.8): pyspark.context
Finished test(python3.8): pyspark.tests.test_worker (19s)
Starting test(python3.8): pyspark.ml.classification
Finished test(python3.8): pyspark.conf (4s)
Starting test(python3.8): pyspark.ml.clustering
Finished test(python3.8): pyspark.context (22s)
Starting test(python3.8): pyspark.ml.evaluation
Finished test(python3.8): pyspark.tests.test_taskcontext (49s)
Starting test(python3.8): pyspark.ml.feature
Finished test(python3.8): pyspark.ml.clustering (43s)
Starting test(python3.8): pyspark.ml.fpm
Finished test(python3.8): pyspark.ml.evaluation (27s)
Starting test(python3.8): pyspark.ml.image
Finished test(python3.8): pyspark.ml.image (8s)
Starting test(python3.8): pyspark.ml.linalg.__init__
Finished test(python3.8): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.8): pyspark.ml.recommendation
Finished test(python3.8): pyspark.ml.classification (63s)
Starting test(python3.8): pyspark.ml.regression
Finished test(python3.8): pyspark.ml.fpm (23s)
Starting test(python3.8): pyspark.ml.stat
Finished test(python3.8): pyspark.ml.stat (30s)
Starting test(python3.8): pyspark.ml.tuning
Finished test(python3.8): pyspark.ml.regression (51s)
Starting test(python3.8): pyspark.mllib.classification
Finished test(python3.8): pyspark.ml.feature (93s)
Starting test(python3.8): pyspark.mllib.clustering
Finished test(python3.8): pyspark.ml.tuning (39s)
Starting test(python3.8): pyspark.mllib.evaluation
Finished test(python3.8): pyspark.mllib.classification (38s)
Starting test(python3.8): pyspark.mllib.feature
Finished test(python3.8): pyspark.mllib.evaluation (25s)
Starting test(python3.8): pyspark.mllib.fpm
Finished test(python3.8): pyspark.mllib.clustering (64s)
Starting test(python3.8): pyspark.mllib.linalg.__init__
Finished test(python3.8): pyspark.ml.recommendation (131s)
Starting test(python3.8): pyspark.mllib.linalg.distributed
Finished test(python3.8): pyspark.mllib.linalg.__init__ (0s)
Starting test(python3.8): pyspark.mllib.random
Finished test(python3.8): pyspark.mllib.feature (36s)
Starting test(python3.8): pyspark.mllib.recommendation
Finished test(python3.8): pyspark.mllib.fpm (31s)
Starting test(python3.8): pyspark.mllib.regression
Finished test(python3.8): pyspark.mllib.random (16s)
Starting test(python3.8): pyspark.mllib.stat.KernelDensity
Finished test(python3.8): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.8): pyspark.mllib.stat._statistics
Finished test(python3.8): pyspark.mllib.stat._statistics (25s)
Starting test(python3.8): pyspark.mllib.tree
Finished test(python3.8): pyspark.mllib.regression (44s)
Starting test(python3.8): pyspark.mllib.util
Finished test(python3.8): pyspark.mllib.recommendation (49s)
Starting test(python3.8): pyspark.profiler
Finished test(python3.8): pyspark.mllib.linalg.distributed (53s)
Starting test(python3.8): pyspark.rdd
Finished test(python3.8): pyspark.profiler (14s)
Starting test(python3.8): pyspark.serializers
Finished test(python3.8): pyspark.mllib.tree (30s)
Starting test(python3.8): pyspark.shuffle
Finished test(python3.8): pyspark.shuffle (2s)
Starting test(python3.8): pyspark.sql.avro.functions
Finished test(python3.8): pyspark.mllib.util (30s)
Starting test(python3.8): pyspark.sql.catalog
Finished test(python3.8): pyspark.serializers (17s)
Starting test(python3.8): pyspark.sql.column
Finished test(python3.8): pyspark.rdd (31s)
Starting test(python3.8): pyspark.sql.conf
Finished test(python3.8): pyspark.sql.conf (7s)
Starting test(python3.8): pyspark.sql.context
Finished test(python3.8): pyspark.sql.avro.functions (19s)
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.catalog (16s)
Starting test(python3.8): pyspark.sql.functions
Finished test(python3.8): pyspark.sql.column (27s)
Starting test(python3.8): pyspark.sql.group
Finished test(python3.8): pyspark.sql.context (26s)
Starting test(python3.8): pyspark.sql.readwriter
Finished test(python3.8): pyspark.sql.group (52s)
Starting test(python3.8): pyspark.sql.session
Finished test(python3.8): pyspark.sql.dataframe (73s)
Starting test(python3.8): pyspark.sql.streaming
Finished test(python3.8): pyspark.sql.functions (75s)
Starting test(python3.8): pyspark.sql.types
Finished test(python3.8): pyspark.sql.readwriter (57s)
Starting test(python3.8): pyspark.sql.udf
Finished test(python3.8): pyspark.sql.types (13s)
Starting test(python3.8): pyspark.sql.window
Finished test(python3.8): pyspark.sql.session (32s)
Starting test(python3.8): pyspark.streaming.util
Finished test(python3.8): pyspark.streaming.util (1s)
Starting test(python3.8): pyspark.util
Finished test(python3.8): pyspark.util (0s)
Finished test(python3.8): pyspark.sql.streaming (30s)
Finished test(python3.8): pyspark.sql.udf (27s)
Finished test(python3.8): pyspark.sql.window (22s)
Tests passed in 855 seconds
```
</p>
</details>

Closes #26194 from HyukjinKwon/SPARK-29536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 16:18:34 +09:00
denglingang 467c3f610f [SPARK-29529][DOCS] Remove unnecessary orc version and hive version in doc
### What changes were proposed in this pull request?

This PR removes unnecessary ORC and Hive version references in the docs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26146 from denglingang/SPARK-24576.

Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 14:49:23 +09:00
angerszhu 484f93e255 [SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe
### What changes were proposed in this pull request?
As I commented in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977),
the SparkSession.sql() method's parse process does not run under the current SparkSession's conf, so some parser-related configurations are not honored in multi-threaded situations.

In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf.
Then each SparkSession's parse process uses its own SessionState's SQLConf and is thread safe.
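
A minimal sketch of the multi-threaded scenario this targets (the config key used here is only illustrative of a parser-affecting setting; any per-session SQL conf works the same way):

```scala
import org.apache.spark.sql.SparkSession

object PerSessionParserConfSketch {
  def main(args: Array[String]): Unit = {
    val base = SparkSession.builder().master("local[*]").appName("parser-conf-sketch").getOrCreate()
    val s1 = base.newSession()
    val s2 = base.newSession()
    // Two sessions with different parser-related settings (illustrative key).
    s1.conf.set("spark.sql.ansi.enabled", "true")
    s2.conf.set("spark.sql.ansi.enabled", "false")

    // Each thread parses SQL through its own session; with this change the
    // parser reads the owning session's SQLConf rather than a shared one.
    val t1 = new Thread(() => s1.sql("SELECT 1").collect())
    val t2 = new Thread(() => s2.sql("SELECT 1").collect())
    t1.start(); t2.start()
    t1.join(); t2.join()
    base.stop()
  }
}
```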

### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
NO

Closes #26187 from AngersZhuuuu/SPARK-29530.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 10:38:06 +08:00
wuyi 3d567a357c [MINOR][SQL] Avoid unnecessary invocation on checkAndGlobPathIfNecessary
### What changes were proposed in this pull request?

Only invoke `checkAndGlobPathIfNecessary()` when we have to use `InMemoryFileIndex`.

### Why are the changes needed?

Avoid unnecessary function invocation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #26196 from Ngone51/dev-avoid-unnecessary-invocation-on-globpath.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-21 21:10:21 -05:00
DylanGuedes bb4400c23a [SPARK-29108][SQL][TESTS] Port window.sql (Part 2)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 320~562

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?
To ensure compatibility with PGSQL

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Pass the Jenkins, and comparison with PgSQL results.

Closes #26121 from DylanGuedes/spark-29108.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:49:40 +09:00
Maxim Gekk eef11ba9ef [SPARK-29518][SQL][TEST] Benchmark date_part for INTERVAL
### What changes were proposed in this pull request?
I extended `ExtractBenchmark` to support the `INTERVAL` type of the `source` parameter of the `date_part` function.

### Why are the changes needed?
- To detect performance issues while changing implementation of the `date_part` function in the future.
- To find out current performance bottlenecks in `date_part` for the `INTERVAL` type

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the benchmark and printing out the produced values for each `field` value.

Closes #26175 from MaxGekk/extract-interval-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:47:54 +09:00
Maxim Gekk 6ffec5e6a6 [SPARK-29533][SQL][TEST] Benchmark casting strings to intervals
### What changes were proposed in this pull request?
Added a new benchmark, `IntervalBenchmark`, to measure the performance of interval-related functions. In this PR, I added benchmarks for casting strings to intervals, in particular interval strings with and without the `interval` prefix, because there is special code for this: da576a737c/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (L100-L103). I also added benchmarks for different numbers of units in the interval strings; for example, 1 unit is `interval 10 years`, and 2 units without the prefix is `10 years 5 months`.

### Why are the changes needed?
- To find out current performance issues in casting to intervals
- The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the benchmark via the command:
```shell
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```

Closes #26189 from MaxGekk/interval-from-string-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:47:04 +09:00
fuwhu 31a5dea48f [SPARK-29531][SQL][TEST] refine ThriftServerQueryTestSuite.blackList to reuse black list in SQLQueryTestSuite
### What changes were proposed in this pull request?
This PR refines the code in ThriftServerQueryTestSuite.blackList to reuse the black list of SQLQueryTestSuite instead of duplicating all test cases from SQLQueryTestSuite.blackList.

### Why are the changes needed?
To reduce code duplication.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26188 from fuwhu/SPARK-TBD.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-21 05:19:27 -07:00
Dongjoon Hyun 5fc363b307 [SPARK-29528][BUILD][TEST-MAVEN] Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
### What changes were proposed in this pull request?

This PR upgrades `scala-maven-plugin` to `4.2.4` for Scala `2.13.1`.

### Why are the changes needed?
Scala 2.13.1 seems to break binary compatibility.

We need to upgrade `scala-maven-plugin` to bring in the following fixes for the latest Scala 2.13.1:
- https://github.com/davidB/scala-maven-plugin/issues/363
- https://github.com/sbt/zinc/issues/698

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

For now, we don't support Scala 2.13. This PR at least needs to pass the existing Jenkins run with Maven to prepare for Scala 2.13.

Closes #26185 from dongjoon-hyun/SPARK-29528.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-21 19:05:27 +09:00
Yuming Wang e99a9f78ea [SPARK-29498][SQL] CatalogTable to HiveTable should not change the table's ownership
### What changes were proposed in this pull request?

Converting a `CatalogTable` to a `HiveTable` changes the table's ownership. How to reproduce:
```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

val identifier = TableIdentifier("spark_29498", None)
val owner = "SPARK-29498"
val newTable = CatalogTable(
  identifier,
  tableType = CatalogTableType.EXTERNAL,
  storage = CatalogStorageFormat(
    locationUri = None,
    inputFormat = None,
    outputFormat = None,
    serde = None,
    compressed = false,
    properties = Map.empty),
  owner = owner,
  schema = new StructType().add("i", LongType, false),
  provider = Some("hive"))

spark.sessionState.catalog.createTable(newTable, false)
// The owner is not SPARK-29498
println(spark.sessionState.catalog.getTableMetadata(identifier).owner)
```

This PR sets the `HiveTable`'s owner to the `CatalogTable`'s owner, if that owner is not empty, when converting `CatalogTable` to `HiveTable`.
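
For illustration, a self-contained sketch of the intended behavior (the types below are simplified stand-ins, not the actual Spark/Hive classes): a non-empty owner from the catalog entry is preserved, and the current user is used only as a fallback.

```scala
// Simplified stand-ins for CatalogTable/HiveTable, for illustration only.
case class CatalogTableLike(name: String, owner: String)
case class HiveTableLike(name: String, owner: String)

def toHiveTableLike(ct: CatalogTableLike, currentUser: String): HiveTableLike = {
  // Preserve a non-empty owner from the catalog entry; fall back to the
  // current user only when no owner was recorded.
  val owner = if (ct.owner != null && ct.owner.nonEmpty) ct.owner else currentUser
  HiveTableLike(ct.name, owner)
}

toHiveTableLike(CatalogTableLike("spark_29498", "SPARK-29498"), "someUser")
// -> HiveTableLike(spark_29498, SPARK-29498): the recorded owner is kept
```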

### Why are the changes needed?
We should not change the ownership of the table when converting `CatalogTable` to `HiveTable`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
unit test

Closes #26160 from wangyum/SPARK-29498.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 15:53:36 +08:00
Kent Yao 5b4d9170ed [SPARK-27879][SQL] Add support for bit_and and bit_or aggregates
### What changes were proposed in this pull request?

```
bit_and(expression) -- The bitwise AND of all non-null input values, or null if none
bit_or(expression) -- The bitwise OR of all non-null input values, or null if none
```
More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html
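
A quick usage sketch in spark-shell (where `spark` is predefined), assuming a build that includes this patch; the inline table and column names are illustrative:

```scala
// bit_and(1, 3, 5) = 1 and bit_or(1, 3, 5) = 7; null inputs are ignored.
spark.sql("SELECT bit_and(col), bit_or(col) FROM VALUES (1), (3), (5) AS t(col)").show()
```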

### Why are the changes needed?

PostgreSQL, MySQL, and many other popular databases support them.

### Does this PR introduce any user-facing change?

Adds the two bit aggregate functions.

### How was this patch tested?

Added unit tests.

Closes #26155 from yaooqinn/SPARK-27879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 14:32:31 +08:00
DB Tsai f4d5aa4213 [SPARK-29434][CORE] Improve the MapStatuses Serialization Performance
### What changes were proposed in this pull request?
Instead of using GZIP to compress the serialized `MapStatuses`, this PR uses ZStd, which provides a better compression ratio and faster compression time.

The original approach serializes and writes the data directly into a `GZIPOutputStream` in one step; however, compression is faster when a bigger chunk of data is handed to the codec at once. As a result, in this PR the serialized data is written into an uncompressed byte array first, and then the whole array is compressed. For smaller `MapStatuses`, we found this to be 2x faster.
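
For illustration, a minimal sketch of the two-step approach described above (serialize into an uncompressed buffer first, then compress the whole buffer at once). It uses GZIP from the JDK only to keep the example self-contained; the PR itself switches to ZStd.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Step 1: collect all serialized data into an uncompressed in-memory buffer.
// Step 2: hand the whole buffer to the codec at once, which is typically faster
//         than streaming many small writes through the compressing stream.
def compressInTwoSteps(chunks: Iterator[Array[Byte]]): Array[Byte] = {
  val raw = new ByteArrayOutputStream()
  chunks.foreach(chunk => raw.write(chunk))
  val out = new ByteArrayOutputStream()
  val codec = new GZIPOutputStream(out)
  codec.write(raw.toByteArray)
  codec.close()
  out.toByteArray
}
```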

Here is the benchmark result.

#### 20k map outputs, and each has 500 blocks
1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes
2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes
3. GZip: 0.092 ops/ms, 217,345 bytes

#### 20k map outputs, and each has 5 blocks
1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes
2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes
3. GZip: 0.21 ops/ms, 160,094 bytes

### Why are the changes needed?
Decrease the time spent serializing the `MapStatuses` in large-scale jobs.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #26085 from dbtsai/mapStatus.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-20 13:56:23 -07:00
Yuming Wang 0f65b49f55 [SPARK-29525][SQL][TEST] Fix the associated location already exists in SQLQueryTestSuite
### What changes were proposed in this pull request?

This PR fixes the "associated location already exists" failure in `SQLQueryTestSuite`:
```
build/sbt "~sql/test-only *SQLQueryTestSuite -- -z postgreSQL/join.sql"
...
[info] - postgreSQL/join.sql *** FAILED *** (35 seconds, 420 milliseconds)
[info]   postgreSQL/join.sql
[info]   Expected "[]", but got "[org.apache.spark.sql.AnalysisException
[info]   Can not create the managed table('`default`.`tt3`'). The associated location('file:/root/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/tt3') already exists.;]" Result did not match for query #108
```

### Why are the changes needed?
Fix bug.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26181 from wangyum/TestError.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-20 13:31:59 -07:00
shahid 4a6005c795 [SPARK-29235][ML][PYSPARK] Support avgMetrics in read/write of CrossValidatorModel
### What changes were proposed in this pull request?
Currently PySpark doesn't write/read `avgMetrics` in `CrossValidatorModel`, whereas Scala supports it.

### Why are the changes needed?
Steps to reproduce:

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0),
                                 (Vectors.dense([0.4]), 1.0),
                                 (Vectors.dense([0.5]), 0.0),
                                 (Vectors.dense([0.6]), 1.0),
                                 (Vectors.dense([1.0]), 1.0)] * 10,
                                ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, parallelism=2)
cvModel = cv.fit(dataset)
cvModel.write().save("/tmp/model")
cvModel2 = CrossValidatorModel.read().load("/tmp/model")
print(cvModel.avgMetrics)   # prints a non-empty result, as expected
print(cvModel2.avgMetrics)  # Bug: prints an empty result.
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually tested

Before patch:
```
>>> cvModel.write().save("/tmp/model_0")
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0")
>>> print(cvModel2.avgMetrics)
[]
```

After patch:
```
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2")
>>> print(cvModel2.avgMetrics[0])
0.5
```

Closes #26038 from shahidki31/avgMetrics.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-19 15:23:57 -05:00
Terry Kim ab92e1715e [SPARK-29512][SQL] REPAIR TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework as v2 commands.

### Why are the changes needed?

It's important for all the commands to have the same table resolution behavior, to avoid confusing end users. For example:

```
USE my_catalog
DESC t -- succeeds and describes the table t from my_catalog
MSCK REPAIR TABLE t -- reports table not found, as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26168 from imback82/repair_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-18 22:43:58 -07:00
Wenchen Fan 2437878299 [SPARK-29502][SQL] typed interval expression should fail for invalid format
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/25241 .

The typed interval expression should fail for invalid format.
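
A hedged illustration in spark-shell (where `spark` is predefined), assuming a build that includes this patch; the literal values below are made up:

```scala
spark.sql("SELECT interval '1 year 2 days'").show()  // well-formed, returns an interval value
spark.sql("SELECT interval 'not an interval'")       // expected to fail with a parse error after this patch
```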

### Why are the changes needed?

To be consistent with the typed timestamp/date expressions.

### Does this PR introduce any user-facing change?

Yes. But this feature is not released yet.

### How was this patch tested?

updated test

Closes #26151 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-18 16:12:03 -07:00
Dongjoon Hyun e4b4a35de2 [SPARK-29466][WEBUI] Show Duration for running drivers in Standalone master web UI
### What changes were proposed in this pull request?

This PR aims to add a new `Duration` column for running drivers in the Apache Spark `Standalone` master web UI in order to improve UX. This helps users in the same way as the other `Duration` columns in the `Running` and `Completed` application tables.

### Why are the changes needed?

When we use `--supervise`, the drivers can survive longer.
Technically, the `Duration` column is not the same. (Please see the image below.)

### Does this PR introduce any user-facing change?

Yes. The red box is added newly.

<img width="1312" alt="Screen Shot 2019-10-14 at 12 53 43 PM" src="https://user-images.githubusercontent.com/9700541/66779127-50301b80-ee82-11e9-853f-72222cd011ac.png">

### How was this patch tested?

Manual, since this is a UI column. After starting a standalone cluster and a job, kill the `DriverWrapper` and check the UI.

```
$ sbin/start-master.sh
$ sbin/start-slave.sh spark://$(hostname):7077
$ bin/spark-submit --master spark://$(hostname):7077 --deploy-mode cluster --supervise --class org.apache.spark.examples.JavaSparkPi examples/target/scala-2.12/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar 1000
$ jps
41521 DriverWrapper
...
$ kill -9 41521   # kill the DriverWrapper
```

Closes #26113 from dongjoon-hyun/SPARK-29466.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-18 15:39:44 -07:00
Rahul Mahadev 4cfce3e5d0 [SPARK-29494][SQL] Fix for ArrayOutofBoundsException while converting string to timestamp
### What changes were proposed in this pull request?
* Added an additional check in `stringToTimestamp` to handle cases where the input has a trailing ':'
* Added a test to make sure this works.

### Why are the changes needed?
In a couple of scenarios while converting from String to Timestamp, `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is a trailing ':'. The contract of this method requires it to return `None` when the format of the string is incorrect.
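
As a self-contained illustration of that contract (not the actual `DateTimeUtils` code), a defensive parser must return `None` rather than index past the end of the input when a segment is missing, e.g. for a trailing ':':

```scala
// Toy parser for an "HH:mm:ss" string; "12:34:" must yield None, not throw.
def parseHms(s: String): Option[(Int, Int, Int)] = {
  val parts = s.split(":", -1) // keep trailing empty segments: "12:34:" -> Array("12", "34", "")
  if (parts.length != 3 || parts.exists(_.isEmpty)) None
  else scala.util.Try((parts(0).toInt, parts(1).toInt, parts(2).toInt)).toOption
}

parseHms("12:34:56") // Some((12, 34, 56))
parseHms("12:34:")   // None, instead of an ArrayIndexOutOfBoundsException
```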

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.

Closes #26143 from rahulsmahadev/SPARK-29494.

Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-18 16:45:25 -05:00
DB Tsai 23f45f1822 [SPARK-29515][CORE] MapStatuses SerDeser Benchmark

### What changes were proposed in this pull request?
Add benchmark code for MapStatuses serialization & deserialization performance.

### Why are the changes needed?
To provide a baseline for comparing the performance of future optimizations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No test is required.

Closes #26169 from dbtsai/benchmark.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-10-18 21:30:36 +00:00