Commit graph

27192 commits

Author SHA1 Message Date
Dongjoon Hyun 3772154442
[SPARK-31691][INFRA] release-build.sh should ignore a fallback output from build/mvn
### What changes were proposed in this pull request?

This PR adds `i` option to ignore additional `build/mvn` output which is irrelevant to version string.

### Why are the changes needed?

SPARK-28963 added additional output message, `Falling back to archive.apache.org to download Maven` in build/mvn. This breaks `dev/create-release/release-build.sh` and currently Spark Packaging Jenkins job is hitting this issue consistently and broken.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This happens only when the mirror fails. So, this is verified manually hiject the script. It works like the following.
```
$ echo 'Falling back to archive.apache.org to download Maven' > out
$ build/mvn help:evaluate -Dexpression=project.version >> out
Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn
$ cat out | grep -v INFO | grep -v WARNING | grep -v Download
Falling back to archive.apache.org to download Maven
3.1.0-SNAPSHOT
$ cat out | grep -v INFO | grep -v WARNING | grep -vi Download
3.1.0-SNAPSHOT
```

Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-12 14:24:56 -07:00
Dongjoon Hyun 0feb3cbe77
[SPARK-31687][INFRA] Use GitHub instead of GitBox in release script
### What changes were proposed in this pull request?

This PR aims to use GitHub urls instead of GitHub in the release scripts.

### Why are the changes needed?

Currently, Spark Packaing jobs are broken due to GitBox issue.
```
fatal: unable to access 'https://gitbox.apache.org/repos/asf/spark.git/': Failed to connect to gitbox.apache.org port 443: Connection timed out
```

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2906/console
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-3.0-maven-snapshots/105/console
 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.4-maven-snapshots/439/console

### Does this PR introduce _any_ user-facing change?

No. (This is a dev-only script.)

### How was this patch tested?

Manual.
```
$ cat ./test.sh
. dev/create-release/release-util.sh
get_release_info
git clone "$ASF_REPO"

$ sh test.sh
Branch [branch-3.0]:
Current branch version is 3.0.1-SNAPSHOT.
Release [3.0.0]:
RC # [2]:
Full name [Dongjoon Hyun]:
GPG key [dongjoonapache.org]:
================
Release details:
BRANCH:     branch-3.0
VERSION:    3.0.0
TAG:        v3.0.0-rc2
NEXT:       3.0.1-SNAPSHOT

ASF USER:   dongjoon
GPG KEY:    dongjoonapache.org
FULL NAME:  Dongjoon Hyun
E-MAIL:     dongjoonapache.org
================
Is this info correct [y/n]? y
ASF password:
GPG passphrase:
Cloning into 'spark'...
remote: Enumerating objects: 223, done.
remote: Counting objects: 100% (223/223), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 708324 (delta 70), reused 138 (delta 32), pack-reused 708101
Receiving objects: 100% (708324/708324), 322.08 MiB | 2.94 MiB/s, done.
Resolving deltas: 100% (289268/289268), done.
Updating files: 100% (16287/16287), done.

$ sh test.sh

...
```

Closes #28513 from dongjoon-hyun/SPARK-31687.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-12 13:07:00 -07:00
Dongjoon Hyun 07209f3e2d [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result
### What changes were proposed in this pull request?

This PR aims to update Prometheus-related output format to be consistent with DropWizard 4.1 result.
- Add `Number` metrics for gauges metrics.
- Add `type` labels.

### Why are the changes needed?

SPARK-29032 added Prometheus support. After that, SPARK-29674 upgraded DropWizard for JDK9+ support and this caused difference in output labels and number of keys for Guage metrics. The current status is different from Apache Spark 2.4.5. Since we cannot change DropWizard, this PR aims to be consistent in Apache Spark 3.0.0 only.

**DropWizard 3.x**
```
metrics_master_aliveWorkers_Value 1.0
```

**DropWizard 4.1**
```
metrics_master_aliveWorkers_Value{type="gauges",} 1.0
metrics_master_aliveWorkers_Number{type="gauges",} 1.0
```

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature in 3.0.0.

### How was this patch tested?

Manually check the output like the following.

**JMXExporter Result**
```
$ curl -s http://localhost:8088/ | grep "^metrics_master" | sort
metrics_master_aliveWorkers_Number{type="gauges",} 1.0
metrics_master_aliveWorkers_Value{type="gauges",} 1.0
metrics_master_apps_Number{type="gauges",} 0.0
metrics_master_apps_Value{type="gauges",} 0.0
metrics_master_waitingApps_Number{type="gauges",} 0.0
metrics_master_waitingApps_Value{type="gauges",} 0.0
metrics_master_workers_Number{type="gauges",} 1.0
metrics_master_workers_Value{type="gauges",} 1.0
```

**This PR**
```
$ curl -s http://localhost:8080/metrics/master/prometheus/ | grep master
metrics_master_aliveWorkers_Number{type="gauges"} 1
metrics_master_aliveWorkers_Value{type="gauges"} 1
metrics_master_apps_Number{type="gauges"} 0
metrics_master_apps_Value{type="gauges"} 0
metrics_master_waitingApps_Number{type="gauges"} 0
metrics_master_waitingApps_Value{type="gauges"} 0
metrics_master_workers_Number{type="gauges"} 1
metrics_master_workers_Value{type="gauges"} 1
```

Closes #28510 from dongjoon-hyun/SPARK-31683.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-05-12 19:57:48 +00:00
Ali Smesseim 6994c64efd [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener
### What changes were proposed in this pull request?

The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.

Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this hampers with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer.

Authored-by: Ali Smesseim <ali.smesseim@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-12 09:14:34 -07:00
Weichen Xu e248bc7af6 [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF
### What changes were proposed in this pull request?
Expose hashFunc property in HashingTF

Some third-party library such as mleap need to access it.
See background description here:
https://github.com/combust/mleap/pull/665#issuecomment-621258623

### Why are the changes needed?
See https://github.com/combust/mleap/pull/665#issuecomment-621258623

### Does this PR introduce any user-facing change?
No. Only add a package private constructor.

### How was this patch tested?
N/A

Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-05-12 08:54:28 -07:00
Max Gekk a3fafddf39 [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
### What changes were proposed in this pull request?
Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.

### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.

Closes #28502 from MaxGekk/random-java8-datetime.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 14:05:31 +00:00
Javier Fuentes 178ca961fe [SPARK-31102][SQL] Spark-sql fails to parse when contains comment
### What changes were proposed in this pull request?

This PR introduces a change to false for the insideComment flag on a newline. Fixing the issue introduced by SPARK-30049.

### Why are the changes needed?

Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:
```
spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```

This was caused because there was no flag for comment sections inside the splitSemiColon method to ignore quotes. SPARK-30049 added that flag and fixed the issue, but introduced the follwoing problem:
```
spark-sql> select
         >   1,
         >   -- two
         >   2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
  1,
--^^^
```
This issue is generated by a missing turn-off for the insideComment flag with a newline.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- For previous tests using line-continuity(`\`) it was added a line-continuity rule in the SqlBase.g4 file to add the functionality to the SQL context.
- A new test for inline comments was added.

Closes #27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:46:24 +00:00
Kent Yao ce714d8189 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs
### What changes were proposed in this pull request?

When I was finding the root cause for SPARK-31675, I noticed that it was very difficult for me to see what was actually going on, since it output nothing else but only
```sql
Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1
```

It is really hard for us to find causes through such a simple error message without a certain amount of experience.

In this PR, I propose to print all of the stack traces when AnalysisException occurs if there are underlying root causes, also we can escape this via `-S` option.

### Why are the changes needed?

In SPARK-11188,

>For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query.

But nowadays, some `AnalysisException`s do have useful information for us to debug, e.g. the `AnalysisException` below may contain exceptions from hive or Hadoop side.

a28ed86a38/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (L97-L112)

```scala
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468)
	at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626)
	at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593)
```

### Does this PR introduce _any_ user-facing change?

Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed.

#### before

```scala
Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10;
```

#### After
```scala
Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
	at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:486)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:480)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:2093)
	at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:289)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1221)
	at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2607)
	at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2892)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:927)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
	at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
	... 52 more
```

You can use `-S` option to restore old behavior if you find the error is too verbose.

### How was this patch tested?

Existing CliSuite -
`SPARK-11188 Analysis error reporting`

Add new test and verify manually

Closes #28499 from yaooqinn/SPARK-31678.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:37:13 +00:00
Antonin Delpeuch 59d90997a5 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
### What changes were proposed in this pull request?

This changes the docs to make it clearer that order preservation is not guaranteed when saving a RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)).

I added two sentences about this in the RDD Programming Guide.

The issue was discussed on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

### Why are the changes needed?

Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain a RDD in the same order.

This is unfortunately not the case at the moment and there is no agreed upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this.

### Does this PR introduce _any_ user-facing change?

Yes, two new sentences in the documentation.

### How was this patch tested?

By checking that the documentation looks good.

Closes #28465 from wetneb/SPARK-5300-docs.

Authored-by: Antonin Delpeuch <antonin@delpeuch.eu>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-12 08:27:43 -05:00
zhengruifeng e7fa778dc7 [SPARK-30699][ML][PYSPARK] GMM blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, if blockSize==1, keep original behavior, code path trainOnRows;
3, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path trainOnBlocks

### Why are the changes needed?
performance gain on dense dataset HIGGS:
1, save about 45% RAM;
2, 3X faster with openBLAS

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
added testsuites

Closes #27473 from zhengruifeng/blockify_gmm.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-12 12:54:03 +08:00
beliefer a89006aba0 [SPARK-31393][SQL] Show the correct alias in schema for expression
### What changes were proposed in this pull request?
Some alias of expression can not display correctly in schema. This PR will fix them.
- `TimeWindow`
- `MaxBy`
- `MinBy`
- `UnaryMinus`
- `BitwiseCount`

This PR also fix a typo issue, please look at b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)

Note:

1. `MaxBy` and `MinBy` extends `MaxMinBy` and the latter add a method `funcName` not needed.  We can reuse `prettyName` to replace `funcName`.
2. Spark SQL exists some function no elegant implementation.For example: `BitwiseCount` override the sql method show below:
`override def sql: String = s"bit_count(${child.sql})"`
I don't think it's elegant enough. Because `Expression` gives the following definitions.
```
  def sql: String = {
    val childrenSQL = children.map(_.sql).mkString(", ")
    s"$prettyName($childrenSQL)"
  }
```
By this definition, `BitwiseCount` should override `prettyName` method.

### Why are the changes needed?
Improve the implement of some expression.

### Does this PR introduce any user-facing change?
 'Yes'. This PR will let user see the correct alias in schema.

### How was this patch tested?
Jenkins test.

Closes #28164 from beliefer/elegant-pretty-name-for-function.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-12 10:25:04 +09:00
Jungtaek Lim (HeartSaVioR) 842b1dcdff [SPARK-31559][YARN] Re-obtain tokens at the startup of AM for yarn cluster mode if principal and keytab are available
### What changes were proposed in this pull request?

This patch re-obtain tokens at the start of AM for yarn cluster mode, if principal and keytab are available. It basically transfers the credentials from the original user, so this patch puts the new tokens into credentials from the original user via overwriting.

To obtain tokens from providers in user application, this patch leverages the user classloader as context classloader while initializing token manager in the startup of AM.

### Why are the changes needed?

Submitter will obtain delegation tokens for yarn-cluster mode, and add these credentials to the launch context. AM will be launched with these credentials, and AM and driver are able to leverage these tokens.

In Yarn cluster mode, driver is launched in AM, which in turn initializes token manager (while initializing SparkContext) and obtain delegation tokens (+ schedule to renew) if both principal and keytab are available.

That said, even we provide principal and keytab to run application with yarn-cluster mode, AM always starts with initial tokens from launch context until token manager runs and obtains delegation tokens.

So there's a "gap", and if user codes (driver) access to external system with delegation tokens (e.g. HDFS) before initializing SparkContext, it cannot leverage the tokens token manager will obtain. It will make the application fail if AM is killed "after" the initial tokens are expired and relaunched.

This is even a regression: see below codes in branch-2.4:

https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/AMCredentialRenewer.scala

In Spark 2.4.x, AM runs AMCredentialRenewer at initialization, and AMCredentialRenew obtains tokens and merge with credentials being provided with launch context of AM. So it guarantees new tokens in driver run.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested with specifically crafted application (simple reproducer) - https://github.com/HeartSaVioR/spark-delegation-token-experiment/blob/master/src/main/scala/net/heartsavior/spark/example/LongRunningAppWithHDFSConfig.scala

Before this patch, new AM attempt failed when I killed AM after the expiration of tokens. After this patch the new AM attempt runs fine.

Closes #28336 from HeartSaVioR/SPARK-31559.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
2020-05-11 17:25:41 -07:00
fan31415 64fb358a99 [SPARK-31671][ML] Wrong error message in VectorAssembler
### What changes were proposed in this pull request?
When input column lengths can not be inferred and handleInvalid = "keep",  VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has vector size, the error message should tell us to set vector size for n2 too
output.show()
```

Expected error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415 <fan12356789@gmail.com>
Co-authored-by: yijiefan <fanyije@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 18:23:23 -05:00
oleg d7c3e9e53e
[SPARK-31456][CORE] Fix shutdown hook priority edge cases
### What changes were proposed in this pull request?
Fix application order for shutdown hooks for the priorities of Int.MaxValue, Int.MinValue

### Why are the changes needed?
The bug causes out-of-order execution of shutdown hooks if their priorities were Int.MinValue or Int.MaxValue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a test covering the change.

Closes #28494 from oleg-smith/SPARK-31456_shutdown_hook_priority.

Authored-by: oleg <oleg@nexla.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-11 13:10:39 -07:00
Gabor Somogyi 5a5af46a94 [SPARK-31575][SQL] Synchronise global JVM security configuration modification
### What changes were proposed in this pull request?
There is a race in secure JDBC connection providers. Namely multiple providers can read and/or write the exact same JVM security configuration at the same time. In this PR I've synchronised them with an object class. Since the configuration read and write takes couple of milliseconds it won't cause performance degradation.

### Why are the changes needed?
There is a race in secure JDBC connection providers.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing unit + integration tests.

Closes #28368 from gaborgsomogyi/SPARK-31575.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:10:58 -05:00
Huaxin Gao 7a670b5a0a [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest
### What changes were proposed in this pull request?
Add Python version of
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```

### Why are the changes needed?
parity between scala and python

### Does this PR introduce _any_ user-facing change?
yes
new method
```
Since("3.1.0")
def test(
    dataset: DataFrame,
    featuresCol: String,
    labelCol: String,
    flatten: Boolean): DataFrame
```
in PySpark ANOVATest/ChisqTest/FValueTest

### How was this patch tested?
New doctest

Closes #28483 from huaxingao/flatten_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-11 09:09:00 -05:00
Max Gekk 32a5398b65 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date//timestamp values with 0.5 probability. This will trigger dictionary encoding in Parquet datasource test  HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.

### Why are the changes needed?
To extend test coverage. Currently, probability of testing of dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed in wide range, and the chance of generating the same values is pretty low. In this way, parquet datasource cannot apply dictionary encoding for such column types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.

Closes #28481 from MaxGekk/test-random-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 12:59:41 +00:00
Dongjoon Hyun b80309bdb4
[SPARK-31674][CORE][DOCS] Make Prometheus metric endpoints experimental
### What changes were proposed in this pull request?

This PR aims to new Prometheus-format metric endpoints experimental in Apache Spark 3.0.0.

### Why are the changes needed?

Although the new metrics are disabled by default, we had better make it experimental explicitly in Apache Spark 3.0.0 since the output format is still not fixed. We can finalize it in Apache Spark 3.1.0.

### Does this PR introduce _any_ user-facing change?

Only doc-change is visible to the users.

### How was this patch tested?

Manually check the code since this is a documentation and class annotation change.

Closes #28495 from dongjoon-hyun/SPARK-31674.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-10 22:32:26 -07:00
Max Gekk 5d5866be12 [SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle especially `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianMicros()`.

### Why are the changes needed?
This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala>
Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```
Expected values **must be 1001-01-01 01:02:03.123** but not 1001-01-07 00:32:20.123.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to checked reading dictionary encoded dates.

Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 04:58:08 +00:00
Max Gekk 9f768fa991 [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
### What changes were proposed in this pull request?
Shift non-existing dates in Proleptic Gregorian calendar by 1 day. The reason for that is `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some dates/timestamps don't exist in Proleptic Gregorian calendar like 1000-02-29 because 1000 is not leap year in Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite and set non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 14:22:12 -05:00
Huaxin Gao a75dc80a76 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference
### What changes were proposed in this pull request?
Remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
Please see #28414

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually build and check

Closes #28451 from huaxingao/html_cleanup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 12:57:25 -05:00
Max Gekk ce63bef1da [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method `VectorizedColumnReader` to handle especially the `DateType` when passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianDays()`.

### Why are the changes needed?
This fixes the bug of loading dates before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
Expected values **must be 1000-01-01** but not 1001-01-07.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to checked reading dictionary encoded dates.

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-10 13:31:26 +09:00
tianlzhang ecda38a7b3
[SPARK-31611][YARN] Register NettyMemoryMetrics into Node Manager's metrics system
### What changes were proposed in this pull request?

Register `NettyMemoryMetrics` into Node Manager's metrics system through `YarnShuffleServiceMetrics`.

- usedDirectMemory
- usedHeapMemory

### Why are the changes needed?

Such that `NettyMemoryMetrics` can be exposed through Node Manager's JMX.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Update UT to ensure NettyMemoryMetrics are registered into Node Manager's metrics system.

Closes #28416 from manuzhang/spark-31611.

Authored-by: tianlzhang <tianlzhang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-08 15:50:19 -07:00
wang-zhun c1801fd6da [SPARK-31235][FOLLOWUP][TESTS][TEST-HADOOP3.2] Fix test "specify a more specific type for the ap…
### What changes were proposed in this pull request?
Update the input parameters for instantiating `RMAppManager` and `ClientRMService`

### Why are the changes needed?
For hadoop3.2, if `RMAppManager` is not created correctly, the following exception will occur:
```
java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
	at org.apache.hadoop.yarn.security.YarnAuthorizationProvider.getInstance(YarnAuthorizationProvider.java:55)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.<init>(RMAppManager.java:117)
```

### How was this patch tested?
UTs

Closes #28456 from wang-zhun/Fix-SPARK-31235.

Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-08 15:41:23 -05:00
manuzhang 77c690a725 [SPARK-31658][SQL] Fix SQL UI not showing write commands of AQE plan
### What changes were proposed in this pull request?
Show write commands on SQL UI of an AQE plan

### Why are the changes needed?
Currently the leaf node of an AQE plan is always a `AdaptiveSparkPlan` which is not true when it's a child of a write command. Hence, the node of the write command as well as its metrics are not shown on the SQL UI.

#### Before

![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png)

#### After

![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add UT.

Closes #28474 from manuzhang/aqe-ui.

Lead-authored-by: manuzhang <owenzhang1990@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-08 10:24:13 -07:00
Kousuke Saruta 0fb607ef37 [SPARK-30385][WEBUI] WebUI occasionally throw IOException on stop()
### What changes were proposed in this pull request?

This PR added a workaround for the issue which occasionally happens when SparkContext#stop() is called.
I think this issue can occurs on macOS with OpenJDK / OracleJDK 1.8.
If this issue happens, following stack trace is shown.
```
20/05/03 02:17:54 WARN AbstractConnector:
java.io.IOException: No such file or directory
	at sun.nio.ch.NativeThread.signal(Native Method)
	at sun.nio.ch.ServerSocketChannelImpl.implCloseSelectableChannel(ServerSocketChannelImpl.java:292)
	at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
	at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
	at org.eclipse.jetty.server.ServerConnector.close(ServerConnector.java:368)
	at org.eclipse.jetty.server.AbstractNetworkConnector.shutdown(AbstractNetworkConnector.java:105)
	at org.eclipse.jetty.server.Server.doStop(Server.java:439)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
	at org.apache.spark.ui.ServerInfo.stop(JettyUtils.scala:501)
	at org.apache.spark.ui.WebUI.$anonfun$stop$2(WebUI.scala:173)
	at org.apache.spark.ui.WebUI.$anonfun$stop$2$adapted(WebUI.scala:173)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.ui.WebUI.stop(WebUI.scala:173)
	at org.apache.spark.ui.SparkUI.stop(SparkUI.scala:101)
	at org.apache.spark.SparkContext.$anonfun$stop$6(SparkContext.scala:1966)
	at org.apache.spark.SparkContext.$anonfun$stop$6$adapted(SparkContext.scala:1966)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.SparkContext.$anonfun$stop$5(SparkContext.scala:1966)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1966)
	at org.apache.spark.repl.Main$.$anonfun$doMain$3(Main.scala:79)
	at org.apache.spark.repl.Main$.$anonfun$doMain$3$adapted(Main.scala:79)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.repl.Main$.doMain(Main.scala:79)
	at org.apache.spark.repl.Main$.main(Main.scala:58)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

This issue happens when the Jetty's acceptor thread shrinks before the main thread sends a signal to the thread.

Jetty's acceptor thread waits for a new connection request and blocked by `accept(this.fd, newfd, isaa)` in [`sun.nio.ch.ServerSocketChannelImpl#accept`](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l241).

When `org.eclipse.jetty.server.Server.doStop` is called in the main thread, the thread reaches [this code](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l280).

The server socket descriptor will be closed by `nd.preClose` in the main thread.
Then, `accept()` in acceptor thread throws an Exception due to "Bad file descriptor" in case of macOS.
After the exception is thrown, the acceptor thread will continue to [fetch a task](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L783).
If the thread obtain the `SHRINK` task [here](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L854), the thread will be shrink.
If, the acceptor thread finishes before `NativeThread.signal` is called in the main thread, this issue happens.

I have confirmed this issue happens even `jetty-9.4.28.v20200408`.
Because the stack trace is displayed by the [logger](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java#L372), it's difficult to suppress it.
According to [this condition](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L842), shrink doesn't happen if the idle time is 0. So this PR adds a workaround that set the idle time to 0 immediately before stop.

In case of Linux, the acceptor thread is still blocked by `accept` even though `np.preClose` is called in the main thread.
The acceptor thread will return from `accept` when `NativeThread.signal` is called in the main thread.
It seems that the implementation of `accept systemcall` called in `accept` is different between Linux and macOS.
So, I believe this issue doesn't happen on Linux.

Also, the implementation of `NativeThread.signal` is a little bit changed in [OpenJDK 9](http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/7b17bff2ea36) for macOS.
So this issue doesn't happen for macOS with OpenJDK 9+.

You can reproduce this issue by following instructions using debugger.

1. Launch spark-shell in local mode with JDWP enabled.
2. Access to WebUI. This is needed to increase the number of SparkUI thread to greater than minThreads to meet the condition of shrink.
3. Enable the following breakpoints. Note that don't suspend all threads when a thread reaches one of the breakpoints. Only the threads which reach the line should be suspended.
  3.1 [long now = System.nanoTime(); at org.eclipse.jetty.util.thread.QueuedThreadPool#idleJobPoll](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L850)
  3.2 [NativeThread.signal(th); at sun.nio.ch.ServerSocketChannelImpl#implCloseSelectableChannel](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l283)
  3.3 [thread = 0; at ServerSocketChannelImpl#accept](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l247)
4. Quit spark-shell.
5.  Waiting for a thread reaching the breakpoint `3.1` and until the following condition become true (The idle time of those threads are 1min and you can confirm it using the expression evaluation feature if your debugger supports ).
`(System.nanoTime() - last) > TimeUnit.MILLISECONDS.toNanos(_idleTimeout)`
6. The acceptor thread named `SparkUI-<N>-acceptor-0` should be suspended at the breakpoint `3.3` so continue this thread. This thread will reach the breakpoint at `3.1` and continue further. Then, the acceptor thread will be shrink.
7. Continue all the threads rest.

### Why are the changes needed?

This stack trace is not brought by Spark but it confuses users.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested by the reproduce procedure above and confirmed acceptor thread is no longer shrink.

Closes #28437 from sarutak/SPARK-30385.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-08 08:41:18 +00:00
zhengruifeng bb9b50c217 [SPARK-31656][ML][PYSPARK] AFT blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as epsilon
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (~10X speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28473 from zhengruifeng/blockify_aft.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 14:06:36 +08:00
Huaxin Gao 18d2ba53e4 [SPARK-31652][ML][PYSPARK] Add ANOVASelector and FValueSelector to PySpark
### What changes were proposed in this pull request?
Add ANOVASelector and FValueSelector to PySpark

### Why are the changes needed?
ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well.

### Does this PR introduce _any_ user-facing change?
Yes. Add Python version of ANOVASelector and FValueSelector

### How was this patch tested?
new doctest

Closes #28464 from huaxingao/selector_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 11:02:24 +08:00
Huaxin Gao 08335b651a [SPARK-31659][ML][DOCS] Add VarianceThresholdSelector examples and doc
### What changes were proposed in this pull request?
Add VarianceThresholdSelector examples and doc

### Why are the changes needed?
VarianceThresholdSelector is a new feature selector in 3.1.0. We need to add examples and doc

### Does this PR introduce _any_ user-facing change?
Yes.
add Scala, Python and Java examples for VarianceThresholdSelector. Also add doc

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 01 AM" src="https://user-images.githubusercontent.com/13592258/81321791-e3f84d80-9047-11ea-837b-e39c193bd437.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 44 AM" src="https://user-images.githubusercontent.com/13592258/81321806-e8246b00-9047-11ea-8f35-206e330a92ab.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 27 AM" src="https://user-images.githubusercontent.com/13592258/81321822-ea86c500-9047-11ea-8743-99adec7f502b.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 43 AM" src="https://user-images.githubusercontent.com/13592258/81321826-ec508880-9047-11ea-9e7a-22ee5e13f495.png">

### How was this patch tested?
Manually checked

Closes #28478 from huaxingao/variance_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:57:35 +08:00
zhengruifeng 97332f26bf [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;

### Why are the changes needed?
it will obtain performance gain on dense datasets, such as `epsilon`
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines;  (up to 6X(squaredError)~12X(huber) speedup)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28471 from zhengruifeng/blockify_lir_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:52:01 +08:00
Dongjoon Hyun 24fac1e0c7
[SPARK-31646][FOLLOWUP][TESTS] Add clean up code and disable irrelevent conf 2020-05-07 17:50:32 -07:00
tianlzhang dad61ed465
[SPARK-31646][SHUFFLE] Remove unused registeredConnections counter from ShuffleMetrics
### What changes were proposed in this pull request?
Remove unused `registeredConnections` counter from `ExternalBlockHandler#ShuffleMetrics`

This was added by SPARK-25642 at 3.0.0
- 8dd29fe36b

### Why are the changes needed?
It's `registeredConnections` counter created in `TransportContext` that's really counting the numbers and it's misleading for people who want to add new metrics like `registeredConnections`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add UTs to ensure all expected metrics are registered for `ExternalShuffleService` and `YarnShuffleService`

Closes #28457 from manuzhang/spark-31611-pre.

Lead-authored-by: tianlzhang <tianlzhang@ebay.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-07 15:22:13 -07:00
angerszhu 0d9faf602e
[SPARK-31655][BUILD] Upgrade snappy-java to 1.1.7.5
### What changes were proposed in this pull request?

snappy-java have release v1.1.7.5, upgrade to latest version.

Fixed in v1.1.7.4
- Caching internal buffers for SnappyFramed streams #234
- Fixed the native lib for ppc64le to work with glibc 2.17 (Previously it depended on 2.22)

Fixed in v1.1.7.5
- Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4

https://github.com/xerial/snappy-java/compare/1.1.7.3...1.1.7.5

v 1.1.7.5 release note:
edc4ec28bd
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need

Closes #28472 from AngersZhuuuu/spark-31655.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-07 12:01:43 -07:00
Max Gekk 272d229005 [SPARK-31361][SQL][TESTS][FOLLOWUP] Check non-vectorized Parquet reader while date/timestamp rebasing
### What changes were proposed in this pull request?
In PR, I propose to modify two tests of `ParquetIOSuite`:
- SPARK-31159: rebasing timestamps in write
- SPARK-31159: rebasing dates in write

to check non-vectorized Parquet reader together with vectorized reader.

### Why are the changes needed?
To improve test coverage and make sure that non-vectorized reader behaves similar to the vectorized reader.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `PaquetIOSuite`:
```
$ ./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28466 from MaxGekk/test-novec-rebase-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-07 07:52:29 +00:00
Kent Yao b31ae7bb0b [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
### What changes were proposed in this pull request?

The RuntimeReplaceable ones are runtime replaceable, thus, their original parameters are not going to be resolved to PrettyAttribute and remain debug style string if we directly implement their `sql` methods with their parameters' `sql` methods.

This PR is raised with suggestions by maropu and cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of  the RuntimeReplaceable ones with toPettySQL

### Why are the changes needed?

Consistency of schema output between RuntimeReplaceable expressions and normal ones.

For example, `date_format` vs `to_timestamp`, before this PR, they output differently

#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")

struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>

````

### Does this PR introduce _any_ user-facing change?

Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example

### How was this patch tested?
regenerate all related tests

Closes #28420 from yaooqinn/SPARK-31615.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:40:26 +09:00
Kent Yao bd6b53cc0b [SPARK-31631][TESTS] Fix test flakiness caused by MiniKdc which throws 'address in use' BindException with retry
### What changes were proposed in this pull request?
The `Kafka*Suite`s are flaky because of the Hadoop MiniKdc issue - https://issues.apache.org/jira/browse/HADOOP-12656
> Looking at MiniKdc implementation, if port is 0, the constructor use ServerSocket to find an unused port, assign the port number to the member variable port and close the ServerSocket object; later, in initKDCServer(), instantiate a TcpTransport object and bind at that port.

> It appears that the port may be used in between, and then throw the exception.

Related test failures are suspected,  such as https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122225/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

```scala
[info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** (15 seconds, 426 milliseconds)
[info]   java.net.BindException: Address already in use
[info]   at sun.nio.ch.Net.bind0(Native Method)
[info]   at sun.nio.ch.Net.bind(Net.java:433)
[info]   at sun.nio.ch.Net.bind(Net.java:425)
[info]   at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
[info]   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
[info]   at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198)
[info]   at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68)
[info]   at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422)
[info]   at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
```
After comparing the error stack trace with similar issues reported  in different projects, such as
https://issues.apache.org/jira/browse/KAFKA-3453
https://issues.apache.org/jira/browse/HBASE-14734

We can be sure that they are caused by the same problem issued in HADOOP-12656.

In the PR, We apply the approach from HBASE first before we finally drop Hadoop 2.7.x

### Why are the changes needed?

fix test flakiness

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

the test itself passing Jenkins

Closes #28442 from yaooqinn/SPARK-31631.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:37:03 +09:00
zhengruifeng 052ff49acd [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`

### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28458 from zhengruifeng/blockify_lor_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-07 10:07:24 +08:00
Liang-Chi Hsieh 9bf738724a [SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown
### What changes were proposed in this pull request?

This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document.

### Why are the changes needed?

Make developers less confusing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #28468 from viirya/SPARK-31365-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:57:08 +09:00
Max Gekk 3d38bc2605 [SPARK-31361][SQL][FOLLOWUP] Use LEGACY_PARQUET_REBASE_DATETIME_IN_READ instead of avro config in ParquetIOSuite
### What changes were proposed in this pull request?
Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ ` by `LEGACY_PARQUET_REBASE_DATETIME_IN_READ ` in `ParquetIOSuite`.

### Why are the changes needed?
Avro config is not relevant to the parquet tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetIOSuite` via
```
./build/sbt "test:testOnly *ParquetIOSuite"
```

Closes #28461 from MaxGekk/fix-conf-in-ParquetIOSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-07 09:46:42 +09:00
HyukjinKwon 5c5dd77d6a [SPARK-31647][SQL] Deprecate 'spark.sql.optimizer.metadataOnly' configuration
### What changes were proposed in this pull request?

This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release.

### Why are the changes needed?

This optimization can cause a potential correctness issue, see also SPARK-26709.
Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590.

Looks we should just better let users use `SparkSessionExtensions` instead if they must use, and remove it in Spark side.

### Does this PR introduce _any_ user-facing change?

Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning:

```scala
scala> spark.conf.unset("spark.sql.optimizer.metadataOnly")
```
```
20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
 deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
```scala
scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```
```
20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```

### How was this patch tested?

Manually tested.

Closes #28459 from HyukjinKwon/SPARK-31647.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:00:59 +09:00
Huaxin Gao 09ece50799 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark
### What changes were proposed in this pull request?
Add VarianceThresholdSelector to PySpark

### Why are the changes needed?
parity between Scala and Python

### Does this PR introduce any user-facing change?
Yes.
VarianceThresholdSelector is added to PySpark

### How was this patch tested?
new doctest

Closes #28409 from huaxingao/variance_py.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-06 09:11:03 -05:00
yi.wu b16ea8e1ab [SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries
### What changes were proposed in this pull request?

Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages.

### Why are the changes needed?

If there're subqueries do not belong to any query stages of the main query, the main query could get final physical plan and update UI before those subqueries finished. As a result, the UI can not reflect the change from the subqueries, e.g. new nodes generated from subqueries.

Before:

<img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png">

After:
<img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png">

### Does this PR introduce _any_ user-facing change?

No(AQE feature hasn't been released).

### How was this patch tested?

Tested manually.

Closes #28460 from Ngone51/fix_aqe_ui.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 12:52:53 +00:00
Huaxin Gao f05560bf50 [SPARK-31127][ML] Implement abstract Selector
### What changes were proposed in this pull request?
Implement abstract Selector. Put the common code among ```ANOVASelector```, ```ChiSqSelector```, ```FValueSelector``` and ```VarianceThresholdSelector``` to Selector.

### Why are the changes needed?
code reuse

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27978 from huaxingao/spark-31127.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-06 16:10:30 +08:00
Liang-Chi Hsieh 4952f1a03c [SPARK-31365][SQL] Enable nested predicate pushdown per data sources
### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added nested predicate pushdown feature that is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all or nothing config, and applies on all data sources.

In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:50:06 +00:00
Daoyuan Wang 53a9bf8fec [SPARK-31595][SQL] Spark sql should allow unescaped quote mark in quoted string
### What changes were proposed in this pull request?
`def splitSemiColon` cannot handle unescaped quote mark like "'" or '"' correctly. When there are unmatched quotes in a string, `splitSemiColon` will not drop off semicolon as expected.

### Why are the changes needed?
Some regex expression will use quote mark in string. We should process semicolon correctly.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added Unit test and also manual test.

Closes #28393 from adrian-wang/unescaped.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:34:43 +00:00
zhengruifeng ebdf41dd69 [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;

### Why are the changes needed?
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)

### Does this PR introduce any user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28349 from zhengruifeng/blockify_svc_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-06 10:06:23 +08:00
sychen 588966d696 [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters
### What changes were proposed in this pull request?
Metadata-only queries should not include subquery in partition filters.

### Why are the changes needed?

Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`.

### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.

### How was this patch tested?

add UT

Closes #28383 from cxzl25/fix_SPARK-31590.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-06 10:56:19 +09:00
yi.wu 61a6ca5d3f
[SPARK-31643][TEST] Fix flaky o.a.s.scheduler.BarrierTaskContextSuite.barrier task killed, interrupt
### What changes were proposed in this pull request?

Make sure the task has nearly reached `context.barrier()` before killing.

### Why are the changes needed?

In case of the task is killed before it reaches `context.barrier()`, the task will not create the expected file.

```
Error Message
org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed.
Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed.
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1(BarrierTaskContextSuite.scala:266)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1$adapted(BarrierTaskContextSuite.scala:226)
	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:163)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.testBarrierTaskKilled(BarrierTaskContextSuite.scala:226)
	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$29(BarrierTaskContextSuite.scala:277)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
```

[Here's](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122273/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/barrier_task_killed__interrupt/) the full error messages.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #28454 from Ngone51/fix_kill_interrupt.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-05 12:36:42 -07:00
Steve Loughran 86c4e43525
[SPARK-31644][BUILD] Make Spark's guava version configurable from the command line
### What changes were proposed in this pull request?

This adds the maven property guava.version which can be
used to control the guava version for a build.

It does not change the current version.

### Why are the changes needed?

All future Hadoop releases are going to be built with a later guava version, including Hadoop 3.1.4. This means to run the spark tests with that release you need to update the spark guava version. This patch lets whoever builds spark do this locally.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

ran the hadoop-cloud module tests with the 3.1.4 RC0

```
mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging -Phadoop-cloud,yarn,kinesis-asl test --pl hadoop-cloud
```

observed the linkage problem

```
  java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
```

made the version configurable, retested with

```
-Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging Dguava.version=27.0-jre
```

all good.

Closes #28455 from steveloughran/SPARK-31644-guava-version.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-05 12:17:24 -07:00
Max Gekk bd26429931 [SPARK-31641][SQL] Fix days conversions by JSON legacy parser
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.

### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-11|
+----------+
```

After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### How was this patch tested?
Add a test to `JsonSuite`.

Closes #28453 from MaxGekk/json-rebase-legacy-days.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:15:31 +00:00