Commit graph

26032 commits

Author SHA1 Message Date
Anton Okolnychyi a9f1809a2a [SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be generic
### What changes were proposed in this pull request?

This PR renames `normalizeFilters` in `DataSourceStrategy` to be more generic as the logic is not specific to filters.

### Why are the changes needed?

These changes are needed to support PR #26751.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26830 from aokolnychyi/rename-normalize-exprs.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-10 07:49:22 -08:00
Huaxin Gao 1cac9b2cc6 [SPARK-29967][ML][PYTHON] KMeans support instance weighting
### What changes were proposed in this pull request?
add weight support in KMeans
### Why are the changes needed?
KMeans should support weighting
### Does this PR introduce any user-facing change?
Yes. ```KMeans.setWeightCol```

### How was this patch tested?
Unit Tests

Closes #26739 from huaxingao/spark-29967.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-10 09:33:06 -06:00
yi.wu aa9da9365f [SPARK-30151][SQL] Issue better error message when user-specified schema mismatched
### What changes were proposed in this pull request?

Issue better error message when user-specified schema and not match relation schema

### Why are the changes needed?

Inspired by https://github.com/apache/spark/pull/25248#issuecomment-559594305, user could get a weird error message when type mapping behavior change between Spark schema and datasource schema(e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell user what is really happening here to make user be more clearly about the error.

### Does this PR introduce any user-facing change?

Yes, user will see error message changes.

### How was this patch tested?

Updated existed tests.

Closes #26781 from Ngone51/dev-mismatch-schema.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-10 20:56:21 +08:00
Takeshi Yamamuro be867e8a9e [SPARK-30196][BUILD] Bump lz4-java version to 1.7.0
### What changes were proposed in this pull request?

This pr intends to upgrade lz4-java from 1.6.0 to 1.7.0.

### Why are the changes needed?

This release includes a performance bug (https://github.com/lz4/lz4-java/pull/143) fixed by JoshRosen and some improvements (e.g., LZ4 binary update). You can see the link below for the changes;
https://github.com/lz4/lz4-java/blob/master/CHANGES.md#170

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26823 from maropu/LZ4_1_7_0.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-10 12:22:03 +09:00
Luan 3d98c9f985 [SPARK-30179][SQL][TESTS] Improve test in SingleSessionSuite
### What changes were proposed in this pull request?

improve the temporary functions test in SingleSessionSuite by verifying the result in a query

### Why are the changes needed?

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #26812 from leoluan2009/SPARK-30179.

Authored-by: Luan <xuluan@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-10 10:57:32 +09:00
Sean Owen 36fa1980c2 [SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 compatibility; remove WrappedArray
### What changes were proposed in this pull request?

Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.

### Why are the changes needed?

These both enable building on Scala 2.13.

### Does this PR introduce any user-facing change?

None

### How was this patch tested?

Existing tests

Closes #26787 from srowen/SPARK-30158.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-09 14:41:48 -06:00
Huaxin Gao 8a9cccf1f3 [SPARK-30146][ML][PYSPARK] Add setWeightCol to GBTs in PySpark
### What changes were proposed in this pull request?
add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in Python side of ```GBTClassifier``` and ```GBTRegressor```

### Why are the changes needed?
https://github.com/apache/spark/pull/25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on scala side. This PR will add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on python side

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
doc test

Closes #26774 from huaxingao/spark-30146.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2019-12-09 13:39:33 -06:00
Jungtaek Lim (HeartSaVioR) 538b8d101c [SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports
### What changes were proposed in this pull request?

This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). Looks like Jenkins build may have different policy on checking Java style check or less accurate.

### Why are the changes needed?

Java linter starts complaining.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

lint-java passed locally

This closes #26819

Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-09 08:57:20 -08:00
Luca Canali 729f43f499 [SPARK-27189][CORE] Add Executor metrics and memory usage instrumentation to the metrics system
## What changes were proposed in this pull request?

This PR proposes to add instrumentation of memory usage via the Spark Dropwizard/Codahale metrics system. Memory usage metrics are available via the Executor metrics, recently implemented as detailed in https://issues.apache.org/jira/browse/SPARK-23206.
Additional notes: This takes advantage of the metrics poller introduced in #23767.

## Why are the changes needed?
Executor metrics bring have many useful insights on memory usage, in particular on the usage of storage memory and executor memory. This is useful for troubleshooting. Having the information in the metrics systems allows to add those metrics to Spark performance dashboards and study memory usage as a function of time, as in the example graph https://issues.apache.org/jira/secure/attachment/12962810/Example_dashboard_Spark_Memory_Metrics.PNG

## Does this PR introduce any user-facing change?
Adds `ExecutorMetrics` source to publish executor metrics via the Dropwizard metrics system. Details of the available metrics in docs/monitoring.md
Adds configuration parameter `spark.metrics.executormetrics.source.enabled`

## How was this patch tested?

Tested on YARN cluster and with an existing setup for a Spark dashboard based on InfluxDB and Grafana.

Closes #24132 from LucaCanali/memoryMetricsSource.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-12-09 08:55:30 -06:00
Gengliang Wang a717d219a6 [SPARK-30159][SQL][TESTS] Fix the method calls of QueryTest.checkAnswer
### What changes were proposed in this pull request?

Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen.
The actual exceptions are thrown in the trait `QueryTest`.

However, there are some test suites(`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers.

After this PR:
1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results.
2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites use this one to customize the test failure message.
3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`.

### Why are the changes needed?

We should fix these method calls to perform actual validations in test suites.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #26788 from gengliangwang/fixCheckAnswer.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-09 22:19:08 +09:00
fuwhu c2f29d5ea5 [SPARK-30138][SQL] Separate configuration key of max iterations for analyzer and optimizer
### What changes were proposed in this pull request?
separate the configuration keys "spark.sql.optimizer.maxIterations" and "spark.sql.analyzer.maxIterations".

### Why are the changes needed?
Currently, both Analyzer and Optimizer use conf "spark.sql.optimizer.maxIterations" to set the max iterations to run, which is a little confusing.
It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for analyzer max iterations.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Existing unit tests.

Closes #26766 from fuwhu/SPARK-30138.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-12-09 19:43:32 +09:00
Aman Omer dcea7a4c9a [SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and bool_or()
### What changes were proposed in this pull request?
This PR introduces a method `expressionWithAlias` in class `FunctionRegistry` which is used to register function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`.

### Why are the changes needed?
Error message is wrong when alias name is used for `BoolAnd` & `BoolOr`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested manually.

For query,
`select every('true');`

Output before this PR,

> Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7;

After this PR,

> Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7;

Closes #26712 from amanomer/29883.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-09 13:23:16 +08:00
HyukjinKwon a57bbf2ee0 [SPARK-30164][TESTS][DOCS] Exclude Hive domain in Unidoc build explicitly
### What changes were proposed in this pull request?

This PR proposes to exclude Unidoc checking in Hive domain. We don't publish this as a part of Spark documentation (see also https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30) and most of them are copy of Hive thrift server so that we can officially use Hive 2.3 release.

It doesn't much make sense to check the documentation generation against another domain, and that we don't use in documentation publish.

### Why are the changes needed?

To avoid unnecessary computation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

By Jenkins:

```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-2.7 -Phive-2.3 -Phive -Pmesos -Pkubernetes -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn test:package streaming-kinesis-asl-assembly/assembly
...

========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-2.7 -Phive-2.3 -Phive -Pmesos -Pkubernetes -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn unidoc
...
[info] Main Java API documentation successful.
...
[info] Main Scala API documentation successful.
```

Closes #26800 from HyukjinKwon/do-not-merge.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-09 13:15:49 +09:00
Pablo Langa bca9de6684 [SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands.

We don’t have this methods in the catalog to implement an V2 command
* catalog.listFunctions

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing
`SHOW FUNCTIONS LIKE namespace.function`

### Does this PR introduce any user-facing change?

Yes. When running SHOW FUNCTIONS LIKE namespace.function Spark fails the command if the current catalog is set to a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-12-08 20:15:09 -08:00
Dongjoon Hyun 16f1b23d75 [SPARK-30163][INFRA][FOLLOWUP] Make .m2 directory for cold start without cache
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/26793 and aims to initialize `~/.m2` directory.

### Why are the changes needed?

In case of cache reset, `~/.m2` directory doesn't exist. It causes a failure.
- `master` branch has a cache as of now. So, we missed this.
- `branch-2.4` has no cache as of now, and we hit this failure.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR is tested against personal `branch-2.4`.
- https://github.com/dongjoon-hyun/spark/pull/12

Closes #26794 from dongjoon-hyun/SPARK-30163-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-07 12:58:00 -08:00
Dongjoon Hyun 1068b8b249 [SPARK-30163][INFRA] Use Google Maven mirror in GitHub Action
### What changes were proposed in this pull request?

This PR aims to use [Google Maven mirror](https://cloudplatform.googleblog.com/2015/11/faster-builds-for-Java-developers-with-Maven-Central-mirror.html) in `GitHub Action` jobs to improve the stability.

```xml
<settings>
  <mirrors>
    <mirror>
      <id>google-maven-central</id>
      <name>GCS Maven Central mirror</name>
      <url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```

### Why are the changes needed?

Although we added Maven cache inside `GitHub Action`, the timeouts happen too frequently during access `artifact descriptor`.

```
[ERROR] Failed to execute goal on project spark-mllib_2.12:
...
Failed to read artifact descriptor for ...
...
Connection timed out (Read failed) -> [Help 1]
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This PR is irrelevant to Jenkins.

This is tested on the personal repository first. `GitHub Action` of this PR should pass.
- https://github.com/dongjoon-hyun/spark/pull/11

Closes #26793 from dongjoon-hyun/SPARK-30163.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-07 12:04:10 -08:00
Kent Yao e88d74052b [SPARK-30147][SQL] Trim the string when cast string type to booleans
### What changes were proposed in this pull request?

Now, we trim the string when casting string value to those `canCast` types values, e.g. int, double, decimal, interval, date, timestamps, except for boolean.
This behavior makes type cast and coercion inconsistency in Spark.
Not fitting ANSI SQL standard either.
```
If TD is boolean, then
Case:
a) If SD is character string, then SV is replaced by
    TRIM ( BOTH ' ' FROM VE )
    Case:
    i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid
value of the data type TD, then let TV be that value.
   ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast.
b) If SD is boolean, then TV is SV
```
In this pull request, we trim all the whitespaces from both ends of the string before converting it to a bool value. This behavior is as same as others, but a bit different from sql standard, which trim only spaces.

### Why are the changes needed?

Type cast/coercion consistency

### Does this PR introduce any user-facing change?

yes, string with whitespaces in both ends will be trimmed before converted to booleans.

e.g. `select cast('\t true' as boolean)` results `true` now, before this pr it's `null`
### How was this patch tested?

add unit tests

Closes #26776 from yaooqinn/SPARK-30147.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-12-07 15:03:51 +09:00
Dongjoon Hyun afc4fa02bd [SPARK-30156][BUILD] Upgrade Jersey from 2.29 to 2.29.1
### What changes were proposed in this pull request?

This PR aims to upgrade `Jersey` from 2.29 to 2.29.1.

### Why are the changes needed?

This will bring several bug fixes and important dependency upgrades.
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.29.1.html

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #26785 from dongjoon-hyun/SPARK-30156.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 18:49:43 -08:00
Dongjoon Hyun 1e0037b5e9 [SPARK-30157][BUILD][TEST-HADOOP3.2][TEST-JAVA11] Upgrade Apache HttpCore from 4.4.10 to 4.4.12
### What changes were proposed in this pull request?

This PR aims to upgrade `Apache HttpCore` from 4.4.10 to 4.4.12.

### Why are the changes needed?

`Apache HttpCore v4.4.11` is the first official release for JDK11.
> This is a maintenance release that corrects a number of defects in non-blocking SSL session code that caused compatibility issues with TLSv1.3 protocol implementation shipped with Java 11.

For the full release note, please see the following.
- https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins.

Closes #26786 from dongjoon-hyun/SPARK-30157.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-07 10:59:10 +09:00
Aman Omer 51aa7a920e [SPARK-30148][SQL] Optimize writing plans if there is an analysis exception
### What changes were proposed in this pull request?
Optimized QueryExecution.scala#writePlans().

### Why are the changes needed?
If any query fails in Analysis phase and gets AnalysisException, there is no need to execute further phases since those will return a same result i.e, AnalysisException.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually

Closes #26778 from amanomer/optExplain.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-07 10:58:02 +09:00
Sean Owen a30ec19a73 [SPARK-30155][SQL] Rename parse() to parseString() to avoid conflict in Scala 2.13
### What changes were proposed in this pull request?

Rename internal method LegacyTypeStringParser.parse() to parseString().

### Why are the changes needed?

In Scala 2.13, the parse() definition clashes with supertype declarations.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #26784 from srowen/SPARK-30155.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 16:16:28 -08:00
Dongjoon Hyun 81996f9e4d [SPARK-30152][INFRA] Enable Hadoop-2.7/JDK11 build at GitHub Action
### What changes were proposed in this pull request?

This PR enables JDK11 build with `hadoop-2.7` profile at `GitHub Action`.

**BEFORE (6 jobs including one JDK11 job)**
![before](https://user-images.githubusercontent.com/9700541/70342731-7763f300-180a-11ea-859f-69038b88451f.png)

**AFTER (7 jobs including two JDK11 jobs)**
![after](https://user-images.githubusercontent.com/9700541/70342658-54d1da00-180a-11ea-9fba-507fc087dc62.png)

### Why are the changes needed?

SPARK-29957 makes JDK11 test work with `hadoop-2.7` profile. We need to protect it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is `GitHub Action` only PR. See the result of `GitHub Action` on this PR.

Closes #26782 from dongjoon-hyun/SPARK-GHA-HADOOP-2.7.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 12:01:36 -08:00
wuyi 58be82ad4b [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

In this PR, we propose to use the value of `spark.sql.source.default` as the provider for `CREATE TABLE` syntax instead of `hive` in Spark 3.0.

And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

### Why are the changes needed?

1. Currently, `CREATE TABLE` syntax use hive provider to create table while `DataFrameWriter.saveAsTable` API using the value of `spark.sql.source.default` as a provider to create table. It would be better to make them consistent.

2. User may gets confused in some cases. For example:

```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```

In these two DDLs, use may think that `t2` should also use parquet as default provider since Spark always advertise parquet as the default format. However, it's hive in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCATS=true`:

```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
And these two cases together can be really confusing.

3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior.

### Does this PR introduce any user-facing change?

Yes, before this PR, using `CREATE TABLE` syntax will use hive provider. But now, it use the value of `spark.sql.source.default` as its provider.

### How was this patch tested?

Added tests in `DDLParserSuite` and `HiveDDlSuite`.

Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-07 02:15:25 +08:00
Liang-Chi Hsieh c1a5f94973 [SPARK-30112][SQL] Allow insert overwrite same table if using dynamic partition overwrite
### What changes were proposed in this pull request?

This patch proposes to allow insert overwrite same table if using dynamic partition overwrite.

### Why are the changes needed?

Currently, Insert overwrite cannot overwrite to same table even it is dynamic partition overwrite. But for dynamic partition overwrite, we do not delete partition directories ahead. We write to staging directories and move data to final partition directories. We should be able to insert overwrite to same table under dynamic partition overwrite.

This enables users to read data from a table and insert overwrite to same table by using dynamic partition overwrite. Because this is not allowed for now, users need to write to other temporary location and move it back to the table.

### Does this PR introduce any user-facing change?

Yes. Users can insert overwrite same table if using dynamic partition overwrite.

### How was this patch tested?

Unit test.

Closes #26752 from viirya/dynamic-overwrite-same-table.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 09:22:16 -08:00
Sean Owen c8ed71b3cd [SPARK-30011][SQL] Inline 2.12 "AsIfIntegral" classes, not present in 2.13
### What changes were proposed in this pull request?

Classes like DoubleAsIfIntegral are not found in Scala 2.13, but used in the current build. This change 'inlines' the 2.12 implementation and makes it work with both 2.12 and 2.13.

### Why are the changes needed?

To cross-compile with 2.13.

### Does this PR introduce any user-facing change?

It should not as it copies in 2.12's existing behavior.

### How was this patch tested?

Existing tests.

Closes #26769 from srowen/SPARK-30011.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-06 08:15:38 -08:00
Dongjoon Hyun 1595e46a4e [SPARK-30142][TEST-MAVEN][BUILD] Upgrade Maven to 3.6.3
### What changes were proposed in this pull request?

This PR aims to upgrade Maven from 3.6.2 to 3.6.3.

### Why are the changes needed?

This will bring bug fixes like the following.
- MNG-6759 Maven fails to use <repositories> section from dependency when resolving transitive dependencies in some cases
- MNG-6760 ExclusionArtifactFilter result invalid when wildcard exclusion is followed by other exclusions

The following is the full release note.
- https://maven.apache.org/docs/3.6.3/release-notes.html

### Does this PR introduce any user-facing change?

No. (This is a dev-environment change.)

### How was this patch tested?

Pass the Jenkins with both SBT and Maven.

Closes #26770 from dongjoon-hyun/SPARK-30142.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-06 23:41:59 +09:00
gengjiaan 187f3c1773 [SPARK-28083][SQL] Support LIKE ... ESCAPE syntax
## What changes were proposed in this pull request?

The syntax 'LIKE predicate: ESCAPE clause' is a ANSI SQL.
For example:

```
select 'abcSpark_13sd' LIKE '%Spark\\_%';             //true
select 'abcSpark_13sd' LIKE '%Spark/_%';              //false
select 'abcSpark_13sd' LIKE '%Spark"_%';              //false
select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';   //true
select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';   //true
select 'abcSpark%13sd' LIKE '%Spark\\%%';             //true
select 'abcSpark%13sd' LIKE '%Spark/%%';              //false
select 'abcSpark%13sd' LIKE '%Spark"%%';              //false
select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';   //true
select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';   //true
select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';          //true
select 'abcSpark/13sd' LIKE '%Spark//_%';             //false
select 'abcSpark"13sd' LIKE '%Spark""_%';             //false
select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';  //true
select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';  //true
```
But Spark SQL only supports 'LIKE predicate'.

Note: If the input string or pattern string is null, then the result is null too.

There are some mainstream database support the syntax.

**PostgreSQL:**
https://www.postgresql.org/docs/11/functions-matching.html

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape

**MySQL:**
https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html

**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A

## How was this patch tested?

Exists UT and new UT.

This PR merged to my production environment and runs above sql:
```
spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%';
true
Time taken: 0.119 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%';
false
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%';
false
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';
true
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';
true
Time taken: 0.092 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%';
true
Time taken: 0.109 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%';
false
Time taken: 0.1 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%';
false
Time taken: 0.081 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';
true
Time taken: 0.113 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';
true
Time taken: 0.078 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%';
false
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%';
false
Time taken: 0.084 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
```
I create a table and its schema is:
```
spark-sql> desc formatted gja_test;
key     string  NULL
value   string  NULL
other   string  NULL

# Detailed Table Information
Database        test
Table   gja_test
Owner   test
Created Time    Wed Apr 10 11:06:15 CST 2019
Last Access     Thu Jan 01 08:00:00 CST 1970
Created By      Spark 2.4.1-SNAPSHOT
Type    MANAGED
Provider        hive
Table Properties        [transient_lastDdlTime=1563443838]
Statistics      26 bytes
Location        hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test
Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat     org.apache.hadoop.mapred.TextInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties      [field.delim=   , serialization.format= ]
Partition Provider      Catalog
Time taken: 0.642 seconds, Fetched 21 row(s)
```
Table `gja_test` exists three rows of data.
```
spark-sql> select * from gja_test;
a       A       ao
b       B       bo
"__     """__   "
Time taken: 0.665 seconds, Fetched 3 row(s)
```
At finally, I test this function:
```
spark-sql> select * from gja_test where key like value escape '"';
"__     """__   "
Time taken: 0.687 seconds, Fetched 1 row(s)
```

Closes #25001 from beliefer/ansi-sql-like.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-12-06 00:07:38 -08:00
Terry Kim b86d4bb931 [SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables
### What changes were proposed in this pull request?

This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and create an appropriate relation.

### Why are the changes needed?

Currently there are two issues:
1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving relation can load a table twice, which is a perf regression (e.g., Hive metastore can be accessed twice).
2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails:
```
scala> sql("create table t using csv as select 1 as i")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [spark_catalog, t]
```

### Does this PR introduce any user-facing change?

Yes. Now the catalog name is resolved correctly:
```
scala> sql("create table t using csv as select 1 as i")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+---+
|  i|
+---+
|  1|
+---+

scala> sql("select * from spark_catalog.t").show
+---+
|  i|
+---+
|  1|
+---+
```

### How was this patch tested?

Added new tests.

Closes #26684 from imback82/resolve_relation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-06 15:45:13 +08:00
madianjun a5ccbced8c [SPARK-30067][CORE] Fix fragment offset comparison in getBlockHosts
### What changes were proposed in this pull request?

A bug fixed about the code in getBlockHosts() function. In the case "The fragment ends at a position within this block", the end of fragment should be before the end of block,where the "end of block" means `b.getOffset + b.getLength`,not `b.getLength`.

### Why are the changes needed?

When comparing the fragment end and the block end,we should use fragment's `offset + length`,and then compare to the block's `b.getOffset + b.getLength`, not the block's length.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?
No test.

Closes #26650 from mdianjun/fix-getBlockHosts.

Authored-by: madianjun <madianjun@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 23:39:49 -08:00
angerszhu da27f91560 [SPARK-29957][TEST] Reset MiniKDC's default enctypes to fit jdk8/jdk11
### What changes were proposed in this pull request?

Hadoop jira: https://issues.apache.org/jira/browse/HADOOP-12911
In this jira, the author said to replace origin Apache Directory project which is not maintained (but not said it won't work well in jdk11) to Apache Kerby which is java binding(fit java version).

And in Flink: https://github.com/apache/flink/pull/9622
Author show the reason why hadoop-2.7.2's  `MminiKdc` failed with jdk11.
Because new encryption types of `es128-cts-hmac-sha256-128` and `aes256-cts-hmac-sha384-192` (for Kerberos 5) enabled by default were added in Java 11.
Spark with `hadoop-2.7's MiniKdc`does not support these encryption types and does not work well when these encryption types are enabled, which results in the authentication failure.

And when I test hadoop-2.7.2's minikdc in local, the kerberos 's debug error message is  read message stream failed, message can't match.

### Why are the changes needed?
Support jdk11 with hadoop-2.7

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
Existed UT

Closes #26594 from AngersZhuuuu/minikdc-3.2.0.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 23:12:45 -08:00
Jungtaek Lim (HeartSaVioR) 25431d79f7
[SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
### What changes were proposed in this pull request?

This patch prevents the cleanup operation in FileStreamSource if the source files belong to the FileStreamSink. This is needed because the output of FileStreamSink can be read with multiple Spark queries and queries will read the files based on the metadata log, which won't reflect the cleanup.

To simplify the logic, the patch only takes care of the case of when the source path without glob pattern refers to the output directory of FileStreamSink, via checking FileStreamSource to see whether it leverages metadata directory or not to list the source files.

### Why are the changes needed?

Without this patch, if end users turn on cleanup option with the path which is the output of FileStreamSink, there may be out of sync between metadata and available files which may break other queries reading the path.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT.

Closes #26590 from HeartSaVioR/SPARK-29953.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-12-05 21:46:28 -08:00
Liang-Chi Hsieh 755d889448 [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large
### What changes were proposed in this pull request?

This patch adds normalization to word vectors when fitting dataset in Word2Vec.

### Why are the changes needed?

Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.

### Does this PR introduce any user-facing change?

Yes. After this patch, Word2Vec won't produce infinity word vectors.

### How was this patch tested?

Manually. This issue is not always reproducible on any dataset. The dataset known to reproduce it is too large (925M) to upload.

```scala
case class Sentences(name: String, words: Array[String])
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```

Before:
```
Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-Infinity,-Infin...|
|       Talent|[-Infinity,Infini...|
|    Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
|      Priests|[-1.3570403355926...|
|    Religion:|[-6.7049072282803...|
|           Bu|[5.05591774315586...|
|      Totoro:|[-1.0539840178632...|
|     Trouble,|[-3.5363592836003...|
|       Hatter|[4.90413981352826...|
|          '79|[7.50436471285412...|
|         Vile|[-2.9147142985312...|
|         9/11|[-Infinity,Infini...|
|      Santino|[1.30005911270850...|
|      Motives|[-1.2538958306253...|
|          '13|[-4.5040152427657...|
|       Fierce|[Infinity,Infinit...|
|       Stover|[-2.6326895394029...|
|          'It|[1.66574533864436...|
|        Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```

After:
```
Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-0.0454501919448...|
|       Talent|[-0.2657704949378...|
|    Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
|      Priests|[-0.0047509293071...|
|    Religion:|[-0.0411605164408...|
|           Bu|[0.11837736517190...|
|      Totoro:|[0.05258282646536...|
|     Trouble,|[0.09482011198997...|
|       Hatter|[0.06040831282734...|
|          '79|[0.04783720895648...|
|         Vile|[-0.0017210749210...|
|         9/11|[-0.0713915303349...|
|      Santino|[-0.0412711687386...|
|      Motives|[-0.0492418706417...|
|          '13|[-0.0073119504377...|
|       Fierce|[-0.0565455369651...|
|       Stover|[0.06938160210847...|
|          'It|[0.01117012929171...|
|        Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```

Closes #26722 from viirya/SPARK-24666-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-12-05 16:32:33 -08:00
Sean Owen 7782b61a31 [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748

I told you it'd be iterative!

Closes #26765 from srowen/SPARK-29392.3.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-05 13:48:29 -08:00
Aman Omer 5892bbf447 [SPARK-30124][MLLIB] unnecessary persist in PythonMLLibAPI.scala
### What changes were proposed in this pull request?
Removed unnecessary persist.

### Why are the changes needed?
Persist in `PythonMLLibAPI.scala` is unnecessary because later in `run()` of `gmmAlg` is caching the data.
710ddab39e/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (L167-L171)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually

Closes #26758 from amanomer/improperPersist.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-12-05 11:54:45 -06:00
Kent Yao 35bab33984 [SPARK-30121][BUILD] Fix memory usage in sbt build script
### What changes were proposed in this pull request?
1. the default memory setting is missing in usage instructions
```
build/sbt -h
```
before
```
-mem    <integer>  set memory options (default: , which is -Xms2048m -Xmx2048m -XX:ReservedCodeCacheSize=256m)
```
after
```
-mem    <integer>  set memory options (default: 2048, which is -Xms2048m -Xmx2048m -XX:ReservedCodeCacheSize=256m)
```
2. the Perm space is not needed anymore, since java7 is removed.

the changes in this pr are based on the main sbt script of the newest stable version 1.3.4.

### Why are the changes needed?

bug fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

manually

Closes #26757 from yaooqinn/SPARK-30121.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-12-05 11:50:55 -06:00
Kent Yao b9cae37750 [SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres
# What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:
```scala
 /**
   * For [[Add]]:
   * 1. if both side are interval, stays the same;
   * 2. else if one side is interval, turns it to [[TimeAdd]];
   * 3. else if one side is date, turns it to [[DateAdd]] ;
   * 4. else stays the same.
   *
   * For [[Subtract]]:
   * 1. if both side are interval, stays the same;
   * 2. else if the right side is an interval, turns it to [[TimeSub]];
   * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
   * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
   * 5. else if the left side is date, turns it to [[DateSub]];
   * 6. else turns it to stays the same.
   *
   * For [[Multiply]]:
   * 1. If one side is interval, turns it to [[MultiplyInterval]];
   * 2. otherwise, stays the same.
   *
   * For [[Divide]]:
   * 1. If the left side is interval, turns it to [[DivideInterval]];
   * 2. otherwise, stays the same.
   */
```
Besides, we change datetime functions from implicit cast types to strict ones, all available type coercions happen in `DateTimeOperations` coercion rule.
### Why are the changes needed?

Feature Parity between PostgreSQL and Spark, and make the null semantic consistent with Spark.

### Does this PR introduce any user-facing change?

1. date_add/date_sub functions only accept int/tinynit/smallint as the second arg, double/string etc, are forbidden like hive, which produce weird results.

### How was this patch tested?

add ut

Closes #26412 from yaooqinn/SPARK-29774.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 22:03:44 +08:00
Kent Yao 332e252a14 [SPARK-29425][SQL] The ownership of a database should be respected
### What changes were proposed in this pull request?

Keep the owner of a database when executing alter database commands

### Why are the changes needed?

Spark will inadvertently delete the owner of a database for executing databases ddls

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

add and modify uts

Closes #26080 from yaooqinn/SPARK-29425.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 16:14:27 +08:00
turbofei 0ab922c1eb [SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery
### What changes were proposed in this pull request?
There is an issue for InSubquery expression.
For example, there are two tables `ta` and `tb` created by the below statements.
```
 sql("create table ta(id Decimal(18,0)) using parquet")
 sql("create table tb(id Decimal(19,0)) using parquet")
```
This statement below would thrown dataType mismatch exception.

```
 sql("select * from ta where id in (select id from tb)").show()
```
However, this similar statement could execute successfully.

```
 sql("select * from ta where id in ((select id from tb))").show()
```
The root cause is that, for `InSubquery` expression, it does not find a common type for two decimalType like `In` expression.
Besides that, for `InSubquery` expression, it also does not find a common type for DecimalType and double/float/bigInt.
In this PR, I fix this issue by finding widerType for `InSubquery` expression when DecimalType is involved.

### Why are the changes needed?
Some InSubquery would throw dataType mismatch exception.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #26485 from turboFei/SPARK-29860-in-subquery.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 16:00:16 +08:00
Aman Omer 0bd8b995d6 [SPARK-30093][SQL] Improve error message for creating view
### What changes were proposed in this pull request?
Improved error message while creating views.

### Why are the changes needed?
Error message should suggest user to use TEMPORARY keyword while creating permanent view referred by temporary view.
https://github.com/apache/spark/pull/26317#discussion_r352377363

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Updated test case.

Closes #26731 from amanomer/imp_err_msg.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 15:28:07 +08:00
Sean Owen ebd83a544e [SPARK-30009][CORE][SQL][FOLLOWUP] Remove OrderingUtil and Utils.nanSafeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly
### What changes were proposed in this pull request?

Follow up on https://github.com/apache/spark/pull/26654#discussion_r353826162
Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`.

### Why are the changes needed?

Simplification of the previous change, which existed to support Scala 2.13 migration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #26761 from srowen/SPARK-30009.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-05 11:27:25 +08:00
Marcelo Vanzin c5f312a6ac [SPARK-30129][CORE] Set client's id in TransportClient after successful auth
The new auth code was missing this bit, so it was not possible to know which
app a client belonged to when auth was on.

I also refactored the SASL test that checks for this so it also checks the
new protocol (test failed before the fix, passes now).

Closes #26760 from vanzin/SPARK-30129.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-04 17:11:50 -08:00
Nicholas Chammas 29e09a83b7 [SPARK-30084][DOCS] Document how to trigger Jekyll build on Python API doc changes
### What changes were proposed in this pull request?

This PR adds a note to the docs README showing how to get Jekyll to automatically pick up changes to the Python API docs.

### Why are the changes needed?

`jekyll serve --watch` doesn't watch for changes to the API docs. Without the technique documented in this note, or something equivalent, developers have to manually retrigger a Jekyll build any time they update the Python API docs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I tested this PR manually by making changes to Python docstrings and confirming that Jekyll automatically picks them up and serves them locally.

Closes #26719 from nchammas/SPARK-30084-watch-api-docs.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-12-04 17:31:23 -06:00
Sean Owen 2ceed6f32c [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings
### What changes were proposed in this pull request?

Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.

### Why are the changes needed?

Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.

The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.

While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.

### Does this PR introduce any user-facing change?

Should not change behavior.

### How was this patch tested?

Existing tests.

Closes #26748 from srowen/SPARK-29392.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-04 15:03:26 -08:00
07ARB a2102c81ee [SPARK-29453][WEBUI] Improve tooltips information for SQL tab
### What changes were proposed in this pull request?
Adding tooltip to SQL tab for better usability.

### Why are the changes needed?
There are a few common points of confusion in the UI that could be clarified with tooltips. We
 should add tooltips to explain.

### Does this PR introduce any user-facing change?
yes.
![Screenshot 2019-11-23 at 9 47 41 AM](https://user-images.githubusercontent.com/8948111/69472963-aaec5980-0dd6-11ea-881a-fe6266171054.png)

### How was this patch tested?
Manual test.

Closes #26641 from 07ARB/SPARK-29453.

Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-12-04 12:33:43 -06:00
zhengruifeng 710ddab39e [SPARK-29914][ML] ML models attach metadata in transform/transformSchema
### What changes were proposed in this pull request?
1, `predictionCol` in `ml.classification` & `ml.clustering` add `NominalAttribute`
2, `rawPredictionCol` in `ml.classification` add `AttributeGroup` containing vectorsize=`numClasses`
3, `probabilityCol` in `ml.classification` & `ml.clustering` add `AttributeGroup` containing vectorsize=`numClasses`/`k`
4, `leafCol` in GBT/RF  add `AttributeGroup` containing vectorsize=`numTrees`
5, `leafCol` in DecisionTree  add `NominalAttribute`
6, `outputCol` in models in `ml.feature` add `AttributeGroup` containing vectorsize
7, `outputCol` in `UnaryTransformer`s in `ml.feature` add `AttributeGroup` containing vectorsize

### Why are the changes needed?
Appened metadata can be used in downstream ops, like `Classifier.getNumClasses`

There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in `transform`/`transformSchema` method.

However there are also many impls return no metadata in transformation, even some metadata like `vector.size`/`numAttrs`/`attrs` can be ealily inferred.

### Does this PR introduce any user-facing change?
Yes, add some metadatas in transformed dataset.

### How was this patch tested?
existing testsuites and added testsuites

Closes #26547 from zhengruifeng/add_output_vecSize.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-04 16:39:57 +08:00
Aman Omer 55132ae9c9 [SPARK-30099][SQL] Improve Analyzed Logical Plan
### What changes were proposed in this pull request?
Avoid duplicate error message in Analyzed Logical plan.

### Why are the changes needed?
Currently, when any query throws `AnalysisException`, same error message will be repeated because of following code segment.
04a5b8f5f8/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (L157-L166)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually. Result of `explain extended select * from wrong;`
BEFORE
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>

AFTER
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>

Closes #26734 from amanomer/cor_APlan.

Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-12-04 13:51:40 +08:00
Nicholas Chammas c8922d9145 [SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC APIs
### What changes were proposed in this pull request?

This PR is a follow-up to #24043 and cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs.

### Why are the changes needed?

So the Python API matches the Scala API.

### Does this PR introduce any user-facing change?

Yes, it adds a new option directly in the ORC reader method signatures.

### How was this patch tested?

I tested this manually as follows:

```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```

Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-04 11:44:24 +09:00
Nicholas Chammas e766a323bc [SPARK-30091][SQL][PYTHON] Document mergeSchema option directly in the PySpark Parquet APIs
### What changes were proposed in this pull request?

This change properly documents the `mergeSchema` option directly in the Python APIs for reading Parquet data.

### Why are the changes needed?

The docstring for `DataFrameReader.parquet()` mentions `mergeSchema` but doesn't show it in the API. It seems like a simple oversight.

Before this PR, you'd have to do this to use `mergeSchema`:

```python
spark.read.option('mergeSchema', True).parquet('test-parquet').show()
```

After this PR, you can use the option as (I believe) it was intended to be used:

```python
spark.read.parquet('test-parquet', mergeSchema=True).show()
```

### Does this PR introduce any user-facing change?

Yes, this PR changes the signatures of `DataFrameReader.parquet()` and `DataStreamReader.parquet()` to match their docstrings.

### How was this patch tested?

Testing the `mergeSchema` option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works.

I also confirmed that setting `spark.sql.parquet.mergeSchema` at the session does not get overridden by leaving `mergeSchema` at its default when calling `parquet()`:

```
>>> spark.conf.set('spark.sql.parquet.mergeSchema', True)
>>> spark.range(3).write.parquet('test-parquet/id')
>>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show()
+----+----+
|  id|name|
+----+----+
|null|   1|
|null|   2|
|null|   0|
|   1|null|
|   2|null|
|   0|null|
+----+----+
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show()
+----+
|  id|
+----+
|null|
|null|
|null|
|   1|
|   2|
|   0|
+----+
```

Closes #26730 from nchammas/parquet-merge-schema.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-12-04 11:31:57 +09:00
Ilan Filonenko 708cf16be9 [SPARK-30111][K8S] Apt-get update to fix debian issues
### What changes were proposed in this pull request?
Added apt-get update as per [docker best-practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get)

### Why are the changes needed?
Builder is failing because:
Without doing apt-get update, the APT lists get outdated and begins referring to package versions that no longer exist, hence the 404 trying to download them (Debian does not keep old versions in the archive when a package is updated).

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
k8s builder

Closes #26753 from ifilonenko/SPARK-30111.

Authored-by: Ilan Filonenko <ifilonenko@bloomberg.net>
Signed-off-by: shane knapp <incomplete@gmail.com>
2019-12-03 17:59:02 -08:00
zhengruifeng 5496e980e9 [SPARK-30109][ML] PCA use BLAS.gemv for sparse vectors
### What changes were proposed in this pull request?
When PCA was first impled in [SPARK-5521](https://issues.apache.org/jira/browse/SPARK-5521), at that time Matrix.multiply(BLAS.gemv internally) did not support sparse vector. So worked around it by applying a sparse matrix multiplication.

Since [SPARK-7681](https://issues.apache.org/jira/browse/SPARK-7681), BLAS.gemv supported sparse vector. So we can directly use Matrix.multiply now.

### Why are the changes needed?
for simplity

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #26745 from zhengruifeng/pca_mul.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-12-04 09:50:00 +08:00