Commit graph

28383 commits

Author SHA1 Message Date
Liang-Chi Hsieh 87b32f65ef [MINOR][DOCS][TESTS] Fix PLAN_CHANGE_LOG_LEVEL document
### What changes were proposed in this pull request?

`PLAN_CHANGE_LOG_LEVEL` config document is wrong. This is to fix it.

### Why are the changes needed?

Fix wrong doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #30136 from viirya/minor-sqlconf.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-23 13:35:46 +09:00
Ankit Srivastava 3819d39607 [SPARK-32998][BUILD] Add ability to override default remote repos with internal one
### What changes were proposed in this pull request?
- Building spark internally in orgs where access to outside internet is not allowed takes a long time because unsuccessful attempts are made to download artifacts from repositories which are not accessible. The unsuccessful attempts unnecessarily add significant amount of time to the build. I have seen a difference of up-to 1hr for some runs.
- Adding 1 environment variables that should be present that the start of the build and if they exist, override the default repos defined in the code and scripts.
envVariables:
      - DEFAULT_ARTIFACT_REPOSITORY=https://artifacts.internal.com/libs-release/

### Why are the changes needed?

To allow orgs to build spark internally without relying on external repositories for artifact downloads.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Multiple builds with and without env variables set.

Closes #29874 from ankits/SPARK-32998.

Authored-by: Ankit Srivastava <ankit_srivastava@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-22 16:35:55 -07:00
Max Gekk a03d77d326 [SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key org.apache.spark.int96NoRebase by org.apache.spark.legacyINT96
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change handling the metadata key in read:
    - If there is no the key in parquet metadata, take the rebase mode from the SQL config: `spark.sql.legacy.parquet.int96RebaseModeInRead`
    - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for INT96 type.
    - For files written by Spark >= 3.1.0, if the `org.apache.spark.legacyINT96` presents in metadata, perform rebasing otherwise don't.

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 15:57:03 +00:00
yangjie01 b38f3a5557 [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
### What changes were proposed in this pull request?

The purpose of this pr is to resolve SPARK-32978.

The main reason of bad case describe in SPARK-32978 is the `BasicWriteTaskStatsTracker` directly reports the new added partition number of each task, which makes it impossible to remove duplicate data in driver side.

The main of this pr is change to report partitionValues to driver and remove duplicate data at driver side to make sure the number of dynamic part metric is correct.

### Why are the changes needed?
The the number of dynamic part metric we display on the UI should be correct.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a new test case refer to described in SPARK-32978

Closes #30026 from LuciferYang/SPARK-32978.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 14:01:07 +00:00
angerszhu a1629b4a57 [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location
### What changes were proposed in this pull request?
Support `spark.sql.hive.metastore.jars` use HDFS location.

When user need to use path to set hive metastore jars, you should set
`spark.sql.hive.metasstore.jars=path` and set real path in `spark.sql.hive.metastore.jars.path`
since we use `File.pathSeperator` to split path, but `FIle.pathSeparator` is `:` in unix, it will split hdfs location `hdfs://nameservice/xx`. So add new config `spark.sql.hive.metastore.jars.path` to set comma separated paths.
To keep both two way supported

### Why are the changes needed?
All spark app can fetch internal version hive jars in HDFS location, not need distribute to all node.

### Does this PR introduce _any_ user-facing change?
User can use HDFS location to store hive metastore jars

### How was this patch tested?
Manuel tested.

Closes #29881 from AngersZhuuuu/SPARK-32852.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 13:53:01 +00:00
Prashant Sharma 8cae7f88b0 [SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### What changes were proposed in this pull request?

Override the default SQL strings for:
ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY
in the following MySQL JDBC dialect according to official documentation.
Write MySQL integration tests for JDBC.

### Why are the changes needed?
Improved code coverage and support mysql dialect for jdbc.

### Does this PR introduce _any_ user-facing change?

Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

### How was this patch tested?

Added tests.

Closes #30025 from ScrapCodes/mysql-dialect.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 13:51:42 +00:00
Xuedong Luan d9ee33cfb9 [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver
### What changes were proposed in this pull request?

Support query auto cancelling when running too long on thriftserver.

This is the rework of #28991 and the credit should be the original author, leoluan2009.

Closes #28991

### Why are the changes needed?

For some cases, we use thriftserver as long-running applications.
Some times we want all the query need not to run more than given time.
In these cases, we can enable auto cancel for time-consumed query.Which can let us release resources for other queries to run.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #29933 from maropu/pr28991.

Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-22 17:23:10 +09:00
Dongjoon Hyun a908b67502 [SPARK-33218][CORE] Update misleading log messages for removed shuffle blocks
### What changes were proposed in this pull request?

This updates the misleading log messages for removed shuffle block during migration.

### Why are the changes needed?

1. For the deleted shuffle blocks, `IndexShuffleBlockResolver` shows users WARN message saying `skipping migration`. However, `BlockManagerDecommissioner` shows users INFO message including `Migrated ShuffleBlockInfo(...)` inconsistently. Technically, we didn't migrated. We should not show `Migrated` message in this case.

```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
WARN IndexShuffleBlockResolver: Failed to resolve shuffle block ShuffleBlockInfo(109,18924), skipping migration. This is expected to occur if a block is removed after decommissioning has started.
INFO BlockManagerDecommissioner: Got migration sub-blocks List()
...
INFO BlockManagerDecommissioner: Migrated ShuffleBlockInfo(109,18924) to BlockManagerId(...)
```

2. In addition, if the shuffle file is deleted while the information is in the queue, the above messages are repeated multiple times, `spark.storage.decommission.maxReplicationFailuresPerBlock`. We had better use one line instead of the group of messages for that case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (0 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (1 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
```

3. Skipping or not is a role of `BlockManagerDecommissioner` class. `IndexShuffleBlockResolver.getMigrationBlocks` is used twice differently like the following. We had better inform users at `BlockManagerDecommissioner` once.
    - At the beginning, to get the sub-blocks.
    - In case of `IOException`, to determine whether ignoring it or re-throwing. And, `BlockManagerDecommissioner` shows WARN message (`Skipping block ...`) again.

### Does this PR introduce _any_ user-facing change?

No. This is an update for log message info to be consistent.

### How was this patch tested?

Manually.

Closes #30129 from dongjoon-hyun/SPARK-33218.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-22 01:10:24 -07:00
gengjiaan eb33bcb4b2 [SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE
### What changes were proposed in this pull request?
`REGEXP_REPLACE` could replace all substrings of string that match regexp with replacement string.
But `REGEXP_REPLACE` lost some flexibility. such as: converts camel case strings to a string containing lower case words separated by an underscore:
AddressLine1 -> address_line_1
If we support the parameter position, we can do like this(e.g. Oracle):

```
WITH strings as (
  SELECT 'AddressLine1' s FROM dual union all
  SELECT 'ZipCode' s FROM dual union all
  SELECT 'Country' s FROM dual
)
  SELECT s "STRING",
         lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
  FROM strings;
```
The output:
```
  STRING               MODIFIED_STRING
-------------------- --------------------
AddressLine1         address_line_1
ZipCode              zip_code
Country              country
```

There are some mainstream database support the syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html

### Why are the changes needed?
The parameter position for `REGEXP_REPLACE` is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test.

Closes #29891 from beliefer/add-position-for-regex_replace.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 07:59:49 +00:00
Chao Sun cb3fa6c936 [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side affect from this is we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach by switching to shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #29843 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-10-22 03:21:34 +00:00
Max Gekk ba13b94f6b [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
Current default value `LEGACY` may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 03:04:29 +00:00
Alessandro Patti 4a33cd928d [SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors
### What changes were proposed in this pull request?

Increase tolerance for two tests that fail in some environments and fail in others (flaky? Pass/fail is constant within the same environment)

### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
    predictions[0]
Expected:
    Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
    Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This path changes a test target. Just executed the tests to verify they pass.

Closes #30104 from AlessandroPatti/apatti/rounding-errors.

Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-21 18:14:21 -07:00
Max Gekk bbf2d6f6df [SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

### Why are the changes needed?
To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```

Closes #30118 from MaxGekk/int96-rebase-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-22 10:03:41 +09:00
HyukjinKwon 66005a3236 [SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical
### What changes were proposed in this pull request?

This PR is a small followup of https://github.com/apache/spark/pull/28793 and  proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`.

`is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (87a1cc21ca).

### Why are the changes needed?

To avoid using deprecated APIs, and remove warnings.

### Does this PR introduce _any_ user-facing change?

Yes, it will remove warnings that says `is_categorical` is deprecated.

### How was this patch tested?

By running any pandas UDF with pandas 1.1.0+:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```

Before:

```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
...
```

After:

```
...
```

Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-10-21 14:46:47 -07:00
Dongjoon Hyun 7aed81d492 [SPARK-33202][CORE] Fix BlockManagerDecommissioner to return the correct migration status
### What changes were proposed in this pull request?

This PR changes `<` into `>` in the following to fix data loss during storage migrations.

```scala
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
```

### Why are the changes needed?

`refreshOffloadingShuffleBlocks` should return `true` when the migration is still on-going.

Since `migratingShuffles` is defined like the following, `migratingShuffles.size > numMigratedShuffles.get()` means the migration is not finished.
```scala
// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI with the updated test cases.

Closes #30116 from dongjoon-hyun/SPARK-33202.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-21 14:37:56 -07:00
Takeshi Yamamuro 1b7367ccd7 [SPARK-33205][BUILD] Bump snappy-java version to 1.1.8
### What changes were proposed in this pull request?

This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.

### Why are the changes needed?

For performance improvements; the released `snappy-java` bundles the latest `Snappy` v1.1.8 binaries with small performance improvements.
 - snappy-java release note: https://github.com/xerial/snappy-java/releases/tag/1.1.8
 - snappy release note: https://github.com/google/snappy/releases/tag/1.1.8

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA tests.

Closes #30120 from maropu/Snappy1.1.8.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-21 13:04:39 -07:00
zhengruifeng 618695b78f [SPARK-33111][ML][FOLLOW-UP] aft transform optimization - predictQuantiles
### What changes were proposed in this pull request?
1, optimize `predictQuantiles` by pre-computing an auxiliary var.

### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. I find that we can also optimize `predictQuantiles` by pre-computing an auxiliary var.

It is about 56% faster than existing impl.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #30034 from zhengruifeng/aft_quantiles_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-21 08:49:25 -05:00
Kent Yao dcb0820433 [SPARK-32785][SQL][DOCS][FOLLOWUP] Update migaration guide for incomplete interval literals
### What changes were proposed in this pull request?

Address comments  https://github.com/apache/spark/pull/29635#discussion_r507241899 to improve migration guide

### Why are the changes needed?

improve migration guide

### Does this PR introduce _any_ user-facing change?

NO,only doc update

### How was this patch tested?

passing GitHub action

Closes #30113 from yaooqinn/SPARK-32785-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-21 15:51:16 +09:00
Bryan Cutler 47a6568265 [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested timestamps in pyarrow
### What changes were proposed in this pull request?

Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.

### Why are the changes needed?

The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 09:13:33 +09:00
Dongjoon Hyun 385d5db941 [SPARK-33198][CORE] getMigrationBlocks should not fail at missing files
### What changes were proposed in this pull request?

This PR aims to fix `getMigrationBlocks` error handling and to add test coverage.
1. `getMigrationBlocks` should not fail at indexFile only case.
2. `assert` causes `java.lang.AssertionError` which is not an `Exception`.

### Why are the changes needed?

To handle the exception correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #30110 from dongjoon-hyun/SPARK-33198.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-20 15:02:36 -07:00
Dongjoon Hyun c824db2d8b [MINOR][CORE] Improve log message during storage decommission
### What changes were proposed in this pull request?

This PR aims to improve the log message for better analysis.

### Why are the changes needed?

Good logs are crucial always.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

Closes #30109 from dongjoon-hyun/k8s_log.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-20 14:55:08 -07:00
Keiji Yoshida 46ad325e56 [MINOR][DOCS] Fix the description about to_avro and from_avro functions
### What changes were proposed in this pull request?
This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].

### Why are the changes needed?
Same as above.

### Does this PR introduce _any_ user-facing change?
Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.

### How was this patch tested?
Tested manually by building and checking the document in the local environment.

Closes #30105 from kjmrknsn/fix-docs-sql-data-sources-avro.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 00:36:45 +09:00
HyukjinKwon 2cfd215dc4 [SPARK-33191][YARN][TESTS] Fix PySpark test cases in YarnClusterSuite
### What changes were proposed in this pull request?

This PR proposes to fix:

```
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-client mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```

it currently fails as below:

```
20/10/16 19:20:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (amp-jenkins-worker-03.amp executor 1): org.apache.spark.SparkException:
Error from python worker:
  Traceback (most recent call last):
    File "/usr/lib64/python2.6/runpy.py", line 104, in _run_module_as_main
      loader, code, fname = _get_module_details(mod_name)
    File "/usr/lib64/python2.6/runpy.py", line 79, in _get_module_details
      loader = get_loader(mod_name)
    File "/usr/lib64/python2.6/pkgutil.py", line 456, in get_loader
      return find_loader(fullname)
    File "/usr/lib64/python2.6/pkgutil.py", line 466, in find_loader
      for importer in iter_importers(fullname):
    File "/usr/lib64/python2.6/pkgutil.py", line 422, in iter_importers
      __import__(pkg)
    File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/__init__.py", line 53, in <module>
      from pyspark.rdd import RDD, RDDBarrier
    File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/rdd.py", line 34, in <module>
      from pyspark.java_gateway import local_connect_and_auth
    File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/java_gateway.py", line 29, in <module>
      from py4j.java_gateway import java_import, JavaGateway, JavaObject, GatewayParameters
    File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 60
      PY4J_TRUE = {"yes", "y", "t", "true"}
                        ^
  SyntaxError: invalid syntax
```

I think this was broken when Python 2 was dropped but was not caught because this specific test does not run when there's no change in YARN codes. See also https://github.com/apache/spark/pull/29843#issuecomment-712540024

The root cause seems like the paths are different, see https://github.com/apache/spark/pull/29843#pullrequestreview-502595199. I _think_ Jenkins uses a different Python executable via Anaconda and the executor side does not know where it is for some reasons.

This PR proposes to fix it just by explicitly specifying the absolute path for Python executable so the tests should pass in any environment.

### Why are the changes needed?

To make tests pass.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

This issue looks specific to Jenkins. It should run the tests on Jenkins.

Closes #30099 from HyukjinKwon/SPARK-33191.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-21 00:31:58 +09:00
HyukjinKwon eb9966b700 [SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in GitHub Actions
### What changes were proposed in this pull request?

PyArrow is uploaded into PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:

```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
    .select('id', 'result').collect()
  File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
    sock_info = self._jdf.collectToPython()
  File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
    process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
    serializer.dump_stream(out_iter, outfile)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
    for batch in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
    for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
    return f(keys, vals)
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
    result = f(key, pd.concat(value_series, axis=1))
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
    return f(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
    "{} != {}".format(expected_key[i][1], window_range)
AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
```

https://github.com/apache/spark/runs/1278917457

This PR proposes to set the upper bound of PyArrow in GitHub Actions build. This should be removed when we properly support PyArrow 2.0.0+ (SPARK-33189).

### Why are the changes needed?

To make build pass.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions in this build will test it out.

Closes #30098 from HyukjinKwon/hot-fix-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 17:35:09 +09:00
Gabor Somogyi fbb6843620 [SPARK-32229][SQL] Fix PostgresConnectionProvider and MSSQLConnectionProvider by accessing wrapped driver
### What changes were proposed in this pull request?
Postgres and MSSQL connection providers are not able to get custom `appEntry` because under some circumstances the driver is wrapped with `DriverWrapper`. Such case is not handled in the mentioned providers. In this PR I've added this edge case handling by passing unwrapped `Driver` from `JdbcUtils`.

### Why are the changes needed?
`DriverWrapper` is not considered.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.

Closes #30024 from gaborgsomogyi/SPARK-32229.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-20 15:14:38 +09:00
Max Gekk a44e008de3 [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamps rebasing in saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` isn't set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` to control loading INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about parquet writer - either INT96 was written by Proleptic Gregorian system or some Julian one.
4. Modified Vectorized and Parquet-mr Readers to support loading/saving INT96 timestamps w/o rebasing depending on SQL config and the metadata tag:
    - **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
    - **No rebasing** if parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
    - **With rebasing** if parquet files saved by Spark (any versions) without the metadata tag `org.apache.spark.int96NoRebase`.
    - Rebasing depend on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.

New SQL configs are added instead of re-using existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` because of:
- To allow users have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.

### Why are the changes needed?
1. Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader ( or a writer) is based on the Julian calendar. So, rebasing from Proleptic Gregorian to Julian calendar can be not needed.
2. Rebasing from/to Julian calendar can loose information because dates in one calendar don't exist in another one. Like 1582-10-04..1582-10-15 exist in Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and visa versa, Julian date 1000-02-29 doesn't exist in Proleptic Gregorian calendar. We should allow users to save timestamps without loosing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.

### Does this PR introduce _any_ user-facing change?
It can when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set non-default value `LEGACY`.

### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`

Closes #30056 from MaxGekk/parquet-rebase-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 14:58:59 +09:00
Nan Zhu 35133901f7 [SPARK-32351][SQL] Show partially pushed down partition filters in explain()
### What changes were proposed in this pull request?

Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so the filters are mainly used for explain output in this case. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters with PruneFileSourcePartitions and then PartitionFilters is not shown in explain output.

This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c again, then It will show the partially pushed down partition filter in explain().

### Why are the changes needed?

without the patch, the explained plan is inconsistent with what is actually executed

<b>without the change </b> the explained plan of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and hive tables are like the following respectively (missing pushed down partition filters)

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```

```
   == Physical Plan ==
   *(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
   +- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```

<b> with change </b> the  plan looks like (the actually executed partition filters are exhibited)

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```

```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```

### Does this PR introduce _any_ user-facing change

no

### How was this patch tested?
Unit test.

Closes #29831 from CodingCat/SPARK-32351.

Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 11:13:16 +09:00
liaoaoyuan97 f65a24412b [SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQL Select Reference
### What changes were proposed in this pull request?

Add the link to the feature: "Run SQL on files directly" to SQL reference documentation page

### Why are the changes needed?

To make SQL Reference complete

### Does this PR introduce _any_ user-facing change?

yes. Previously, reading in sql from file directly is not included in the documentation: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html, not listed in from_items. The new link is added to the select statement documentation, like the below:

![image](https://user-images.githubusercontent.com/16770242/96517999-c34f3900-121e-11eb-8d56-c4ba0432855e.png)
![image](https://user-images.githubusercontent.com/16770242/96518808-8126f700-1220-11eb-8c98-fb398eee0330.png)

### How was this patch tested?

Manually built and tested

Closes #30095 from liaoaoyuan97/master.

Authored-by: liaoaoyuan97 <al3468@columbia.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 10:23:58 +09:00
Fokko Driesprong 6ad75cda1e [SPARK-17333][PYSPARK] Enable mypy
### What changes were proposed in this pull request?

Add MyPy to the CI. Once this is installed on the CI: https://issues.apache.org/jira/browse/SPARK-32797?jql=project%20%3D%20SPARK%20AND%20text%20~%20mypy this wil automatically check the types.

### Why are the changes needed?

We should check if the types are still correct on the CI.

```
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

The sphinx-build command was not found. Skipping Sphinx build for now.

all lint-python tests passed!
```

### Does this PR introduce _any_ user-facing change?

No :)

### How was this patch tested?

By running `./dev/lint-python` locally.

Closes #30088 from Fokko/SPARK-17333.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-19 12:50:01 -07:00
Liang-Chi Hsieh 66c5e01322 [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase
### What changes were proposed in this pull request?

This patch proposes to add more optimization to `UpdateFields` expression chain. And optimize `UpdateFields` early in analysis phase.

### Why are the changes needed?

`UpdateFields` can manipulate complex nested data, but using `UpdateFields` can easily create inefficient expression chain. We should optimize it further.

Because when manipulating deeply nested schema, the `UpdateFields` expression tree could be too complex to analyze, this change optimizes `UpdateFields` early in analysis phase.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29812 from viirya/SPARK-32941.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-19 10:35:34 -07:00
Max Gekk 26b13c70c3 [SPARK-33169][SQL][TESTS] Check propagation of datasource options to underlying file system for built-in file-based datasources
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` to `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, CSVSuite` and to `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.

### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites.

Closes #30067 from MaxGekk/ds-options-common-test.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 17:47:49 +09:00
HyukjinKwon a7a8dae483 Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found"
This reverts commit a0aa8f33a9.
2020-10-19 17:13:47 +09:00
xuewei.linxuewei 388e067a90 [SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py
### What changes were proposed in this pull request?

In [SPARK-33139](https://github.com/apache/spark/pull/30042), I was using reflect "Class.forName" in python code to invoke method in SparkSession which is not recommended. using getattr to access "SparkSession$.Module$" instead.

### Why are the changes needed?

Code refine.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existing tests.

Closes #30092 from leanken/leanken-SPARK-33139-followup.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 16:40:48 +09:00
William Hyun 53783e706d [SPARK-33179][TESTS] Switch default Hadoop profile in run-tests.py
### What changes were proposed in this pull request?

This PR aims to switch the default Hadoop profile from `hadoop2.7` to `hadoop3.2` in `dev/run-tests.py` when it's running in local or GitHub Action environments.

### Why are the changes needed?

The default Hadoop version is 3.2. We had better be consistent.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually.

**BEFORE**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop2.7 and Hive profile hive2.3 under environment local
```

**AFTER**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
```

Closes #30090 from williamhyun/SPARK-33179.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 15:54:52 +09:00
William Hyun e6c53c2c1b [SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build
### What changes were proposed in this pull request?

This PR aims to ignore GitHub only changes in Amplab Jenkins build.

### Why are the changes needed?

This will save server resources.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Manually. I used the following doctest during testing and removed it at the clean-up.

E2E tests:

```
cd dev
cat test.py
```

```python
import importlib
runtests = importlib.import_module("run-tests")
print([x.name for x in runtests.determine_modules_for_files([".github/workflows/build_and_test.yml"])])
```

```python
$ GITHUB_ACTIONS=1 python test.py
['root']
$ python test.py
[]
```

Unittests:

```bash
$ GITHUN_ACTIONS=1 python3 -m doctest dev/run-tests.py
$ python3 -m doctest dev/run-tests.py
```

Closes #30020 from williamhyun/SPARK-33123.

Lead-authored-by: William Hyun <williamhyun3@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 14:13:37 +09:00
angerszhu f8277d3aa3 [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory
### What changes were proposed in this pull request?
Improve error message on reading unexpected directory

### Why are the changes needed?
Improve error message on reading unexpected directory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ut

Closes #30027 from AngersZhuuuu/SPARK-32069.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-18 19:02:21 -07:00
tanel.kiis@gmail.com ce498943d2 [SPARK-33177][SQL] CollectList and CollectSet should not be nullable
### What changes were proposed in this pull request?

Mark `CollectList` and `CollectSet` as non-nullable.

### Why are the changes needed?

`CollectList` and `CollectSet` SQL expressions never return null value. Marking them as non-nullable can have some performance benefits, because some optimizer rules apply only to non-nullable expressions

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Did not find any existing tests on the nullability of aggregate functions.

Closes #30087 from tanelk/SPARK-33177_collect.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 09:50:59 +09:00
Dongjoon Hyun 97605cd126 [SPARK-33175][K8S] Detect duplicated mountPaths and fail at Spark side
### What changes were proposed in this pull request?

This PR aims to detect duplicate `mountPath`s and stop the job.

### Why are the changes needed?

If there is a conflict on `mountPath`, the pod is created and repeats the following error messages and keeps running. Spark job should not keep running and wasting the cluster resources. We had better fail at Spark side.
```
$ k get pod -l 'spark-role in (driver,executor)'
NAME    READY   STATUS    RESTARTS   AGE
tpcds   1/1     Running   0          33m
```

```
20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: ...
Message: Pod "tpcds-exec-1" is invalid: spec.containers[0].volumeMounts[1].mountPath:
Invalid value: "/data1": must be unique.
...
```

**AFTER THIS PR**
The job will stop with the following error message instead of keeping running.
```
20/10/18 06:58:45 ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException
java.lang.IllegalArgumentException: requirement failed: Found duplicated mountPath: `/data1`
```

### Does this PR introduce _any_ user-facing change?

Yes, but this is a bug fix.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #30084 from dongjoon-hyun/SPARK-33175-2.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-18 09:59:50 -07:00
HyukjinKwon ad99f14b42 [SPARK-33109][BUILD][FOLLOW-UP] Remove the obsolete comment about bringing sbt-dependency-graph back
### What changes were proposed in this pull request?

This PR proposes to remove an obsolete comment about adding the `sbt-dependency-graph` back in SBT plugins.

### Why are the changes needed?

sbt-dependency-graph is now built-in from SBT 1.4.0, see https://github.com/sbt/sbt/releases/tag/v1.4.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested `./build/sbt dependencyTree`.

Closes #30085 from HyukjinKwon/SPARK-33109.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-18 09:24:44 -07:00
Dongjoon Hyun 20b7b923ab [SPARK-33176][K8S] Use 11-jre-slim as default in K8s Dockerfile
### What changes were proposed in this pull request?

This PR aims to use `openjdk:11-jre-slim` as default in K8s Dockerfile.

### Why are the changes needed?

Although Apache Spark supports both Java8/Java11, there is a difference.

1. Java8-built distribution can run both Java8/Java11
2. Java11-built distribution can run on Java11, but not Java8.

In short, we had better use Java11 in Dockerfile to embrace both cases without any issues.

### Does this PR introduce _any_ user-facing change?

Yes. This will remove the change of user frustration when they build with JDK11 and build the image without overriding Java base image.

### How was this patch tested?

Pass the K8s IT.

Closes #30083 from dongjoon-hyun/SPARK-33176.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-18 09:21:07 -07:00
Keiji Yoshida d2f328aba6 [MINOR][DOCS] Fix the link to the pickle module page in RDD Programming Guide
### What changes were proposed in this pull request?
This pull request changes the link to the pickle module page from https://docs.python.org/2/library/pickle.html to https://docs.python.org/3/library/pickle.html in RDD Programming Guide.

### Why are the changes needed?
Since Python 2 is no longer supported and it is preferable to refer to the pickle module page of Python 3.

### Does this PR introduce _any_ user-facing change?
Yes.
Before: the `Pickle` link's destination page was https://docs.python.org/2/library/pickle.html
After: the `Pickle` link's destination page is https://docs.python.org/3/library/pickle.html

### How was this patch tested?
By building the documentation site and check the link's destination page is changed correctly in the local environment.

Closes #30081 from kjmrknsn/docs-fix-pickle-link.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-18 17:13:55 +09:00
Keiji Yoshida 7766a6fb5f [MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example
### What changes were proposed in this pull request?
This pull request changes the `sep` parameter's value from `:` to `;` in the example of `examples/src/main/python/sql/datasource.py`. This code snippet is shown on the Spark SQL Guide documentation. The `sep` parameter's value should be `;` since the data in https://github.com/apache/spark/blob/master/examples/src/main/resources/people.csv is separated by `;`.

### Why are the changes needed?
To fix the example code so that it can be executed properly.

### Does this PR introduce _any_ user-facing change?
Yes.
This code snippet is shown on the Spark SQL Guide documentation: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options

### How was this patch tested?
By building the documentation and checking the Spark SQL Guide documentation manually in the local environment.

Closes #30082 from kjmrknsn/fix-example-python-datasource.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-18 16:47:04 +09:00
Liang-Chi Hsieh 3010e9044e [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in FileFormatWriter
### What changes were proposed in this pull request?

This patch proposes to add a config we can control fast-fail behavior in FileFormatWriter and set it false by default.

### Why are the changes needed?

In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.

Due to latest discussion, it is important to be able to keep original behavior that is to retry tasks even `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases.

We are going to add a config we can control this behavior and set it false for fast-fail by default.

### Does this PR introduce _any_ user-facing change?

Yes. By default the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. User can control fast-fail behavior by enabling it.

### How was this patch tested?

Unit test.

Closes #30073 from viirya/SPARK-33170.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-17 21:02:25 -07:00
Liang-Chi Hsieh 2c4599db4b [MINOR][SS][DOCS] Update Structured Streaming guide doc and update code typo
### What changes were proposed in this pull request?

This is a minor change to update structured-streaming-programming-guide and typos in code.

### Why are the changes needed?

Keep the user-facing document correct and updated.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #30074 from viirya/ss-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 22:18:12 -07:00
Dongjoon Hyun 911dcd3983 [SPARK-33173][CORE][TESTS] Use eventually to check numOnTaskFailed in PluginContainerSuite
### What changes were proposed in this pull request?

This PR aims to use `eventually` to fix the flakiness of the test case `SPARK-33088: executor failed tasks trigger plugin calls`.

### Why are the changes needed?

The test case checks like the following.
```scala
assert(TestSparkPlugin.executorPlugin.numOnTaskStart == 2)
assert(TestSparkPlugin.executorPlugin.numOnTaskSucceeded == 0)
assert(TestSparkPlugin.executorPlugin.numOnTaskFailed == 2)
```

Although first and second passed, the third can fail.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/lastCompletedBuild/testReport/org.apache.spark.internal.plugin/PluginContainerSuite/SPARK_33088__executor_failed_tasks_trigger_plugin_calls/
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129919/testReport/
```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1 did not equal 2
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.internal.plugin.PluginContainerSuite.$anonfun$new$8(PluginContainerSuite.scala:161)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This only improves the robustness.

Closes #30072 from dongjoon-hyun/SPARK-33173.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 21:23:21 -07:00
Denis Pyshev 0411def0b1 [SPARK-33109][BUILD] Upgrade to sbt 1.4.0
### What changes were proposed in this pull request?

Upgrade sbt to release 1.4.0

### Why are the changes needed?

Bring built-in `dependencyTree` instead of removed `sbt-dependency-graph` plugin, that doesn't work with sbt used in build.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Should pass all the tests.

Closes #30070 from gemelen/feature/sbt-1.4.

Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 16:32:09 -07:00
Liang-Chi Hsieh e574fcd230 [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns
### What changes were proposed in this pull request?

SPARK-29358 added support for `unionByName` to work when the two datasets didn't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds the support to work with struct columns.

The behavior before this PR:

```scala
scala> val df1 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2, 'a', id + 3) c1")
scala> val df2 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2) c1")
scala> df1.unionByName(df2, true).printSchema
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c:bigint,b:bigint> <> struct<c:bigint,b:bigint,a:bigint> at the second column of the second table;;
'Union false, false
:- Project [id#0L AS c0#2L, named_struct(c, (id#0L + cast(1 as bigint)), b, (id#0L + cast(2 as bigint)), a, (id#0L + cast(3 as bigint))) AS c1#3]
:  +- Range (0, 1, step=1, splits=Some(12))
+- Project [c0#8L, c1#9]
   +- Project [id#6L AS c0#8L, named_struct(c, (id#6L + cast(1 as bigint)), b, (id#6L + cast(2 as bigint))) AS c1#9]
      +- Range (0, 1, step=1, splits=Some(12))
```

The behavior after this PR:

```scala
scala> df1.unionByName(df2, true).printSchema
root
 |-- c0: long (nullable = false)
 |-- c1: struct (nullable = false)
 |    |-- a: long (nullable = true)
 |    |-- b: long (nullable = false)
 |    |-- c: long (nullable = false)
scala> df1.unionByName(df2, true).show()
+---+-------------+
| c0|           c1|
+---+-------------+
|  0|    {3, 2, 1}|
|  0|{ null, 2, 1}|
+---+-------------+
```

### Why are the changes needed?

The `allowMissingColumns` of `unionByName` is a feature allowing merging different schema from two datasets when unioning them together. Nested column support makes the feature more general and flexible for usage.

### Does this PR introduce _any_ user-facing change?

Yes, after this change users can union two datasets with different schema with different structs.

### How was this patch tested?

Unit tests.

Closes #29587 from viirya/SPARK-32376.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-16 14:48:14 -07:00
Holden Karau ce6180c8c3 [SPARK-33154][CORE][K8S] Handle cleaned shuffles during migration
### What changes were proposed in this pull request?

If a block is removed between discovery to transfer fo the block, we short circuit that block and remove it from the list to transfer and increment the transferred blocks. This is complicated since both RPC errors and local read errors may be reported with the same exception class.

### Why are the changes needed?

Slow shuffle refreshes could waste time when decommissioning has already finished. Decommissioning might avoid transferring some some blocks to an otherwise live host which is marked as "full" if a deleted block fails to transfer to that host.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit and integration tests.

Closes #30046 from holdenk/handle-cleaned-shuffles-during0migration.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 14:47:46 -07:00
Max Gekk acb79f52db [MINOR][SQL] Re-use binaryToSQLTimestamp() in ParquetRowConverter
### What changes were proposed in this pull request?
The function `binaryToSQLTimestamp()` is used by Parquet Vectorized reader. Parquet MR reader has similar code for de-serialization of INT96 timestamps. In this PR, I propose to de-duplicate code and re-use `binaryToSQLTimestamp()`.

### Why are the changes needed?
This should improve maintenance, and should allow to avoid errors while changing Vectorized and regular parquet readers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites, for instance `ParquetIOSuite`.

Closes #30069 from MaxGekk/int96-common-serde.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 14:27:27 -07:00
Dongjoon Hyun ab0bad9544 [SPARK-33171][INFRA] Mark ParquetV*FilterSuite/ParquetV*SchemaPruningSuite as ExtendedSQLTest
### What changes were proposed in this pull request?

This PR aims to mark ParquetV1FilterSuite and ParquetV2FilterSuite as `ExtendedSQLTest`.
- ParquetV1FilterSuite/ParquetV2FilterSuite
- ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite

### Why are the changes needed?

Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job.

**BEFORE**
- https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes)

**AFTER**
- https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Github Action with the reduced time.

Closes #30068 from dongjoon-hyun/MOVE3.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 12:52:45 -07:00