Commit graph

30785 commits

Author SHA1 Message Date
Hyukjin Kwon d6b974f8ce [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
### What changes were proposed in this pull request?

Test is flaky (https://github.com/apache/spark/runs/3109815586):

```
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 391, in test_parameter_convergence
    eventually(condition, catch_assertions=True)
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in eventually
    raise lastValue
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in eventually
    lastValue = condition()
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 387, in condition
    self.assertEqual(len(model_weights), len(batches))
AssertionError: 9 != 10
```

Should probably increase timeout

### Why are the changes needed?

To avoid flakiness in the test.

### Does this PR introduce _any_ user-facing change?

Nope, dev-only.

### How was this patch tested?

CI should test it out.

Closes #33427 from HyukjinKwon/SPARK-36216.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 13:17:05 +09:00
Kent Yao 0c76fb9c01 [SPARK-36179][SQL] Support TimestampNTZType in SparkGetColumnsOperation
### What changes were proposed in this pull request?

Support TimestampNTZType in SparkGetColumnsOperation

### Why are the changes needed?

TimestampNTZType coverage

### Does this PR introduce _any_ user-facing change?

yes, jdbc end-users will be aware of TimestampNTZType

### How was this patch tested?

add new test

Closes #33393 from yaooqinn/SPARK-36179.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:48:58 +09:00
Dominik Gehl d7d961fabe [SPARK-36176][PYTHON] Expose tableExists in pyspark.sql.catalog
### What changes were proposed in this pull request?
exposing tableExists in pyspark.sql.catalog

### Why are the changes needed?
avoids pyspark users having to go through listTables

### Does this PR introduce _any_ user-facing change?
Yes, additional tableExists method available in pyspark

### How was this patch tested?
test added

Closes #33388 from dominikgehl/feature/SPARK-36176.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:28:52 +09:00
Takuya UESHIN c459c707c5 [SPARK-36167][PYTHON][FOLLOWUP] Fix test failures with older versions of pandas
### What changes were proposed in this pull request?

Fix test failures with `pandas < 1.2`.

### Why are the changes needed?

There are some test failures with `pandas < 1.2`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Fixed tests.

Closes #33398 from ueshin/issues/SPARK-36167/test.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:21:46 +09:00
Xinrong Meng 8dd43351d5 [SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar
### What changes were proposed in this pull request?
Support comparison between a Categorical and a scalar.
There are 3 main changes:
- Modify `==` and `!=` from comparing **codes** of the Categorical to the scalar to comparing **actual values** of the Categorical to the scalar.
- Support `<`, `<=`, `>`, `>=` between a Categorical and a scalar.
- TypeError message fix.

### Why are the changes needed?
pandas supports comparison between a Categorical and a scalar, we should follow pandas' behaviors.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0     True
1    False
2    False
dtype: bool
>>> psser <= 1
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0    False
1     True
2    False
dtype: bool
>>> psser <= 1
0    True
1    True
2    True
dtype: bool

```

### How was this patch tested?
Unit tests.

Closes #33373 from xinrong-databricks/categorical_eq.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-19 15:06:44 -07:00
Kousuke Saruta c7ccc602db [SPARK-36166][TESTS][FOLLOWUP] Add BLOCK_SCALA_VERSION to sparktestssupport/__init__.py
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestssupport/__init__.py`.

### Why are the changes needed?

The following command fails due to the definition is missing.
```
SCALA_PROFILE=scala2.12 dev/run-tests.py
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The command shown above works.

Closes #33421 from sarutak/followup-SPARK-36166.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 22:47:03 +09:00
gengjiaan 7aa01798c5 [SPARK-36091][SQL] Support TimestampNTZ type in expression TimeWindow
### What changes were proposed in this pull request?
The current implement of `TimeWindow` only supports `TimestampType`. Spark added a new type `TimestampNTZType`, so we should support `TimestampNTZType` in expression `TimeWindow`.

### Why are the changes needed?
 `TimestampNTZType` similar to `TimestampType`, we should support `TimestampNTZType` in expression `TimeWindow`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`TimeWindow` will accepts `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33341 from beliefer/SPARK-36091.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-19 19:23:39 +08:00
itholic 2f42afc53a [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

The `DataFrame.to_csv` has `mode` arguments both in pandas and pandas API on Spark.

However, pandas allows the string "w", "w+", "a", "a+" where as pandas-on-Spark allows "append", "overwrite", "ignore", "error" or "errorifexists".

We should map them while `mode` can still accept the existing parameters("append", "overwrite", "ignore", "error" or "errorifexists") as well.

### Why are the changes needed?

APIs in pandas-on-Spark should follows the behavior of pandas for preventing the existing pandas code break.

### Does this PR introduce _any_ user-facing change?

`DataFrame.to_csv` now can accept "w", "w+", "a", "a+" as well, same as pandas.

### How was this patch tested?

Add the unit test and manually write the file with the new acceptable strings.

Closes #33414 from itholic/SPARK-35806.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:58:11 +09:00
Dominik Gehl 2ef8ced27a [SPARK-36181][PYTHON] Update pyspark sql readwriter documentation
### What changes were proposed in this pull request?
Updating the pyspark sql readwriter documentation to the level of detail provided by the scala documentation

### Why are the changes needed?
documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33394 from dominikgehl/feature/SPARK-36181.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:50:42 +09:00
Dominik Gehl fe4db74da4 [SPARK-36178][PYTHON] List pyspark.sql.catalog APIs in documentation
### What changes were proposed in this pull request?
The pyspark.sql.catalog APIs were missing from the documentation. PR fixes this omission.

### Why are the changes needed?
Documentation consistency

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Documentation change only.

Closes #33392 from dominikgehl/feature/SPARK-36178.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:49:09 +09:00
Hyukjin Kwon c92790a101 [SPARK-36205][INFRA] Use set-env instead of set-output in GitHub Actions
### What changes were proposed in this pull request?

This PR is more a cleanup. It removes unused `sync-branch` id in some steps, and use `set-env` instead of `set-output` to set an env.
This can be backported to branch-3.2 too.

### Why are the changes needed?

Cleanup.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #33412 from HyukjinKwon/minor-cleanup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:43:19 +09:00
Hyukjin Kwon 506b333a2f Revert "[SPARK-34806][SQL] Add Observation helper for Dataset.observe"
This reverts commit cc940ff3f8.
2021-07-19 19:32:54 +09:00
Enrico Minack cc940ff3f8 [SPARK-34806][SQL] Add Observation helper for Dataset.observe
### What changes were proposed in this pull request?
This pull request introduces a helper class that simplifies usage of `Dataset.observe()` for batch datasets:

    val observation = Observation("name")
    val observed = ds.observe(observation, max($"id").as("max_id"))
    observed.count()
    val metrics = observation.get

### Why are the changes needed?
Currently, users are required to implement the `QueryExecutionListener` interface to retrieve the metrics, as well as apply some knowledge on threading and locking to pull the metrics over to the main thread. With the helper class, metrics can be retrieved from batch dataset processing with three lines of code (the action on the observed dataset does not count as a line of code here).

### Does this PR introduce _any_ user-facing change?
Yes, one new class and one `Dataset`` method.

### How was this patch tested?
Adds a unit test to `DataFrameSuite`, similar to `"get observable metrics by callback"` in `DataFrameCallbackSuite`.

Closes #31905 from EnricoMi/branch-observation.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 09:16:42 +00:00
Hyukjin Kwon 8ee199ef42 [SPARK-36166][TESTS][FOLLOW-UP] Add Scala version change logic into testing script
### What changes were proposed in this pull request?

This PR is a simple followup from https://github.com/apache/spark/pull/33376:
- It simplifies a bit by removing the default Scala version in the testing script (so we don't have to change here in the future when we change the Scala default version).
- Call `change-scala-version.sh` script (when `SCALA_PROFILE` is explicitly specified)

### Why are the changes needed?

More refactoring. In addition, this change will be used at https://github.com/apache/spark/pull/33410

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in this PR should test it out.

Closes #33411 from HyukjinKwon/SPARK-36166.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 18:01:02 +09:00
Ivan Sadikov 4036ad9ad9 [SPARK-36163][SQL] Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option
### What changes were proposed in this pull request?

This PR fixes two issues highlighted in https://issues.apache.org/jira/browse/SPARK-36163:
- JDBC connection provider propagates incorrect connection properties.
- Ambiguity when more than one JDBC connection provider is available.

I updated `BasicConnectionProvider` to use `jdbcOptions.asConnectionProperties` to remove JDBC data source specific options.

I also added `connectionProvider` data source option that specifies the name of the provider, e.g. `db2`, `presto`, to allow enforcing this specific provider in case of ambiguity.

### Why are the changes needed?
Users can leverage `spark.sql.sources.disabledJdbcConnProviderList` but it is cumbersome and requires them to disable all other providers which could be problematic when using ambiguous providers in two or more different JDBC queries.

### Does this PR introduce _any_ user-facing change?

Yes

PROBLEM DESCRIPTION:
This introduces new JDBC data source option `connectionProvider` that allows users to select a specific JDBC connection provider based on the short name. I updated the SQL guide doc and README.

Before this change, the only way to resolve ambiguity was SQL conf to blacklist all of the other JDBC connection providers. After this change users will be able to specify the exact connection provider they need per data source.

### How was this patch tested?

I updated the existing `ConnectionProviderSuite` and added a new `BasicConnectionProviderSuite`.

Closes #33370 from sadikovi/fix-jdbc-conn-provider.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 17:48:32 +09:00
Angerszhuuuu 313f3c5460 [SPARK-36093][SQL] RemoveRedundantAliases should not change Command's parameter's expression's name
### What changes were proposed in this pull request?
RemoveRedundantAliases may change DataWritingCommand's parameter's attribute name.
In the UT's case before RemoveRedundantAliases the partitionColumns is `CAL_DT`, and change by RemoveRedundantAliases and change to `cal_dt` then case the error case

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
For below SQL case
```
sql("create table t1(cal_dt date) using parquet")
sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
sql("create view t1_v as select * from t1")
sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'")
```

Before this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
+----+------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
+----+----------+
```

After this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
|   2|2021-06-29|
|   2|2021-06-30|
+----+------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
|   2|2021-06-29|
|   2|2021-06-30|
+----+----------+
```

### How was this patch tested?
Added UT

Closes #33324 from AngersZhuuuu/SPARK-36093.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 16:22:31 +08:00
Kent Yao ef80356614 [SPARK-36197][SQL] Use PartitionDesc instead of TableDesc for reading hive partitioned tables
### What changes were proposed in this pull request?

A hive partition can have different `PartitionDesc`s from `TableDesc` for describing Serde/InputFormatClass/OutputFormatClass, for a hive partitioned table, we shall respect those in `PartitionDesc`.

### Why are the changes needed?

in many cases, that Spark reads hive tables could result in surprise because of this issue.

### Does this PR introduce _any_ user-facing change?

yes, hive partition table that contains different serde/input/output could be recognized by Spark

### How was this patch tested?

new test added

Closes #33406 from yaooqinn/SPARK-36197.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-07-19 15:59:36 +08:00
Wenchen Fan 8396a70ddc [SPARK-36184][SQL] Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
### What changes were proposed in this pull request?

Currently, two AQE rules `OptimizeLocalShuffleReader` and `OptimizeSkewedJoin` run `EnsureRequirements` at the end to check if there are extra shuffles in the optimized plan and revert the optimization if extra shuffles are introduced.

This PR proposes to run `ValidateRequirements` instead, which is much simpler than `EnsureRequirements`. This PR also moves this check to `AdaptiveSparkPlanExec`, so that it's centralized instead of in each rule. After centralization, the batch name of optimizing the final stage is the same as normal stages, which makes more sense.

### Why are the changes needed?

`EnsureRequirements` is a big rule and even contains optimizations (remove unnecessary shuffles). `ValidateRequirements` is much faster to run and can avoid potential bugs as it has no optimization and is a pure check.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests.

Closes #33396 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 14:14:40 +08:00
Dongjoon Hyun fd3e9ce0b9 [SPARK-36193][CORE] Recover SparkSubmit.runMain not to stop SparkContext in non-K8s env
### What changes were proposed in this pull request?

According to the discussion on https://github.com/apache/spark/pull/32283 , this PR aims to limit the feature of SPARK-34674 to K8s environment only.

### Why are the changes needed?

To reduce the behavior change in non-K8s environment.

### Does this PR introduce _any_ user-facing change?

The change behavior is consistent with 3.1.1 and older Spark releases.

### How was this patch tested?

N/A

Closes #33403 from dongjoon-hyun/SPARK-36193.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 22:26:23 -07:00
William Hyun df8bae0689 [SPARK-36199][BUILD] Bump scalatest-maven-plugin to 2.0.2
### What changes were proposed in this pull request?
This PR aims to upgrade scalatest-maven-plugin to version 2.0.2.

### Why are the changes needed?
2.0.2 supports build on JDK 11 officially.
- f45ce192f3

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33408 from williamhyun/SMP.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 22:14:24 -07:00
itholic 67e6120a85 [SPARK-35810][PYTHON] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

The `broadcast` functions in `pyspark.pandas` is duplicated to `DataFrame.spark.hint` with `"broadcast"`.

```python
# The below 2 lines are the same
df.spark.hint("broadcast")
ps.broadcast(df)
```

So, we should remove `broadcast` in the future, and show deprecation warning for now.

### Why are the changes needed?

For deduplication of functions

### Does this PR introduce _any_ user-facing change?

They see the deprecation warning when using `broadcast` in `pyspark.pandas`.

```python
>>> ps.broadcast(df)
FutureWarning: `broadcast` has been deprecated and will be removed in a future version. use `DataFrame.spark.hint` with 'broadcast' for `name` parameter instead.
  warnings.warn(
```

### How was this patch tested?

Manually check the warning message and see the build passed.

Closes #33379 from itholic/SPARK-35810.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 10:44:59 +09:00
William Hyun c336f73ccd [SPARK-36198][TESTS] Skip UNIDOC generation in PySpark GHA job
### What changes were proposed in this pull request?
This PR aims to skip UNIDOC generation in PySpark GHA job.

### Why are the changes needed?

PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total.
-https://github.com/apache/spark/runs/3098268973?check_suite_focus=true
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc
...
[info] Main Java API documentation successful.
[success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the GHA.

Closes #33407 from williamhyun/SKIP_UNIDOC.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 17:52:28 -07:00
Yikun Jiang f85855c115 [SPARK-36075][K8S] Support for specifiying executor/driver node selector
### What changes were proposed in this pull request?

Add the support for specifiying executor/driver node selector:
- spark.kubernetes.driver.node.selector.
- spark.kubernetes.executor.node.selector.

### Why are the changes needed?
Now we can only use "spark.kubernetes.node.selector" to set lable for executor/driver. Sometimes, we need set executor/driver pods to different selector separately.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- KubernetesConfSuite for new added configure
- BasicDriverFeatureStepSuite to make sure driver pods node selector set properly
- BasicExecutorFeatureStepSuite to make sure excutor pods node selector set properly

Closes #33283 from Yikun/SPARK-36075.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 15:59:34 -07:00
Kent Yao a9e2156ee5 [SPARK-35460][K8S] verify the content ofspark.kubernetes.executor.podNamePrefix before post it to k8s api-server
### What changes were proposed in this pull request?

```logtalk
21/05/20 21:41:21 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.docker.internal:6443/api/v1/namespaces/default/pods. Message: Pod "spark_exec-exec-688" is invalid: [metadata.name: Invalid value: "spark_exec-exec-688": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.hostname: Invalid value: "spark_exec-exec-688": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, message=Invalid value: "spark_exec-exec-688": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=spec.hostname, message=Invalid value: "spark_exec-exec-688": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark_exec-exec-688, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark_exec-exec-688" is invalid: [metadata.name: Invalid value: "spark_exec-exec-688": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.hostname: Invalid value: "spark_exec-exec-688": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')], metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:583)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:522)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:487)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:448)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:263)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:870)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:365)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
```

When `spark.kubernetes.executor.podNamePrefix` contains invalid characters, the driver will continuously fail to request executors from k8s master, which causes the app to hang with the above message - `'Message: Pod "spark_exec-exec-688" is invalid'`.

In this PR we fail the app when the setting is wrong.

### Why are the changes needed?

`spark.kubernetes.executor.podNamePrefix` is used when users may want full control of executor pod names. It will hang apps w/ wrong characters, it's better to fail directly.

### Does this PR introduce _any_ user-facing change?

yes, invalid `spark.kubernetes.executor.podNamePrefix` cause app to fail not to hang.
### How was this patch tested?

new tests

Closes #32610 from yaooqinn/SPARK-35460.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 15:50:20 -07:00
ulysses-you fe94bf07f9 [SPARK-36014][K8S] Use uuid as app id in kubernetes client mode
### What changes were proposed in this pull request?

Use uuid instead of `System. currentTimeMillis` as app id in kubernetes client mode.

### Why are the changes needed?

Currently, spark on kubernetes with client mode would use `"spark-application-" + System.currentTimeMillis` as app id by default. It would cause app id conflict if submit several spark applications to kubernetes cluster in a short time.

Unfortunately, the event log use app id as the file name. With the conflict event log file, the exception was thrown.

```
Caused by: java.io.FileNotFoundException: File does not exist: xxx/spark-application-1624766876324.lz4.inprogress (inode 5984170846) Holder does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2697)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
```

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

manual test

![image](https://user-images.githubusercontent.com/12025282/124435341-7a88e180-dda7-11eb-8e62-bdfec6a0ee3b.png)

Closes #33211 from ulysses-you/k8s-appid.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 15:41:47 -07:00
yoda-mon eea69c122f [SPARK-36040][DOCS][K8S] Add reference to kubernetes-client's version
### What changes were proposed in this pull request?

Add reference to kubernetes-client's version

### Why are the changes needed?

Running Spark on Kubernetes potentially has upper limitation of Kubernetes version.
I think it is better for users to notice it because Kubernetes update speed is so fast that users tends to run Spark Jobs on unsupported version.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

SKIP_API=1 bundle exec jekyll build

Closes #33255 from yoda-mon/add-reference-kubernetes-client.

Authored-by: yoda-mon <yodal@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 14:26:15 -07:00
Bessenyei Balázs Donát 92d4563124 [MINOR][SQL] Fix typo for config hint in SQLConf.scala
### What changes were proposed in this pull request?

This PR fixes typo for `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` in `SQLConf.scala`.

### Why are the changes needed?

This is a [Broken windows theory](https://en.wikipedia.org/wiki/Broken_windows_theory) change.

### Does this PR introduce _any_ user-facing change?

Yes, after merging this PR, the error message for commands such as
```python
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
```
, users will get a typo-free exception.

### How was this patch tested?

This is a trivial change.

Closes #33389 from bessbd/patch-1.

Authored-by: Bessenyei Balázs Donát <9086834+bessbd@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-18 15:33:26 -05:00
gengjiaan 42275bb20d [SPARK-36090][SQL] Support TimestampNTZType in expression Sequence
### What changes were proposed in this pull request?
The current implement of `Sequence` accept `TimestampType`, `DateType` and `IntegralType`. This PR will let `Sequence` accepts `TimestampNTZType`.

### Why are the changes needed?
We can generate sequence for timestamp without time zone.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR will let `Sequence` accepts `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33360 from beliefer/SPARK-36090.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-18 20:46:23 +03:00
Dongjoon Hyun d7df7a805f [SPARK-36195][BUILD] Set MaxMetaspaceSize JVM option to 2g
### What changes were proposed in this pull request?

This PR aims to set `MaxMetaspaceSize` to `2g` because it's increasing the native memory consumption unlimitedly by default. The unlimited increasing memory causes GitHub Action flakiness. The value I observed during `hive` module test was over 1.8G and growing.

- https://docs.oracle.com/javase/10/gctuning/other-considerations.htm#JSGCT-GUID-BFB89453-60C0-42AC-81CA-87D59B0ACE2E
> Starting with JDK 8, the permanent generation was removed and the class metadata is allocated in native memory. The amount of native memory that can be used for class metadata is by default unlimited. Use the option -XX:MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.

In addition, I increased the following memory limit to 4g consistently from two places.
```xml
- <jvmArg>-Xms2048m</jvmArg>
- <jvmArg>-Xmx2048m</jvmArg>
+ <jvmArg>-Xms4g</jvmArg>
+ <jvmArg>-Xmx4g</jvmArg>
```

```scala
- javaOptions += "-Xmx3g",
+ javaOptions ++= "-Xmx4g -XX:MaxMetaspaceSize=2g".split(" ").toSeq,
```

### Why are the changes needed?

This will reduce the flakiness in CI environment by limiting the memory usage explicitly.

When we limit it with `1g`, Hive module fails with `OOM` like the following.
```
java.lang.OutOfMemoryError: Metaspace
Error: Exception in thread "dispatcher-event-loop-110" java.lang.OutOfMemoryError: Metaspace
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33405 from dongjoon-hyun/SPARK-36195.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Kyle Bendickson <kbendickson@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-18 10:15:15 -07:00
skhandrikagmail bfdde9635d [SPARK-36122][CORE] Passing on needClientAuth to Jetty SSLContextFactory
SPARK-36122: Spark does not passon needClientAuth to Jetty SSLContextFactory. Does not allow to configure mTLS authentication.

passing needClientAuth to sslContextFactory would help enable mTLS authentication for Jetty through x509 certificates.

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #33301 from skhandrikagmail/patch-1.

Authored-by: skhandrikagmail <87313842+skhandrikagmail@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-17 08:59:42 -05:00
Kousuke Saruta 71ea25d4f5 [SPARK-36170][SQL] Change quoted interval literal (interval constructor) to be converted to ANSI interval types
### What changes were proposed in this pull request?

This PR changes the behavior of the quoted interval literals like `SELECT INTERVAL '1 year 2 month'` to be converted to ANSI interval types.

### Why are the changes needed?

The tnit-to-unit interval literals and the unit list interval literals are converted to ANSI interval types but quoted interval literals are still converted to CalendarIntervalType.

```
-- Unit list interval literals
spark-sql> select interval 1 year 2 month;
1-2
-- Quoted interval literals
spark-sql> select interval '1 year 2 month';
1 years 2 months
```

### Does this PR introduce _any_ user-facing change?

Yes but the following sentence in `sql-migration-guide.md` seems to cover this change.
```
  - In Spark 3.2, the unit list interval literals can not mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
For example, `INTERVAL 1 day 1 hour` is invalid in Spark 3.2. In Spark 3.1 and earlier,
there is no such limitation and the literal returns value of `CalendarIntervalType`.
To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
```

### How was this patch tested?

Modified existing tests and add new tests.

Closes #33380 from sarutak/fix-interval-constructor.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-17 12:23:37 +03:00
Liang-Chi Hsieh 8009f0dd92 [SPARK-35785][SS][FOLLOWUP] Remove ignored test from RocksDBSuite
### What changes were proposed in this pull request?

This patch removes an ignored test from `RocksDBSuite`.

### Why are the changes needed?

The removed test is now ignored. The test itself doesn't look making sense. For example, the condition for capturing exception is never matched. The test runs updates to RocksDB instances at same remote dir with same versions. This doesn't look like a case it will run through in practice.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33401 from viirya/remove-ignore-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-17 02:04:55 -07:00
Chandni Singh 6d2cbadcfe [SPARK-32922][SHUFFLE][CORE][FOLLOWUP] Fixes few issues when the executor tries to fetch push-merged blocks
### What changes were proposed in this pull request?
Below 2 bugs were introduced with https://github.com/apache/spark/pull/32140
1. Instead of requesting the local-dirs for push-merged-local blocks from the ESS, `PushBasedFetchHelper` requests it from other executors. Push-based shuffle is only enabled when the ESS is enabled so it should always fetch the dirs from the ESS and not from other executors which is not yet supported.
2. The size of the push-merged blocks is logged incorrectly.

### Why are the changes needed?
This fixes the above mentioned bugs and is needed for push-based shuffle to work properly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested this by running an application on the cluster. The UTs mock the call `hostLocalDirManager.getHostLocalDirs` which is why didn't catch (1) with the UT. However, the fix is trivial and checking this in the UT will require a lot more effort so I haven't modified it in the UT.
Logs of the executor with the bug
```
21/07/15 15:42:46 WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [shuffle-push-merger]
21/07/15 15:42:46 WARN PushBasedFetchHelper: Error while fetching the merged dirs for push-merged-local blocks: shuffle_0_-1_13. Fetch the original blocks instead
java.lang.RuntimeException: java.lang.IllegalStateException: Invalid executor id: shuffle-push-merger, expected 92.
	at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:130)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
```
After the fix, the executors were able to fetch the local push-merged blocks.

Closes #33378 from otterc/SPARK-32922-followup.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-17 00:26:46 -05:00
yi.wu 4783fb72af [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file
### What changes were proposed in this pull request?

This is the initial work of add checksum support of shuffle. This is a piece of https://github.com/apache/spark/pull/32385. And this PR only adds checksum functionality at the shuffle writer side.

Basically, the idea is to wrap a `MutableCheckedOutputStream`* upon the `FileOutputStream` while the shuffle writer generating the shuffle data. But the specific wrapping places are a bit different among the shuffle writers due to their different implementation:

* `BypassMergeSortShuffleWriter` -  wrap on each partition file
* `UnsafeShuffleWriter` - wrap on each spill files directly since they doesn't require aggregation, sorting
* `SortShuffleWriter` - wrap on the `ShufflePartitionPairsWriter` after merged spill files since they might require aggregation, sorting

\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime.

And we use the `Adler32`, which uses the CRC-32 algorithm but much faster, to calculate the checksum as the same as `Broadcast`'s checksum.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, added a new conf: `spark.shuffle.checksum`.

### How was this patch tested?

Added unit tests.

Closes #32401 from Ngone51/add-checksum-files.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-07-17 00:23:14 -05:00
Chao Sun 37dc3f9ea7 [SPARK-36128][SQL] Apply spark.sql.hive.metastorePartitionPruning for non-Hive tables that uses Hive metastore for partition management
### What changes were proposed in this pull request?

In `CatalogFileIndex.filterPartitions`, check the config `spark.sql.hive.metastorePartitionPruning` and don't pushdown predicates to remote HMS if it is false. Instead, fallback to the `listPartitions` API and do the filtering on the client side.

### Why are the changes needed?

Currently the config `spark.sql.hive.metastorePartitionPruning` is only effective for Hive tables, and for non-Hive tables we'd always use the `listPartitionsByFilter` API from HMS client. On the other hand, by default all data source tables also manage their partitions through HMS, when the config `spark.sql.hive.manageFilesourcePartitions` is turned on. Therefore, it seems reasonable to extend the above config for non-Hive tables as well.

In certain cases the remote HMS service could throw exceptions when using the `listPartitionsByFilter` API, which, on the Spark side, is unrecoverable at the current state. Therefore it would be better to allow users to disable the API by using the above config.

For instance, HMS only allow pushdown date column when direct SQL is used instead of JDO for interacting with the underlying RDBMS, and will throw exception otherwise. Even though the Spark Hive client will attempt to recover itself when the exception happens, it only does so when the config `hive.metastore.try.direct.sql` from remote HMS is `false`. There could be cases where the value of `hive.metastore.try.direct.sql` is true but remote HMS still throws exception.

### Does this PR introduce _any_ user-facing change?

Yes now the config `spark.sql.hive.metastorePartitionPruning` is extended for non-Hive tables which use HMS to manage their partition metadata.

### How was this patch tested?

Added a new unit test:
```
build/sbt "hive/testOnly *PruneFileSourcePartitionsSuite -- -z SPARK-36128"
```

Closes #33348 from sunchao/SPARK-36128-by-filter.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-16 13:32:25 -07:00
Dongjoon Hyun 3218e4e14b [SPARK-36152][INFRA][TESTS] Add Scala 2.13 daily build and test GitHub Action job
### What changes were proposed in this pull request?

This PR aims to add a new GitHub Action daily workflow for Scala 2.13 build and test.

### Why are the changes needed?

Apache Spark 3.2.0 aims to support Scala 2.13 officially. We need a test coverage for master/3.2.

The following is the test result on my repository. The daily schedule triggered correctly for both master/3.2 branches.

- https://github.com/dongjoon-hyun/spark/actions/runs/1036083268
![Screen Shot 2021-07-15 at 10 09 22 PM](https://user-images.githubusercontent.com/9700541/125894950-a5a2ff9c-48f9-4184-913c-5422da5305a3.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a daily job. Since there is no way to see this, I tested this in my repository first as described in the above.

Closes #33358 from dongjoon-hyun/SPARK-36152.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-16 07:59:39 -07:00
Dominik Gehl 2d8d7b4aae [SPARK-36160][PYTHON][DOCS] Clarifying documentation for pyspark sql/column
### What changes were proposed in this pull request?
Adapting documentation of `between`, `getField`, `dropFields` and `cast` to the corresponding scala doc

### Why are the changes needed?
Documentation clarity

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only documentation change

Closes #33369 from dominikgehl/feature/SPARK-36160.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 21:32:53 +09:00
Jungtaek Lim f2bf8b051b [SPARK-34893][SS] Support session window natively
Introduction: this PR is the last part of SPARK-10816 (EventTime based sessionization (session window)). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support native session window. Please refer the comments/design doc in SPARK-10816 for more details on the rationalization and design (could be outdated a bit compared to the PR).

The definition of the boundary of "session window" is [the timestamp of start event ~ the timestamp of last event + gap duration). That said, unlike time window, session window is a dynamic window which can expand if new input row is added to the session. To handle expansion of session window, Spark defines session window per input row, and "merge" windows if they can be merged (boundaries are overlapped).

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

First one is preferable as it outperforms compared to the second one, though it can be only used if merging session window can be applied altogether with aggregation. It is not applicable on all the cases, so second one is used to cover the remaining cases.

This PR also applies the optimization on merging input rows and existing sessions with retaining the order (group keys + start timestamp of session window), leveraging the fact the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is lower level API and end users have to code everything in details for defining session window and merging windows
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is same as StructuredSessionization (native session window is shorter and clearer even ignoring model classes).

39542bb81f/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (L66-L105)

(Worth noting that the code in StructuredSessionization only works with processing time. The code doesn't consider old event can update the start time of old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature to support session window on both batch and streaming query, which adds a new function "session_window" which usage is similar with "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-16 20:38:16 +09:00
Ke Jia c1b3f86c58 [SPARK-35710][SQL] Support DPP + AQE when there is no reused broadcast exchange
### What changes were proposed in this pull request?
This PR add the DPP + AQE support when spark can't reuse the broadcast but executing the DPP subquery is cheaper.

### Why are the changes needed?
Improve AQE + DPP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Adding new ut

Closes #32861 from JkSelf/supportDPP3.

Lead-authored-by: Ke Jia <ke.a.jia@intel.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-16 16:01:07 +08:00
Steven Aerts f06aa4a3f3 [SPARK-35985][SQL] push partitionFilters for empty readDataSchema
this commit makes sure that for File Source V2 partition filters are
also taken into account when the readDataSchema is empty.
This is the case for queries like:

    SELECT count(*) FROM tbl WHERE partition=foo
    SELECT input_file_name() FROM tbl WHERE partition=foo

### What changes were proposed in this pull request?

As described in SPARK-35985 there is bug in the File Datasource V2 which prevents it to push down to the FileScanner for queries like the ones listed above.

### Why are the changes needed?

If partitions filters are not pushed down, the whole dataset will be scanned while only one partition is interesting.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

An extra test was added which relies on the output of explain, as is done in other places.

Closes #33191 from steven-aerts/SPARK-35985.

Authored-by: Steven Aerts <steven.aerts@airties.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-16 04:52:46 +00:00
Kousuke Saruta ad744fb4bf [SPARK-36171][BUILD] Upgrade GenJavadoc to 0.18
### What changes were proposed in this pull request?

This PR upgrades `GenJavadoc` plugin from `0.17` to `0.18`.

### Why are the changes needed?

`0.18` includes a bug fix for `Scala 2.13`.
```
This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.
```
https://github.com/lightbend/genjavadoc/releases/tag/v0.18

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc for Scala 2.13.
```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #33383 from sarutak/upgrade-genjavadoc-0.18.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 20:24:45 -07:00
Hyukjin Kwon fba61ad68b [SPARK-36169][SQL] Make 'spark.sql.sources.disabledJdbcConnProviderList' as a static conf (as documneted)
### What changes were proposed in this pull request?

This PR proposes to move `spark.sql.sources.disabledJdbcConnProviderList` from SQLConf to StaticSQLConf which disallows to set in runtime.

### Why are the changes needed?

It's documented as a static configuration. we should make it as a static configuration properly.

### Does this PR introduce _any_ user-facing change?

Previously, the configuration can be set to different value but not effective.
Now it throws an exception if users try to set in runtime.

### How was this patch tested?

Existing unittest was fixed. That should verify the change.

Closes #33381 from HyukjinKwon/SPARK-36169.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 11:43:22 +09:00
Dongjoon Hyun f66153de78 [SPARK-36166][TESTS] Support Scala 2.13 test in dev/run-tests.py
### What changes were proposed in this pull request?

For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`.

In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we currently don't maintain the dependency manifests yet. This will be handled after Apache Spark 3.2.0 release.

### Why are the changes needed?

To test Scala 2.13 with `dev/run-tests.py`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual. The following is the result. Note that this PR aims to **run** Scala 2.13 tests instead of **passing** them. We will have daily GitHub Action job via #33358 and will fix UT failures if exists.
```
$ dev/change-scala-version.sh 2.13

$ SCALA_PROFILE=scala2.13 dev/run-tests.py
...
========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl
...
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly
...

[info] Building Spark assembly using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package
...

========================================================================
Running Java style checks
========================================================================
[info] Checking Java style using SBT with these profiles:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud
...

========================================================================
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc
...

========================================================================
Running Spark unit tests
========================================================================
[info] Running Spark tests using SBT with these arguments:  -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test
...
```

Closes #33376 from dongjoon-hyun/SPARK-36166.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 19:26:07 -07:00
Takuya UESHIN c22f7a4834 [SPARK-36167][PYTHON] Revisit more InternalField managements
### What changes were proposed in this pull request?

Revisit and manage `InternalField` in more places.

### Why are the changes needed?

There are other places we can manage `InternalField`, and we can keep extension dtypes or `CategoricalDtype`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests.

Closes #33377 from ueshin/issues/SPARK-36167/internal_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-15 19:25:20 -07:00
Dongjoon Hyun 5f41a2752f [SPARK-36164][INFRA][FOLLOWUP] Add empty string check back
### What changes were proposed in this pull request?

This is a follow-up of #33371.
At the branch commit GitHub run, we have an empty environment variable.
This PR adds back the empty string check logic.

### Why are the changes needed?

Currently, the failure happens when we use `--modules` in GitHub Action.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
fatal: ambiguous argument '': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main
    changed_files = identify_changed_files_from_git_commits(
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits
    raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target],
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually. The following failure is correct in local environment because it passed `identify_changed_files_from_git_commits` already.
```
$ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module>
    main()
  File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main
    os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"])
  File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'GITHUB_SHA'
```

Closes #33374 from dongjoon-hyun/SPARK-36164.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 13:44:17 -07:00
Max Gekk b09b7f7cc0 [SPARK-36034][SQL] Rebase datetime in pushed down filters to parquet
### What changes were proposed in this pull request?
In the PR, I propose to propagate either the SQL config `spark.sql.parquet.datetimeRebaseModeInRead` or/and Parquet option `datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses the settings in conversions of dates/timestamps instances from datasource filters to values pushed via `FilterApi` to the `parquet-column` lib.

Before the changes, date/timestamp values expressed as days/microseconds/milliseconds are interpreted as offsets in Proleptic Gregorian calendar, and pushed to the parquet library as is. That works fine if timestamp/dates values in parquet files were saved in the `CORRECTED` mode but in the `LEGACY` mode, filter's values could not match to actual values.

After the changes, timestamp/dates values of filters pushed down to parquet libs such as `FilterApi.eq(col1, -719162)` are rebased according the rebase settings. For the example, if the rebase mode is `CORRECTED`, **-719162** is pushed down as is but if the current rebase mode is `LEGACY`, the number of days is rebased to **-719164**. For more context, the PR description https://github.com/apache/spark/pull/28067 shows the diffs between two calendars.

### Why are the changes needed?
The changes fix the bug portrayed by the following example from SPARK-36034:
```scala
In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
>>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
+----+
|date|
+----+
+----+
```
The result must have the date value `0001-01-01`.

### Does this PR introduce _any_ user-facing change?
In some sense, yes. Query results can be different in some cases. For the example above:
```scala
scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show(false)
+----------+
|date      |
+----------+
|0001-01-01|
+----------+
```

### How was this patch tested?
By running the modified test suite `ParquetFilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV1FilterSuite"
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```

Closes #33347 from MaxGekk/fix-parquet-ts-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 22:21:57 +03:00
William Hyun c8a3c22628 [SPARK-36164][INFRA] run-test.py should not fail when APACHE_SPARK_REF is not defined
### What changes were proposed in this pull request?
This PR aims to change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

### Why are the changes needed?
Currently, the run-test.py ends with an error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33371 from williamhyun/SPARK-36164.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 11:43:30 -07:00
Dongjoon Hyun d69f981869 [SPARK-36165][INFRA] Fix SQL doc generation in GitHub Action
### What changes were proposed in this pull request?

This PR aims to fix SQL doc generation in GitHub Action by specifying the mkdocs-installed python version explicitly.

### Why are the changes needed?

Currently, the SQL doc generation is using `spark-submit` and picked up another `Python 3` binaries.
```
Generating SQL configuration table HTML file.
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-config-docs.py", line 25, in <module>
    from mkdocs.structure.pages import markdown
ModuleNotFoundError: No module named 'mkdocs'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action linter job.

Closes #33372 from dongjoon-hyun/fix_mkdocs.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 11:41:48 -07:00
Gengliang Wang 96c2919988 [SPARK-36135][SQL] Support TimestampNTZ type in file partitioning
### What changes were proposed in this pull request?

Support TimestampNTZ type in file partitioning
* When there is no provided schema and the default Timestamp type is TimestampNTZ , Spark should infer and parse the timestamp value partitions as TimestampNTZ.
* When the provided Partition schema is TimestampNTZ, Spark should be able to parse the TimestampNTZ type partition column.

### Why are the changes needed?

File partitioning is an important feature and Spark should support TimestampNTZ type in it.

### Does this PR introduce _any_ user-facing change?

Yes, Spark supports TimestampNTZ type in file partitioning

### How was this patch tested?

Unit tests

Closes #33344 from gengliangwang/partition.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-16 01:13:32 +08:00
Jungtaek Lim 1ceb753ef5 [SPARK-36157][SQL][SS] TimeWindow expression: apply filter before project
### What changes were proposed in this pull request?

This PR proposes to change the application of the operators for TimeWindow, from project -> filter, to filter -> project.

Currently Spark applies project, and filter, while filter is not dependent on project. That said, if the input rows are going to be filtered out via filter predicate, applying projection on these input rows are simply waste of time.

### Why are the changes needed?

This is a simple improvement requiring changes from a couple of lines.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33367 from HeartSaVioR/SPARK-36157.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-15 09:47:25 -07:00