Commit graph

29989 commits

Author SHA1 Message Date
gengjiaan cf2c4ba584 [SPARK-35020][SQL] Group exception messages in catalyst/util
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32367 from beliefer/SPARK-35020.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 08:30:30 +00:00
Wenchen Fan e83910f1f8 [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32198

Before https://github.com/apache/spark/pull/32198, in `WriteTaskStatsTracker.newRow`, we know that the row is written to the current file. After https://github.com/apache/spark/pull/32198 , we no longer know this connection.

This PR adds the file path parameter in `WriteTaskStatsTracker.newRow` to bring back the connection.

### Why are the changes needed?

To not break some custom `WriteTaskStatsTracker` implementations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32459 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 08:28:42 +00:00
Terry Kim 33c1034315 [SPARK-34701][SQL][FOLLOW-UP] Children/innerChildren should be mutually exclusive for AnalysisOnlyCommand
### What changes were proposed in this pull request?

This is a follow up to https://github.com/apache/spark/pull/32032#discussion_r620928086. Basically, `children`/`innerChildren` should be mutually exclusive for `AlterViewAsCommand` and `CreateViewCommand`, which extend `AnalysisOnlyCommand`. Otherwise, there could be an issue in the `EXPLAIN` command. Currently, this is not an issue, because these commands will be analyzed (children will always be empty) when the `EXPLAIN` command is run.

### Why are the changes needed?

To be future-proof where these commands are directly used.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new tsts

Closes #32447 from imback82/SPARK-34701-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-07 06:07:53 +00:00
Cheng Su 42f59caf73 [SPARK-35133][SQL] Explain codegen works with AQE
### What changes were proposed in this pull request?

`EXPLAIN CODEGEN <query>` (and Dataset.explain("codegen")) prints out the generated code for each stage of plan. The current implementation is to match `WholeStageCodegenExec` operator in query plan and prints out generated code (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118 ). This does not work with AQE as we wrap the whole query plan inside `AdaptiveSparkPlanExec` and do not run whole stage code-gen physical plan rule eagerly (`CollapseCodegenStages`). This introduces unexpected behavior change for EXPLAIN query (and Dataset.explain), as we enable AQE by default now.

The change is to explain code-gen for the current executed plan of AQE.

### Why are the changes needed?

Make `EXPLAIN CODEGEN` work same as before.

### Does this PR introduce _any_ user-facing change?

No (when comparing with latest Spark release 3.1.1).

### How was this patch tested?

Added unit test in `ExplainSuite.scala`.

Closes #32430 from c21/explain-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 20:44:31 -07:00
byungsoo 94bbca3e55 [SPARK-35306][MLLIB][TESTS] Add benchmark results for BLASBenchmark created by GitHub Actions machines
### What changes were proposed in this pull request?
This PR adds benchmark results for `BLASBenchmark` created by GitHub Actions machines.
Benchmark result files are added for both JDK 8 (`BLASBenchmark-result.txt`) and 11 (`BLASBenchmark-jdk11-result.txt`) in `{SPARK_HOME}/mllib-local/benchmarks/`.

### Why are the changes needed?
In [SPARK-34950](https://issues.apache.org/jira/browse/SPARK-34950), benchmark results were updated to the ones created by Github Actions machines.
As benchmark results for `BLASBenchmark` (added at [SPARK-33882](https://issues.apache.org/jira/browse/SPARK-33882) and [SPARK-35150](https://issues.apache.org/jira/browse/SPARK-35150)) are not currently available at the repository, this PR adds them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The benchmark results were obtained by running tests with GitHub Actions workflow in my forked repository.
You can refer to the test results and output files from the link below.
- https://github.com/byungsoo-oh/spark/actions/runs/809900377
- https://github.com/byungsoo-oh/spark/actions/runs/810084610

Closes #32435 from byungsoo-oh/SPARK-35306.

Authored-by: byungsoo <byungsoo@byungsoo-pc.tn.corp.samsungelectronics.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-07 11:08:10 +09:00
Takeshi Yamamuro e834ef74dc [SPARK-35293][SQL][TESTS][FOLLOWUP] Update the hash key to refresh TPC-DS cache data in forked GA jobs
### What changes were proposed in this pull request?

This is a follow-up PRi of #32420 and it intends to update the hash key to refresh TPC-DS cache data in forked GA jobs.

### Why are the changes needed?

To recover GA jobs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32460 from maropu/SPARK-35293-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 16:06:50 -07:00
Dongjoon Hyun 482b43d78d [SPARK-35326][BUILD][FOLLOWUP] Update dependency manifest files
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/32453.

### Why are the changes needed?

Jenkins doesn't check dependency manifest files.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action or manually.

Closes #32458 from dongjoon-hyun/SPARK-35326.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 09:08:10 -07:00
Kousuke Saruta bb93547cdf [SPARK-35326][BUILD] Upgrade Jersey to 2.34
### What changes were proposed in this pull request?

This PR upgrades Jersey to 2.34.

### Why are the changes needed?

CVE-2021-28168, a local information disclosure vulnerability, is reported (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168).
Spark 3.1.1, 3.0.2 and 3.2.0 use an affected version 2.30.

### Does this PR introduce _any_ user-facing change?

It's not clear how much the impact is but Spark uses an affected version of Jersey so I think it's better to upgrade it just in case.

### How was this patch tested?

CI.

Closes #32453 from sarutak/upgrade-jersey.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-06 08:36:32 -07:00
Yuanjian Li dfb3343423 [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata
### What changes were proposed in this pull request?
When checking the path in `FileStreamSink.hasMetadata`, we should ignore the error and assume the user wants to read a batch output.

### Why are the changes needed?
Keep the original behavior of ignoring the error.

### Does this PR introduce _any_ user-facing change?
Yes.
The path checking will not throw an exception when checking file sink format

### How was this patch tested?
New UT added.

Closes #31638 from xuanyuanking/SPARK-34526.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-06 22:48:53 +09:00
Liang-Chi Hsieh 6cd5cf5722 [SPARK-35215][SQL] Update custom metric per certain rows and at the end of the task
### What changes were proposed in this pull request?

This patch changes custom metric updating to update per certain rows (currently 100), instead of per row.

### Why are the changes needed?

Based on previous discussion https://github.com/apache/spark/pull/31451#discussion_r605413557, we should only update custom metrics per certain (e.g. 100) rows and also at the end of the task. Updating per row doesn't make too much benefit.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit test.

Closes #32330 from viirya/metric-update.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-06 13:21:08 +00:00
Liang-Chi Hsieh c6d3f3778f [SPARK-35240][SS] Use CheckpointFileManager for checkpoint file manipulation
### What changes were proposed in this pull request?

This patch changes a few places using `FileSystem` API to manipulate checkpoint file to `CheckpointFileManager`.

### Why are the changes needed?

`CheckpointFileManager` is designed to handle checkpoint file manipulation. However, there are a few places exposing `FileSystem` from checkpoint files/paths. We should use `CheckpointFileManager` to manipulate checkpoint files. For example, we may want to have one storage system for checkpoint file. If all checkpoint file manipulation is performed through `CheckpointFileManager`, we can only implement `CheckpointFileManager` for the storage system, and don't need to implement `FileSystem` API for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #32361 from viirya/checkpoint-manager.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-06 00:49:37 -07:00
Linhong Liu 3f5a20919c [SPARK-35318][SQL] Hide internal view properties for describe table cmd
### What changes were proposed in this pull request?
Hide internal view properties for describe table command, because those
properties are generated by spark and should be transparent to the end-user.

### Why are the changes needed?
Avoid internal properties confusing the users.

### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user will see below output for `describe formatted test_view`
```
....
Table Properties       [view.catalogAndNamespace.numParts=2, view.catalogAndNamespace.part.0=spark_catalog, view.catalogAndNamespace.part.1=default, view.query.out.col.0=c, view.query.out.col.1=v, view.query.out.numCols=2, view.referredTempFunctionsNames=[], view.referredTempViewNames=[]]
...
```
After this change, the internal properties will be hidden for `describe formatted test_view`
```
...
Table Properties        []
...
```

### How was this patch tested?
existing UT

Closes #32441 from linhongliu-db/hide-properties.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-06 07:31:34 +00:00
Takeshi Yamamuro 5c67d0c8f7 [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite
### What changes were proposed in this pull request?

This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.

### Why are the changes needed?

For better testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32420 from maropu/UseTpcdsKit.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-06 15:25:46 +09:00
Dongjoon Hyun 19661f6ae2 [SPARK-35325][SQL][TESTS] Add nested column ORC encryption test case
### What changes were proposed in this pull request?

This PR aims to enrich ORC encryption test coverage for nested columns.

### Why are the changes needed?

This will provide a test coverage for this feature.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #32449 from dongjoon-hyun/SPARK-35325.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 22:29:54 -07:00
Dongjoon Hyun a0c76a8755 [SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1
### What changes were proposed in this pull request?

This PR aims to upgrade K8s client to 5.3.1.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

K8s IT is manually tested like the following.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 18 minutes, 33 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.959 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.830 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.457 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.496 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.239 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.006 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.422 s]
[INFO] Spark Project Core ................................. SUCCESS [02:17 min]
[INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  23:59 min
[INFO] Finished at: 2021-05-05T11:59:19-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32443 from dongjoon-hyun/SPARK-35319.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 19:50:37 -07:00
Dongjoon Hyun 0126924568 [SPARK-35323][BUILD] Remove unused libraries from LICENSE-binary
### What changes were proposed in this pull request?

This PR removes unused libraries from `LICENSE-binary` file.

### Why are the changes needed?

SPARK-33212 removes many `Hadoop 3`-only transitive libraries like `dnsjava-2.1.7.jar`. We can simplify Apache Spark LICENSE file by removing them.

### Does this PR introduce _any_ user-facing change?

Yes, but this is only LICENSE file change.

### How was this patch tested?

Manual.

Closes #32445 from dongjoon-hyun/SPARK-35323.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 18:27:56 -07:00
Yingyi Bu 7970318296 [SPARK-35155][SQL] Add rule id pruning to Analyzer rules
### What changes were proposed in this pull request?

Added rule id based pruning to Analyzer rules in fixed point batches:

- org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions
- org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveBinaryArithmetic
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveEncodersInUDF
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveInsertInto
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUserSpecifiedColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution
- org.apache.spark.sql.catalyst.analysis.DeduplicateRelations
- org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
- org.apache.spark.sql.catalyst.analysis.EliminateUnions
- org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveJoinStrategyHints
- org.apache.spark.sql.catalyst.analysis.ResolveInlineTables
- org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables
- org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
- org.apache.spark.sql.catalyst.analysis.ResolveUnion
- org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals
- org.apache.spark.sql.catalyst.analysis.TimeWindowing

Subsequent PRs will add tree bits based pruning to those rules. Split a big PR to reduce review load.

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32425 from sigmod/analyzer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-06 08:55:29 +08:00
Chao Sun 4fe4b65d9e [SPARK-35315][TESTS] Keep benchmark result consistent between spark-submit and SBT
### What changes were proposed in this pull request?

Set `IS_TESTING` to true in `BenchmarkBase`, before running benchmarks.

### Why are the changes needed?

Currently benchmark can be done via 2 ways: `spark-submit`, or SBT command. However in the former Spark will miss some properties such as `IS_TESTING`, which is necessary to turn on/off certain behavior like codegen (`spark.sql.codegen.factoryMode`). Therefore, the result could differ between the two. In addition, the benchmark GitHub workflow is using the spark-submit approach.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32440 from sunchao/SPARK-35315.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-05-05 18:30:51 +08:00
Yijia Cui bbdbe0f734 [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay
### What changes were proposed in this pull request?
This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case to support Kafka micro batch stream to report the stats of # of offsets for the current offset falling behind the latest.

A public interface is added.

`metrics`: returns the metrics reported by the streaming source with given offset.

### Why are the changes needed?
The new API can expose any custom metrics for the "current" offset for streaming sources. Different from #31398, this PR makes metrics available to user through progress report, not through spark UI. A use case is that people want to know how the current offset falls behind the latest offset.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test for Kafka micro batch source v2 are added to test the Kafka use case.

Closes #31944 from yijiacui-db/SPARK-34297.

Authored-by: Yijia Cui <yijia.cui@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-05 17:26:07 +09:00
dsolow f550e03b96 [SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions
### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

 This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <dsolow@sayari.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: dmsolow <dsolow@sayarianalytics.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-05 12:46:13 +09:00
Yingyi Bu 7fd3f8f9ec [SPARK-35294][SQL] Add tree traversal pruning in rules with dedicated files under optimizer
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- CREATE_NAMED_STRUCT
- EXTRACT_VALUE
- JSON_TO_STRUCT
- OUTER_REFERENCE
- AGGREGATE
- LOCAL_RELATION
- EXCEPT
- LIMIT
- WINDOW

Used them in the following rules:
- DecorrelateInnerQuery
- LimitPushDownThroughWindow
- OptimizeCsvJsonExprs
- PropagateEmptyRelation
- PullOutGroupingExpressions
- PushLeftSemiLeftAntiThroughJoin
- ReplaceExceptWithFilter
- RewriteDistinctAggregates
- SimplifyConditionalsInPredicate
- UnwrapCastInBinaryComparison

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32421 from sigmod/opt.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-04 19:17:22 +08:00
byungsoo 9b387a1718 [SPARK-35308][TESTS] Fix bug in SPARK-35266 that creates benchmark files in invalid path with wrong name
### What changes were proposed in this pull request?
This PR fixes a bug in [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266) that creates benchmark files in the invalid path with the wrong name.
e.g. For `BLASBenchmark`,
- AS-IS: Creates `benchmarksBLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/`
- TO-BE: Creates `BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`

### Why are the changes needed?
As you can see in the above example, new benchmark files cannot be created as intended due to this bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
After building Spark, manually tested with the following command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "org.apache.spark.ml.linalg.BLASBenchmark"
```
It successfully generated the benchmark files as intended (`BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`).

Closes #32432 from byungsoo-oh/SPARK-35308.

Lead-authored-by: byungsoo <byungsoo@byungsoo-pc.tn.corp.samsungelectronics.net>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 19:40:57 +09:00
HyukjinKwon a2927cb28b [SPARK-35302][INFRA] Benchmark workflow should create new files for new benchmarks
### What changes were proposed in this pull request?

Currently, it fails at `git diff --name-only` when new benchmarks are added, see https://github.com/HyukjinKwon/spark/actions/runs/808870999

We should include untracked files (new benchmark result files) to upload so developers download the results.

### Why are the changes needed?

So the new benchmark results can be added and uploaded.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Tested at:

https://github.com/HyukjinKwon/spark/actions/runs/808867285

Closes #32428 from HyukjinKwon/include-new-benchmarks.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 19:02:52 +09:00
Xinrong Meng 5ecb112410 [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst
### What changes were proposed in this pull request?

Use full names of modules in `install.rst` when specifying dependencies.

### Why are the changes needed?

Using full names makes it more clear.
In addition, `pandas APIs on Spark` as a new module can start to be recognized by more people.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual verification.

Closes #32427 from xinrong-databricks/nameDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 11:02:57 +09:00
Xinrong Meng 120c389b00 [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark
### What changes were proposed in this pull request?

Port Koalas dependencies appropriately to PySpark dependencies.

### Why are the changes needed?

pandas-on-Spark has its own required dependency and optional dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #32386 from xinrong-databricks/portDeps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:04:23 +09:00
garawalid 176218b6b8 [SPARK-35292][PYTHON] Delete redundant parameter in mypy configuration
### What changes were proposed in this pull request?

The parameter **no_implicit_optional** is defined twice in the mypy configuration, [ligne 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and ligne 105.

### Why are the changes needed?

We would like to keep the mypy configuration clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This patch can be tested with `dev/lint-python`

Closes #32418 from garawalid/feature/clean-mypy-config.

Authored-by: garawalid <gwalid94@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:01:34 +09:00
HyukjinKwon 8aaa9e890a [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation
### What changes were proposed in this pull request?

This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 08:44:18 +09:00
Tobias Hermann 54e0aa10c8 [MINOR][SS][DOCS] Fix a typo in the documentation of GroupState
### What changes were proposed in this pull request?

Fixing some typos in the documenting comments.

### Why are the changes needed?

To make reading the docs more pleasant.

### Does this PR introduce _any_ user-facing change?

Yes, since the user sees the docs.

### How was this patch tested?

It was not tested, because no code was changed.

Closes #32400 from Dobiasd/patch-1.

Authored-by: Tobias Hermann <editgym@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 19:35:38 +09:00
byungsoo be6ecb6d19 [SPARK-35266][TESTS] Fix error in BenchmarkBase.scala that occurs when creating benchmark files in non-existent directory
### What changes were proposed in this pull request?
This PR fixes an error in `BenchmarkBase.scala` that occurs when creating a benchmark file in a non-existent directory.

### Why are the changes needed?
When submitting a benchmark job using `org.apache.spark.benchmark.Benchmarks` class with `SPARK_GENERATE_BENCHMARK_FILES=1` option, an exception is raised if the directory where the benchmark file will be generated does not exist.
For more information, please refer to [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
After building Spark, manually tested with the following command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "org.apache.spark.ml.linalg.BLASBenchmark"
```
It successfully generated the benchmark result files.

**Why it is sufficient:**
As illustrated in the comments in `Benchmarks.scala`, the command below runs all benchmarks and generates the results:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "*"
```
Of all the benchmarks (55 benchmarks in total), only `BLASBenchmark` fails due to the proposed issue for the current code in the master branch. Thus, it is currently sufficient to test `BLASBenchmark` to validate this change.

Closes #32394 from byungsoo-oh/SPARK-35266.

Authored-by: byungsoo <byungsoo@byungsoo-pc.tn.corp.samsungelectronics.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 18:06:06 +09:00
Yikun Jiang 44b7931936 [SPARK-35176][PYTHON] Standardize input validation error type
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas

### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.

[1] https://docs.python.org/3/library/exceptions.html#TypeError

Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.

### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.

### How was this patch tested?
Existing test case and UT

Closes #32368 from Yikun/SPARK-35176.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 15:34:24 +09:00
Chao Sun 2a8d7ed4bf [SPARK-35281][SQL] StaticInvoke should not apply boxing if return type is primitive
### What changes were proposed in this pull request?

In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.

### Why are the changes needed?

It is unnecessary to apply boxing when the method return value is of primitive type, and it would hurt performance a lot if the method is simple. The check is done in `Invoke` but not in `StaticInvoke`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a UT.

Closes #32416 from sunchao/SPARK-35281.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 14:55:35 +09:00
Max Gekk 335f00b19b [SPARK-35285][SQL] Parse ANSI interval types in SQL schema
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
    - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
    - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard to be able to parse the names back by Spark SQL parser. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`.

### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.

### Does this PR introduce _any_ user-facing change?
Should not since the types has not been released yet.

### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```

Closes #32409 from MaxGekk/parse-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 13:50:35 +09:00
Takeshi Yamamuro cd689c942c [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
juliuszsompolski
npoggi
wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
 - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
 - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

 I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema #32037. So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-03 12:04:42 +09:00
Angerszhuuuu caa46ce0b6 [SPARK-35112][SQL] Support Cast string to day-second interval
### What changes were proposed in this pull request?
Support Cast string to day-seconds interval

### Why are the changes needed?
Users can cast day-second interval string to DayTimeIntervalType.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32271 from AngersZhuuuu/SPARK-35112.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-02 09:28:51 +03:00
Peter Toth cfc0495f9c [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
### What changes were proposed in this pull request?
This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.

### Why are the changes needed?
If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-02 05:53:09 +00:00
Liang-Chi Hsieh 6ce1b161e9 [SPARK-35278][SQL] Invoke should find the method with correct number of parameters
### What changes were proposed in this pull request?

This patch fixes `Invoke` expression when the target object has more than one method with the given method name.

### Why are the changes needed?

`Invoke` will find out the method on the target object with given method name. If there are more than one method with the name, currently it is undeterministic which method will be used. We should add the condition of parameter number when finding the method.

### Does this PR introduce _any_ user-facing change?

Yes, fixed a bug when using `Invoke` on a object where more than one method with the given method name.

### How was this patch tested?

Unit test.

Closes #32404 from viirya/verify-invoke-param-len.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-01 10:20:46 -07:00
Yuming Wang 72e238a790 [SPARK-35273][SQL] CombineFilters support non-deterministic expressions
### What changes were proposed in this pull request?

This pr makes `CombineFilters` support non-deterministic expressions. For example:
```sql
spark.sql("CREATE TABLE t1(id INT, dt STRING) using parquet PARTITIONED BY (dt)")
spark.sql("CREATE VIEW v1 AS SELECT * FROM t1 WHERE dt NOT IN ('2020-01-01', '2021-01-01')")
spark.sql("SELECT * FROM v1 WHERE dt = '2021-05-01' AND rand() <= 0.01").explain()
```

Before this pr:
```
== Physical Plan ==
*(1) Filter (isnotnull(dt#1) AND ((dt#1 = 2021-05-01) AND (rand(-6723800298719475098) <= 0.01)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [NOT dt#1 IN (2020-01-01,2021-01-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

After this pr:
```
== Physical Plan ==
*(1) Filter (rand(-2400509328955813273) <= 0.01)
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), NOT dt#1 IN (2020-01-01,2021-01-01), (dt#1 = 2021-05-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32405 from wangyum/SPARK-35273.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-01 06:02:11 +00:00
Dongjoon Hyun 4e8701a77d [SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi
### What changes were proposed in this pull request?

Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0.

### Why are the changes needed?

Apache Spark 3.1.1 makes `Kubernetes` module GA and provides an extensible external cluster manager framework. To have `ExternalClusterManager` for K8s environment, `KubernetesUtils` class is crucial and needs to be stable. By promoting to a subset of K8s developer API, we can maintain these more sustainable way and give a better and stable functionality to K8s users.

In this PR, `Since` annotations denote the last function signature changes because these are going to become public at Apache Spark 3.2.0.

| Version | Function Name |
|-|-|
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.3.0 | requireNandDefined |
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.4.0 | parseMasterUrl |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | requireSecondIfFirstIsDefined |
| 3.0.0 | selectSparkContainer |
| 3.0.0 | formatPairsBundle |
| 3.0.0 | formatPodState |
| 3.0.0 | containersDescription |
| 3.0.0 | containerStatusDescription |
| 3.0.0 | formatTime |
| 3.0.0 | uniqueID |
| 3.0.0 | buildResourcesQuantities |
| 3.0.0 | uploadAndTransformFileUris |
| 3.0.0 | uploadFileUri |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | buildPodWithServiceAccount |
| 3.0.0 | isLocalAndResolvable |
| 3.1.1 | renameMainAppResource |
| 3.1.1 | addOwnerReference |
| 3.2.0 | loadPodFromTemplate |

### Does this PR introduce _any_ user-facing change?

Yes, but this is new API additions.

### How was this patch tested?

Pass the CIs.

Closes #32406 from dongjoon-hyun/SPARK-35280.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-30 11:39:18 -07:00
ulysses-you 39889df32a [SPARK-35264][SQL] Support AQE side broadcastJoin threshold
### What changes were proposed in this pull request?

~~This PR aims to add a new AQE optimizer rule `DynamicJoinSelection`. Like other AQE partition number configs, this rule add a new broadcast threshold config `spark.sql.adaptive.autoBroadcastJoinThreshold`.~~
This PR amis to add a flag in `Statistics` to distinguish AQE stats or normal stats, so that we can make some sql configs isolation between AQE and normal.

### Why are the changes needed?

The main idea here is that make join config isolation between normal planner and aqe planner which shared the same code path.

Actually we do not very trust using the static stats to consider if it can build broadcast hash join. In our experience it's very common that Spark throw broadcast timeout or driver side OOM exception when execute a bit large plan. And due to braodcast join is not reversed which means if we covert join to braodcast hash join at first time, we(AQE) can not optimize it again, so it should make sense to decide if we can do broadcast at aqe side using different sql config.

### Does this PR introduce _any_ user-facing change?

Yes, a new config `spark.sql.adaptive.autoBroadcastJoinThreshold` added.

### How was this patch tested?

Add new test.

Closes #32391 from ulysses-you/SPARK-35264.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-30 09:16:21 +00:00
Angerszhuuuu 11ea255283 [SPARK-35111][SQL] Support Cast string to year-month interval
### What changes were proposed in this pull request?
Support Cast string to year-month interval
Supported format as below
```
ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH
HIVE_STYLE like
10-1 or -10-1

Rules from the SQL standard about ANSI_STYLE:

<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
  <years value> [ <minus sign> <months value> ]
  | <months value>
<years value> ::=
  <datetime value>
<months value> ::=
  <datetime value>
<datetime value> ::=
  <unsigned integer>
<unsigned integer> ::= <digit>...
```
### Why are the changes needed?
Support Cast string to year-month interval

### Does this PR introduce _any_ user-facing change?
User can cast year month interval string to YearMonthIntervalType

### How was this patch tested?
Added UT

Closes #32266 from AngersZhuuuu/SPARK-SPARK-35111.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-30 08:03:07 +03:00
William Hyun ac8813e37c [SPARK-35277][BUILD] Upgrade snappy to 1.1.8.4
### What changes were proposed in this pull request?
This PR aims to upgrade snappy to version 1.1.8.4.

### Why are the changes needed?
This will bring the latest bug fixes and improvements.
- https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-1183-2021-01-20

    - Make pure-java Snappy thread-safe
    - Improved SnappyFramedInput/OutputStream performance by using java.util.zip.CRC32C

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the CIs.

Closes #32402 from williamhyun/snappy1184.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 21:26:16 -07:00
lipzhu 77e9152898 [SPARK-35255][BUILD] Automated formatting for Scala Code for Blank Lines
### What changes were proposed in this pull request?

https://github.com/databricks/scala-style-guide#blanklines
https://scalameta.org/scalafmt/docs/configuration.html#newlinestoplevelstatements

### How was this patch tested?

Manually tested by modifying a few files and running ./dev/scalafmt then checking that ./dev/scalastyle still passed.

Closes #32383 from lipzhu/SPARK-35255.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-30 11:45:58 +09:00
Kousuke Saruta e8bf8fe213 [SPARK-35047][SQL] Allow Json datasources to write non-ascii characters as codepoints
### What changes were proposed in this pull request?

This PR proposes to enable the JSON datasources to write non-ascii characters as codepoints.
To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for JSON datasources.

### Why are the changes needed?

JSON specification allows codepoints as literal but Spark SQL's JSON datasources don't support the way to do it.
It's great if we can write non-ascii characters as codepoints, which is a platform neutral representation.

### Does this PR introduce _any_ user-facing change?

Yes. Users can write non-ascii characters as codepoints with JSON datasources.

### How was this patch tested?

New test.

Closes #32147 from sarutak/json-unicode-write.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:50:15 -07:00
Kousuke Saruta 8a5af37c25 [SPARK-35268][BUILD] Upgrade GenJavadoc to 0.17
### What changes were proposed in this pull request?

This PR upgrades `GenJavadoc` to `0.17`.

### Why are the changes needed?

This version seems to include a fix for an issue which can happen with Scala 2.13.5.
https://github.com/lightbend/genjavadoc/releases/tag/v0.17

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed build succeed with the following commands.
```
# For Scala 2.12
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc

# For Scala 2.13
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #32392 from sarutak/upgrade-genjavadoc-0.17.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:47:14 -07:00
attilapiros 738cf7f8ff [SPARK-35009][CORE] Avoid creating multiple python worker monitor threads for the same worker and same task context
### What changes were proposed in this pull request?

With this PR Spark avoids creating multiple monitor threads for the same worker and same task context.

### Why are the changes needed?

Without this change unnecessary threads will be created. It even can cause job failure for example when a coalesce (without shuffle) from high partition number goes to very low one. This exception is exactly comes for such a run:

```
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.210 executor driver): java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2262)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2211)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2210)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2210)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1083)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1083)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1083)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2449)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2391)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2380)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:872)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2220)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2260)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually I used a the following Python script used (`reproduce-SPARK-35009.py`):

```
import pyspark

conf = pyspark.SparkConf().setMaster("local[*]").setAppName("Test1")
sc = pyspark.SparkContext.getOrCreate(conf)

rows = 70000
data = list(range(rows))
rdd = sc.parallelize(data, rows)
assert rdd.getNumPartitions() == rows
rdd0 = rdd.filter(lambda x: False)
data = rdd0.coalesce(1).collect()
assert data == []
```

Spark submit:
```
$ ./bin/spark-submit reproduce-SPARK-35009.py
```

#### With this change

Checking the number of monitor threads with jcmd:
```
$ jcmd
85273 sun.tools.jcmd.JCmd
85227 org.apache.spark.deploy.SparkSubmit reproduce-SPARK-35009.py
41020 scala.tools.nsc.MainGenericRunner
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
...
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
```
<img width="859" alt="Screenshot 2021-04-14 at 16 06 51" src="https://user-images.githubusercontent.com/2017933/114731755-4969b980-9d42-11eb-8ec5-f60b217bdd96.png">

#### Without this change

```
...
$ jcmd 90052 Thread.print | grep -c "Monitor for python"                                                                                                      [INSERT]
5645
..
```

<img width="856" alt="Screenshot 2021-04-14 at 16 30 18" src="https://user-images.githubusercontent.com/2017933/114731724-4373d880-9d42-11eb-9f9b-d976bf2530e2.png">

Closes #32169 from attilapiros/SPARK-35009.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-04-29 18:38:31 +02:00
lipzhu 4e3daa5994 [SPARK-35254][BUILD] Upgrade SBT to 1.5.1
### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.5.1.

### Why are the changes needed?

https://github.com/sbt/sbt/releases/tag/v1.5.1

### Does this PR introduce _any_ user-facing change?

NO.

### How was this patch tested?

Pass the SBT CIs (Build/Test/Docs/Plugins).

Closes #32382 from lipzhu/SPARK-35254.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:32:43 -07:00
yangjie01 7b78e34417 [SPARK-35269][BUILD] Upgrade commons-lang3 to 3.12.0
### What changes were proposed in this pull request?

This pr aims to upgrade Apache commons-lang3 to 3.12.0

### Why are the changes needed?
This version will bring the latest bug fixes as follows:

- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32393 from LuciferYang/lang3-to-312.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:27:28 -07:00
yi.wu 068b6c8be6 [SPARK-35234][CORE] Reserve the format of stage failureMessage
### What changes were proposed in this pull request?

`failureMessage` is already formatted, but `replaceAll("\n", " ")` destroyed the format. This PR fixed it.

### Why are the changes needed?

The formatted error message is easier to read and debug.

### Does this PR introduce _any_ user-facing change?

Yes, users see the clear error message in the application log.

(Note I changed a little bit to let the test throw exception intentionally. The test itself is good.)

Before:
![2141619490903_ pic_hd](https://user-images.githubusercontent.com/16397174/116177970-5a092f00-a747-11eb-9a0f-017391e80c8b.jpg)

After:

![2151619490955_ pic_hd](https://user-images.githubusercontent.com/16397174/116177981-5ecde300-a747-11eb-90ef-fd16e906beeb.jpg)

### How was this patch tested?

Manually tested.

Closes #32356 from Ngone51/format-stage-error-message.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-04-29 16:33:36 +02:00
Kousuke Saruta 132cbf0c8c [SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE commands
### What changes were proposed in this pull request?

This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive.

### Why are the changes needed?

To make those commands more useful.

### Does this PR introduce _any_ user-facing change?

Yes. In the current implementation, those commands can take a path which contains whitespaces without enclose it by neither `'` nor `"` but after this change, users need to enclose such paths.
I've note this incompatibility in the migration guide.

### How was this patch tested?

New tests.

Closes #32205 from sarutak/add-multiple-files.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:58:51 +09:00
Kousuke Saruta 529b875901 [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources
### What changes were proposed in this pull request?

This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows to reflect the change of `krb5.conf`.

### Why are the changes needed?

In the current master, JDBC datasources can't accept `refreshKrb5Config` which is defined in `Krb5LoginModule`.
So even if we change the `krb5.conf` after establishing a connection, the change will not be reflected.

The similar issue happens when we run multiple `*KrbIntegrationSuites` at the same time.
`MiniKDC` starts and stops every KerberosIntegrationSuite and different port number is recorded to `krb5.conf`.
Due to `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, KerberosIntegrationSuites except the first running one see the wrong port so those suites fail.
You can easily confirm with the following command.
```
build/sbt -Phive Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```
### Does this PR introduce _any_ user-facing change?

Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration.

### How was this patch tested?

New test.

Closes #32344 from sarutak/kerberos-refresh-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:55:53 +09:00