Commit graph

30256 commits

Author SHA1 Message Date
itholic fe09def323 [SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents
### What changes were proposed in this pull request?

This PR aims to move `# noqa` in the Python docstrings to the proper place so that they are hidden from the official documents.

### Why are the changes needed?

If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit weird as below:
<img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png">

### Does this PR introduce _any_ user-facing change?

Yes, the `# noqa` is no longer shown in the documents, as below:
<img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png">

### How was this patch tested?

Manually build docs and check.

Closes #32728 from itholic/SPARK-35582.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 15:24:04 +09:00
Yingyi Bu 1dd0ca23f6 [SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- AGGREGATE_EXPRESSION
- ALIAS
- GROUPING_ANALYTICS
- GENERATOR
- HIGH_ORDER_FUNCTION
- LAMBDA_FUNCTION
- NEW_INSTANCE
- PIVOT
- PYTHON_UDF
- TIME_WINDOW
- TIME_ZONE_AWARE_EXPRESSION
- UP_CAST
- COMMAND
- EVENT_TIME_WATERMARK
- UNRESOLVED_RELATION
- WITH_WINDOW_DEFINITION
- UNRESOLVED_ALIAS
- UNRESOLVED_ATTRIBUTE
- UNRESOLVED_DESERIALIZER
- UNRESOLVED_ORDINAL
- UNRESOLVED_FUNCTION
- UNRESOLVED_HINT
- UNRESOLVED_SUBQUERY_COLUMN_ALIAS
- UNRESOLVED_FUNC

Added tree pattern pruning to the following Analyzer rules:
- ResolveBinaryArithmetic
- WindowsSubstitution
- ResolveAliases
- ResolveGroupingAnalytics
- ResolvePivot
- ResolveOrdinalInOrderByAndGroupBy
- LookupFunction
- ResolveSubquery
- ResolveSubqueryColumnAliases
- ApplyCharTypePadding
- UpdateOuterReferences
- ResolveCreateNamedStruct
- TimeWindowing
- CleanupAliases
- EliminateUnions
- EliminateSubqueryAliases
- HandleAnalysisOnlyCommand
- ResolveNewInstances
- ResolveUpCast
- ResolveDeserializer
- ResolveOutputRelation
- ResolveEncodersInUDF
- HandleNullInputsForUDF
- ResolveGenerate
- ExtractGenerator
- GlobalAggregates
- ResolveAggregateFunctions

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.
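For context, a rule guarded by pattern pruning looks roughly like the sketch below; `ResolvePivotSketch` and its `resolvePivot` helper are placeholders, not the actual rule body.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Pivot}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.PIVOT

// Sketch: the rule only descends into subtrees whose precomputed pattern
// bits say a Pivot node may be present; all other subtrees are skipped.
object ResolvePivotSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.resolveOperatorsWithPruning(_.containsPattern(PIVOT)) {
      case p: Pivot => resolvePivot(p)
    }

  // Placeholder for the real rewrite (Pivot -> Aggregate/Project).
  private def resolvePivot(p: Pivot): LogicalPlan = p
}
```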

### How was this patch tested?

Existing tests.
Performance diff:
Rule | Baseline | Experiment | Experiment/Baseline
-- | -- | -- | --
ResolveBinaryArithmetic | 43264874 | 34707117 | 0.80
WindowsSubstitution | 3322996 | 2734192 | 0.82
ResolveAliases | 24859263 | 21359941 | 0.86
ResolveGroupingAnalytics | 39249143 | 25417569 | 0.80
ResolvePivot | 6393408 | 2843314 | 0.44
ResolveOrdinalInOrderByAndGroupBy | 10750806 | 3386715 | 0.32
LookupFunction | 22087384 | 15481294 | 0.70
ResolveSubquery | 1129139340 | 944402323 | 0.84
ResolveSubqueryColumnAliases | 5055038 | 2808210 | 0.56
ApplyCharTypePadding | 76285576 | 63785681 | 0.84
UpdateOuterReferences | 6548321 | 3092539 | 0.47
ResolveCreateNamedStruct | 38111477 | 17350249 | 0.46
TimeWindowing | 41694190 | 3739134 | 0.09
CleanupAliases | 48683506 | 39584921 | 0.81
EliminateUnions | 3405069 | 2372506 | 0.70
EliminateSubqueryAliases | 9626649 | 9518216 | 0.99
HandleAnalysisOnlyCommand | 2562123 | 2661432 | 1.04
ResolveNewInstances | 16208966 | 1982314 | 0.12
ResolveUpCast | 14067843 | 1868615 | 0.13
ResolveDeserializer | 17991103 | 2320308 | 0.13
ResolveOutputRelation | 5815277 | 2088787 | 0.36
ResolveEncodersInUDF | 14182892 | 1045113 | 0.07
HandleNullInputsForUDF | 19850838 | 1329528 | 0.07
ResolveGenerate | 5587345 | 1953192 | 0.35
ExtractGenerator | 120378046 | 3386286 | 0.03
GlobalAggregates | 16510455 | 13553155 | 0.82
ResolveAggregateFunctions | 1041848509 | 828049280 | 0.79


Closes #32686 from sigmod/analyzer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-01 11:39:42 +08:00
itholic 73d4f67145 [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes to move CSV data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for CSV data source options has been separated into different pages for each language API. This makes managing the many options inconvenient, so it is more efficient to manage all options in a single page and link to that page from each language's API documentation.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "CSV Files" page
<img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png">

- Python
<img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png">

- Scala
<img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png">

- Java
<img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32658 from itholic/SPARK-35433.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:58:49 +09:00
Wenchen Fan bb2a0747d2 [SPARK-35578][SQL][TEST] Add a test case for a bug in janino
### What changes were proposed in this pull request?

This PR adds a unit test to show a bug in the latest janino version, which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (it includes some custom expressions), but this pattern is not uncommon and I believe it can be triggered by some real queries.

A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark.

### Why are the changes needed?

Make it easy for people to debug janino, as I'm not a janino expert.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32716 from cloud-fan/janino.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:51:05 +09:00
Hyukjin Kwon 1ba1b70cfe [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+
### What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three issues as below:

- Avoid a sort on DataFrame which isn't legitimate: https://github.com/apache/spark/pull/32709#discussion_r642458108
- Treat the empty timezone and local timezone as equivalent in SparkR: https://github.com/apache/spark/pull/32709#discussion_r642464454
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is https://github.com/apache/spark/pull/32709#discussion_r642510089

### Why are the changes needed?

Higher R versions have bug fixes and improvements. More importantly, R users tend to use the latest R versions.

### Does this PR introduce _any_ user-facing change?

Yes, SparkR will work with R 4.1.0+.

### How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...

utils:
functions in utils.R: ..............................................
```

Closes #32709 from HyukjinKwon/SPARK-35573.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:35:52 +09:00
itholic 7e2717333b [SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor
### What changes were proposed in this pull request?

This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".

### Why are the changes needed?

Because we don't use the name "Koalas" anymore; we use "Pandas API on Spark" instead.

So the related code bases all need to be changed.

### Does this PR introduce _any_ user-facing change?

Yes, the usage of the pandas API on Spark accessor is changed from `df.koalas.[...]` to `df.pandas_on_spark.[...]`.

**Note:** `df.koalas.[...]` is still available but will show deprecation warnings.

### How was this patch tested?

Manually tested locally and checked one by one.

Closes #32674 from itholic/SPARK-35453.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:33:10 +09:00
Gengliang Wang 8e11f5f007 [SPARK-35576][SQL] Redact the sensitive info in the result of Set command
### What changes were proposed in this pull request?

Currently, the results of following SQL queries are not redacted:
```
SET [KEY];
SET;
```
For example:

```
scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+

scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+

scala> spark.sql("set").show()
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|javax.jdo.option....|              123456|

```

We should hide the sensitive information and redact the query output.
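For reference, a minimal standalone sketch of how this kind of redaction works, assuming the default `spark.redaction.regex` value and the usual `*********(redacted)` placeholder:

```scala
import java.util.regex.Pattern

// Sketch: replace the value whenever the key matches the redaction regex.
def redact(regex: Pattern, key: String, value: String): String =
  if (regex.matcher(key).find()) "*********(redacted)" else value

val redactionRegex = Pattern.compile("(?i)secret|password|token|access[.]key")
println(redact(redactionRegex, "javax.jdo.option.ConnectionPassword", "123456"))
// expected output: *********(redacted)
```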

### Why are the changes needed?

Security.

### Does this PR introduce _any_ user-facing change?

Yes, the sensitive information in the output of Set commands is redacted

### How was this patch tested?

Unit test

Closes #32712 from gengliangwang/redactSet.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-31 14:50:18 -07:00
shahid cd2ef9cb43 [SPARK-35567][SQL] Fix: Explain cost is not showing statistics for all the nodes
### What changes were proposed in this pull request?
The explain cost command in Spark currently doesn't show statistics for all the nodes; it misses some nodes in almost all the TPCDS queries.
In this PR, we collect all the plan nodes, including the subqueries, and compute the statistics for each node if it doesn't already exist in the stats cache.
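For reference, the statistics in question are the per-node `Statistics(...)` annotations printed by `EXPLAIN COST`; a minimal way to see them (the exact plan text varies by version):

```scala
// Run in spark-shell: every node of the optimized plan should carry a
// Statistics(sizeInBytes=..., ...) annotation after this fix.
spark.sql(
  """EXPLAIN COST
    |SELECT key, count(*) FROM VALUES (1), (2) AS t(key) GROUP BY key""".stripMargin
).show(false)
```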

### Why are the changes needed?
**Before Fix**
For eg: Query1,  Project node doesn't have statistics
![image](https://user-images.githubusercontent.com/23054875/120123442-868feb00-c1cc-11eb-9af9-3a87bf2117d2.png)

Query15, Aggregate node doesn't have statistics

![image](https://user-images.githubusercontent.com/23054875/120123296-a4108500-c1cb-11eb-89df-7fddd651572e.png)

**After Fix:**
Query1:
![image](https://user-images.githubusercontent.com/23054875/120123559-1df53e00-c1cd-11eb-938a-53704f5240e6.png)
Query 15:
![image](https://user-images.githubusercontent.com/23054875/120123665-bb507200-c1cd-11eb-8ea2-84c732215bac.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing

Closes #32704 from shahidki31/shahid/fixshowstats.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-01 00:55:29 +08:00
Tengfei Huang 1603775934 [SPARK-35411][SQL][FOLLOWUP] Handle Currying Product while serializing TreeNode to JSON
### What changes were proposed in this pull request?
Handle currying Products while serializing TreeNode to JSON. While processing [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may get an assert error for cases like a currying Product because of the mismatch in size between field names and field values.
Fall back to using reflection to get all the values of the constructor parameters when we meet such cases.
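To illustrate the size mismatch with a minimal standalone example:

```scala
// A curried case class: only the first parameter list contributes Product
// elements, so constructor parameters (2) outnumber product fields (1).
case class CurriedNode(name: String)(val weight: Double)

val n = CurriedNode("a")(1.0)
println(n.productArity)      // 1
println(n.productElement(0)) // a; weight is not a product element
```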

### Why are the changes needed?
Avoid throwing error while serializing TreeNode to JSON, try to output as much information as possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT case added.

Closes #32713 from ivoson/SPARK-35411-followup.

Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-31 22:15:26 +08:00
Hyukjin Kwon 14e12c64d3 [SPARK-35575][INFRA] Recover updating build status in GitHub Actions
### What changes were proposed in this pull request?

This PR fixes the logic to be fault tolerant when it gets the status of the workflow run from the PR author's forked repository.

Looks like https://github.com/apache/spark/pull/32483 removed and disabled (see also https://github.com/apache/spark/pull/32486/checks?check_run_id=2648696751) the GitHub actions workflow runs in the forked repositories, and the detection logic in the main repo fails because the runs don't exist anymore.

See also https://github.com/apache/spark/runs/2709537998?check_suite_focus=true

### Why are the changes needed?

To recover the status update of GitHub Actions in PRs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It cannot be tested without being merged.

Closes #32711 from HyukjinKwon/SPARK-35575.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-31 19:29:54 +09:00
Yuming Wang 6cd6c438f2 [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side
### What changes were proposed in this pull request?

This PR adds a new rule that removes an outer join if it only has DISTINCT on the streamed side. For example:
```scala
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
```

Before this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [a#2L]
   +- Join LeftOuter, (a#2L = b#6L)
      :- Project [id#0L AS a#2L]
      :  +- Range (0, 200, step=1, splits=Some(2))
      +- Project [id#4L AS b#6L]
         +- Range (0, 300, step=1, splits=Some(2))
```

After this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [id#0L AS a#2L]
   +- Range (0, 200, step=1, splits=Some(2))
```

### Why are the changes needed?

Improve query performance: a left outer join can only duplicate streamed-side rows or append NULLs from the build side, so a DISTINCT computed purely over streamed-side columns is unchanged when the join is removed. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) supports this feature:
![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31908 from wangyum/SPARK-34808.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-05-31 18:14:15 +08:00
Shiqi Sun 8c69e9cd94 [SPARK-35562][DOC] Fix docs about Kubernetes and Yarn
Fixed some places in cluster-overview that are obsolete (i.e. not mentioning Kubernetes), and also fixed the Yarn spark-submit sample command in submitting-applications.

### What changes were proposed in this pull request?

This is to fix the docs in "Cluster Overview" and "Submitting Applications" for places where Kubernetes is missed (mostly due to obsolete docs that haven't got updated) and where Yarn sample spark-submit command is incorrectly written.

### Why are the changes needed?

To help Spark users who use Kubernetes as the cluster manager get a correct idea when reading the "Cluster Overview" doc page, and to make the sample spark-submit command for Yarn actually runnable in the "Submitting Applications" doc page, by removing the invalid comment after the line continuation char `\`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No test, as this is a doc fix.

Closes #32701 from huskysun/doc-fix.

Authored-by: Shiqi Sun <s.sun@salesforce.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-31 02:43:58 -07:00
Liang-Chi Hsieh 73ba4492b1 [SPARK-35566][SS] Fix StateStoreRestoreExec output rows
### What changes were proposed in this pull request?

This is a minor change to update how `StateStoreRestoreExec` computes its number of output rows. Previously we only counted input rows, and the optionally restored rows were not counted.

### Why are the changes needed?

Currently the number of output rows of `StateStoreRestoreExec` only counts the input rows, but it actually outputs input rows plus optional restored rows. We should report the correct number of output rows.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32703 from viirya/fix-outputrows.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-31 16:45:56 +09:00
Dongjoon Hyun c225196be0 [SPARK-35507][INFRA] Add Python 3.9 in the docker image for GitHub Action
### What changes were proposed in this pull request?

This PR aims to add `Python 3.9.5` and update the docker image references, except for the SparkR job.

### Why are the changes needed?

To save GitHub Action resources and be more robust against Python and R library changes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #32706 from dongjoon-hyun/SPARK-35507.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-31 05:56:47 +00:00
allisonwang-db 806da9d6fa [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions
### What changes were proposed in this pull request?
This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`.

### Why are the changes needed?
Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.

For example:
`SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2`

During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`. But after the optimizer rule `PullupCorrelatedPredicates`, the children field is used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression exclude the inner plan's output:
29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)

This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
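A rough sketch of the refactored shape, going off this description (the surrounding signatures are assumptions, not the exact code in `subquery.scala`):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, ExprId, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch: outer references and join conditions live in separate fields, and
// `references` is always derived from the outer attributes alone.
abstract class SubqueryExpressionSketch(
    plan: LogicalPlan,
    outerAttrs: Seq[Expression], // collected outer references
    joinCond: Seq[Expression],   // filled in by PullupCorrelatedPredicates
    exprId: ExprId) {
  def references: AttributeSet =
    AttributeSet.fromAttributeSets(outerAttrs.map(_.references))
}
```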

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #32687 from allisonwang-db/refactor-subquery-expr.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-31 04:57:24 +00:00
Dongjoon Hyun 1a55019b1f [SPARK-31168][BUILD][FOLLOWUP] Update scala-2.12 profile
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/32697 to update the missed part.
After SPARK-34774, we have Scala 2.12 version in `scala-2.12` profile.

### Why are the changes needed?

To be consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual.

**BEFORE**
```
$ build/mvn help:evaluate -Pscala-2.12 -Dexpression=scala.version | grep "^2.12"
Using `mvn` from path: /usr/local/bin/mvn
2.12.10
```

**AFTER**
```
$ build/mvn help:evaluate -Pscala-2.12 -Dexpression=scala.version | grep "^2.12"
Using `mvn` from path: /usr/local/bin/mvn
2.12.14
```

Closes #32707 from dongjoon-hyun/SPARK-31168-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-30 21:27:24 -07:00
yangjie01 ff27264ae5 [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3
### What changes were proposed in this pull request?
This PR upgrades Jackson to 2.12.3.
Jackson Release 2.12.3: [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.3](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.3)

### Why are the changes needed?
Upgrade to a new version to bring potential bug fixes like [https://github.com/FasterXML/jackson-modules-java8/issues/207](https://github.com/FasterXML/jackson-modules-java8/issues/207), and Avro's master has already been upgraded to Jackson 2.12.3.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32688 from LuciferYang/SPARK-35550.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-31 10:28:43 +09:00
yangjie01 09d039da56 [SPARK-35526][CORE][SQL][ML][MLLIB] Re-Cleanup procedure syntax is deprecated compilation warning in Scala 2.13
### What changes were proposed in this pull request?
After SPARK-29291 and SPARK-33352, there are still some compilation warnings about `procedure syntax is deprecated` as follows:

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testSimpleSpillingForAllCodecs`'s return type
[WARNING] [Warn] /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
[WARNING] [Warn] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return type
[WARNING] [Warn] /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602: [deprecation   | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `executeCTASWithNonEmptyLocation`'s return type
```

So the main change of this PR is to clean up these compilation warnings.
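The fix itself is mechanical, e.g. (a sketch using a method name from the warnings above):

```scala
// Before (deprecated procedure syntax, warned about by Scala 2.13):
//   def registerMergeResult(shuffleId: Int) { ... }

// After: add `: Unit =` to declare the return type explicitly.
def registerMergeResult(shuffleId: Int): Unit = {
  // method body unchanged
}
```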

### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13, and this change should be compatible with Scala 2.12.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32669 from LuciferYang/re-clean-procedure-syntax.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-30 16:49:47 -07:00
yangjie01 16d9de815e [SPARK-35532][TESTS] Ensure mllib and kafka-0-10 module can be maven test independently in Scala 2.13
### What changes were proposed in this pull request?
Before this PR, when we execute the Maven test command to test the `mllib` and `kafka-0-10` modules independently, some Java UTs fail; the key error messages are as follows:

```
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
```

and

```
java.lang.NoClassDefFoundError: scala/collection/parallel/immutable/ParVector
```

The UTs need `scala-parallel-collections_2.13`, but it is not in the classpath when we run `mvn test -pl mllib -Pscala-2.13` and `mvn test -pl external/kafka-0-10 -Pscala-2.13`.

So the main change of this PR is to add a `scala-2.13` profile to `mllib/pom.xml` and `external/kafka-0-10/pom.xml`; the `scala-2.13` profile includes a dependency on `scala-parallel-collections_2.13`, so these two modules can be tested independently with Maven.

### Why are the changes needed?
Ensure the `mllib` and `kafka-0-10` modules can be tested independently with Maven in Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the GitHub Action Scala 2.13 job
- Manual test:

1. Execute
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13
```

2. Execute

```
mvn test -pl mllib -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13
```

**Before**

6 Java UTs failed:

```
[ERROR] Errors:
[ERROR]   JavaStreamingLogisticRegressionSuite.javaAPI:78 » TestFailed 20005 was not les...
[ERROR]   JavaStreamingKMeansSuite.javaAPI:78 » TestFailed 20040 was not less than 20000...
[ERROR]   JavaPrefixSpanSuite.runPrefixSpan:45 » NoClassDefFound scala/collection/parall...
[ERROR]   JavaPrefixSpanSuite.runPrefixSpanSaveLoad:67 » NoClassDefFound scala/collectio...
[ERROR]   JavaStreamingLinearRegressionSuite.javaAPI:77 » TestFailed 20014 was not less ...
[ERROR]   JavaStatisticsSuite.streamingTest:112 » TestFailed 20043 was not less than 200...
[INFO]
[ERROR] Tests run: 122, Failures: 0, Errors: 6, Skipped: 0
```

**After**

```
[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0

Run completed in 28 minutes, 32 seconds.
Total number of tests run: 1654
Suites: completed 208, aborted 0
Tests: succeeded 1654, failed 0, canceled 0, ignored 7, pending 0
All tests passed.
```

3. Execute

```
mvn test -pl external/kafka-0-10 -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13
```

**Before**

2 Java UTs failed:

```
[ERROR] Errors:
[ERROR] org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream
[ERROR]   Run 1: JavaDirectKafkaStreamSuite.testKafkaStream:170 expected:<[topic1-1, topic1-2, topic2-1, topic1-3, topic2-2, topic2-3]> but was:<[]>
[ERROR]   Run 2: JavaDirectKafkaStreamSuite.tearDown:57 » NoClassDefFound scala/collection/para...
[ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0
```

**After**

```
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0

Run completed in 1 minute, 3 seconds.
Total number of tests run: 21
Suites: completed 4, aborted 0
Tests: succeeded 21, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

Closes #32676 from LuciferYang/mllib-kafka-mvn-test.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-30 16:36:17 -07:00
Dongjoon Hyun 6c4b60f3b3 [SPARK-31168][BUILD] Upgrade Scala to 2.12.14
### What changes were proposed in this pull request?

This PR is the 4th try to upgrade Scala 2.12.x in order to assess the feasibility.
- https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum )
- https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya )
- https://github.com/apache/spark/pull/31223 (Upgrade Scala to 2.12.13, dongjoon-hyun )

Note that Scala 2.12.14 has the following fix for the Apache Spark community.
- Fix cyclic error in runtime reflection (protobuf), a regression that prevented Spark upgrading to 2.12.13

REQUIREMENTS:
- [x] `silencer` library is released via https://github.com/ghik/silencer/pull/66
- [x] `genjavadoc` library is released via https://github.com/lightbend/genjavadoc/issues/282

### Why are the changes needed?

Apache Spark was stuck on 2.12.10 due to regressions in Scala 2.12.11/2.12.12/2.12.13. This will bring all the bug fixes.
- https://github.com/scala/scala/releases/tag/v2.12.14
- https://github.com/scala/scala/releases/tag/v2.12.13
- https://github.com/scala/scala/releases/tag/v2.12.12
- https://github.com/scala/scala/releases/tag/v2.12.11

### Does this PR introduce _any_ user-facing change?

Yes, but this is a bug-fixed version.

### How was this patch tested?

Pass the CIs.

Closes #32697 from dongjoon-hyun/SPARK-31168.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-30 16:08:13 -07:00
Yingyi Bu 5c8a141d03 [SPARK-35538][SQL] Migrate transformAllExpressions call sites to use transformAllExpressionsWithPruning
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- EXCHANGE
- IN_SUBQUERY_EXEC
- UPDATE_FIELDS

Migrated `transformAllExpressions` call sites to use `transformAllExpressionsWithPruning`.
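A before/after sketch of such a call-site migration; `optimizeUpdateFields` is a placeholder standing in for the real rule body:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UpdateFields}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.trees.TreePattern.UPDATE_FIELDS

def optimizeUpdateFields(u: UpdateFields): Expression = u // placeholder

// Before: visits every expression in the plan unconditionally.
def before(plan: LogicalPlan): LogicalPlan =
  plan.transformAllExpressions { case u: UpdateFields => optimizeUpdateFields(u) }

// After: subtrees whose pattern bits lack UPDATE_FIELDS are skipped entirely.
def after(plan: LogicalPlan): LogicalPlan =
  plan.transformAllExpressionsWithPruning(_.containsPattern(UPDATE_FIELDS)) {
    case u: UpdateFields => optimizeUpdateFields(u)
  }
```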

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.
Perf diff:
Rule name | Total Time (baseline) | Total Time (experiment) | Experiment/Baseline
-- | -- | -- | --
OptimizeUpdateFields | 54646396 | 27444424 | 0.50
ReplaceUpdateFieldsExpression | 24694303 | 2087517 | 0.08

Closes #32643 from sigmod/all_expressions.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2021-05-28 15:36:25 -07:00
Wenchen Fan 678592a612 [SPARK-35559][TEST] Speed up one test in AdaptiveQueryExecSuite
### What changes were proposed in this pull request?

I just noticed that `AdaptiveQueryExecSuite.SPARK-34091: Batch shuffle fetch in AQE partition coalescing` takes more than 10 minutes to finish, which is unacceptable.

This PR sets the shuffle partitions to 10 in that test, so that the test can finish within 5 seconds.
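A hedged sketch of the kind of change, using the standard `withSQLConf` test helper:

```scala
import org.apache.spark.sql.internal.SQLConf

// Sketch: pinning shuffle partitions to a small number keeps the coalescing
// behavior under test while avoiding thousands of tiny shuffle tasks.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  // run the SPARK-34091 batch shuffle fetch assertions here
}
```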

### Why are the changes needed?

speed up the test

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32695 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-28 12:39:34 -07:00
Kousuke Saruta b763db3efd [SPARK-35194][SQL][FOLLOWUP] Recover build error with Scala 2.13 on GA
### What changes were proposed in this pull request?

This PR fixes a build error with Scala 2.13 on GA.
#32301 seems to bring this error.

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA

Closes #32696 from sarutak/followup-SPARK-35194.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-05-29 00:11:16 +09:00
Karen Feng e8631660ec [SPARK-35194][SQL] Refactor nested column aliasing for readability
### What changes were proposed in this pull request?

Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.

### Why are the changes needed?

Improves readability for future maintenance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32301 from karenfeng/refactor-nested-column-aliasing.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-28 13:18:44 +00:00
ulysses-you 3b94aad5e7 [SPARK-35552][SQL] Make query stage materialized more readable
### What changes were proposed in this pull request?

Add a new method `isMaterialized` in `QueryStageExec`.

### Why are the changes needed?

Currently, we use `resultOption().get.isDefined` to check if a query stage has materialized. The code is not readable at a glance. It's better to use a new method like `isMaterialized` to define it.
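A minimal sketch of the idea, assuming `resultOption` is the `AtomicReference`-style holder that `QueryStageExec` already uses:

```scala
import java.util.concurrent.atomic.AtomicReference

// Sketch: Some(result) is set once the stage's shuffle/broadcast future
// completes, so "materialized" is just "is the result present yet".
val resultOption = new AtomicReference[Option[Any]](None)

def isMaterialized: Boolean = resultOption.get().isDefined
```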

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass CI.

Closes #32689 from ulysses-you/SPARK-35552.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-05-28 20:42:11 +08:00
Hyukjin Kwon 7eb74482a7 [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true
### What changes were proposed in this pull request?

This PR proposes to fix and re-enable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true`, which was disabled when we upgraded to Python 3.9 in CI at https://github.com/apache/spark/pull/32657.

This seems to be because of a behaviour change in the latest NumPy; see also https://github.com/numpy/numpy/pull/16273#discussion_r641264085.

pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude the boolean case alone in the percentile/quartile test case.

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I roughly tested locally, but it should pass in CI.

Closes #32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 17:35:01 +09:00
Kousuke Saruta 2de19e460b [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
#32631 was merged once, but there was a lack of consideration.

The diff between this change and 692d95d145, merged in #32631, is as follows.

```
       if: github.repository != 'apache/spark'
       id: sync-branch
       run: |
+        apache_spark_ref=`git rev-parse HEAD`
         git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' commit -m "Merged commit"
+        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
     - name: Cache Scala, SBT and Maven
       uses: actions/cache@v2
       with:
```

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32691 from sarutak/docker-integration-test-ga-take2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 16:54:47 +09:00
Hyukjin Kwon d189cf75f9 Revert "[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA"
This reverts commit 0a74ad66b3.
2021-05-28 14:29:12 +09:00
Kousuke Saruta 3267b17713 [SPARK-35530][ML][TESTS] Fix rounding error in DifferentiableLossAggregatorSuite with Java 11
### What changes were proposed in this pull request?

This PR fixes an test failure of `DifferentiableLossAggregatorSuite` with Java 11.

### Why are the changes needed?

I'm personally checking whether all the tests pass with Java 11 for the current master, and I found that DifferentiableLossAggregatorSuite fails.
https://github.com/sarutak/spark/runs/2661859541?check_suite_focus=true#step:9:13895

The reason seems to be that the implementation of Blas.daxpy differs between Java 8 and Java 11. For Java 11, `Math.fma` is used.

https://github.com/luhenry/netlib/blob/v2.2.0/blas/src/main/java/dev/ludovic/netlib/blas/Java8BLAS.java#L92
0053ea30b1/blas/src/main/java/dev/ludovic/netlib/blas/Java11BLAS.java (L40)

To remove the rounding error, this PR changes `TestAggregator.add` to use fma.
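For intuition: fma computes the exact product plus addend and rounds once, while `a * b + c` rounds twice, so the two forms can differ in the last bits. A minimal demonstration (`Math.fma` requires Java 9+):

```scala
// Two roundings: 0.1 * 10.0 already rounds to exactly 1.0, so the sum is 0.0.
val twice = 0.1 * 10.0 - 1.0
// One rounding: the exact product 1.0000000000000000555... minus 1.0.
val once = Math.fma(0.1, 10.0, -1.0) // ~5.551115123125783e-17
println(s"$twice vs $once")
```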

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed `DifferentiableLossAggregatorSuite` passes with both Java 8 and Java 11.

Closes #32673 from sarutak/fix-rounding-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 11:39:24 +09:00
Kousuke Saruta 0a74ad66b3 [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA
### What changes were proposed in this pull request?

This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
`docker-integration-tests` can't run if Docker is not installed, so it runs only if `docker-integration-tests` is specified with `--module`.

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32631 from sarutak/docker-integration-test-ga.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 07:56:37 +09:00
Wenchen Fan 29ed1a2de4 [SPARK-35541][SQL] Simplify OptimizeSkewedJoin
### What changes were proposed in this pull request?

Various small code simplification/cleanup for OptimizeSkewedJoin

### Why are the changes needed?

code refactor

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #32685 from cloud-fan/skew-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-27 09:17:28 -07:00
Yuanjian Li f98a063a4b [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
### What changes were proposed in this pull request?
Initial implementation of RocksDBCheckpointMetadata. It persists the metadata for RocksDBFileManager.

### Why are the changes needed?
The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the number of total keys.
The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - [Directory Structure and Format for Files stored in DFS](https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2).
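A rough sketch of the persisted shape; the field names below are assumptions beyond the "file names" and "number of total keys" stated here:

```scala
// Sketch (assumed field names): serialized to JSON once per committed batch.
case class RocksDBCheckpointMetadataSketch(
    sstFiles: Seq[String], // RocksDB file names tracked by the file manager
    numKeys: Long)         // total number of keys in the checkpoint
```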

### Does this PR introduce _any_ user-facing change?
No. Internal implementation only.

### How was this patch tested?
New UT added.

Closes #32272 from xuanyuanking/SPARK-35172.

Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-27 22:56:50 +09:00
dgd-contributor 52a1f8c000 [SPARK-33428][SQL] Match the behavior of conv function to MySQL's
### What changes were proposed in this pull request?
The Spark conv function comes from MySQL, and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it.

However, Spark currently has different behavior in two cases:

- MySQL allows leading spaces, but Spark does not.
- If the input string is way too long, Spark fails with an ArrayIndexOutOfBoundsException.

This patch makes conv follow MySQL's behavior in those two cases:

- conv allows leading spaces.
- conv returns the max unsigned long when the input string is way too long.
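Illustrative queries for the two cases, runnable in spark-shell (expected results follow the MySQL semantics described above):

```scala
// Leading spaces are now accepted: '  100' in base 2 is 4 in base 10.
spark.sql("SELECT conv('  100', 2, 10)").show()

// A digit string far beyond 64 bits now yields the max unsigned long
// (18446744073709551615) instead of an ArrayIndexOutOfBoundsException.
spark.sql(s"SELECT conv('${"9" * 40}', 10, 10)").show(false)
```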

### Why are the changes needed?
fixing it to match the behavior of conv function to the (almost) only one reference of another DBMS, MySQL

### Does this PR introduce _any_ user-facing change?
Yes, as pointed out above

### How was this patch tested?
Add test

Closes #32684 from dgd-contributor/SPARK-33428.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-27 12:12:39 +00:00
Gengliang Wang 5bcd1c29f0 [SPARK-35535][SQL] New data source V2 API: LocalScan
### What changes were proposed in this pull request?

Add a new data source V2 API: `LocalScan`. It is a special Scan that runs locally on the driver instead of on executors.

### Why are the changes needed?

The new API improves the flexibility of the DSV2 API. It allows developers to implement connectors for data sources with small data sizes.
For example, we can build a data source for Spark History applications from the Spark History Server's RESTful API. The result set is small, and fetching all the results from the Spark driver is good enough. Making it a data source allows us to run SQL queries with filters or table joins.
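A hedged sketch of what such a connector's scan could look like; the `rows()` shape is an assumption about the new interface beyond what this description states:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.LocalScan
import org.apache.spark.sql.types.StructType

// Sketch: a scan whose (small) result set was pre-fetched on the driver,
// e.g. from the Spark History Server REST API, so no executor tasks run.
class HistoryAppScan(appSchema: StructType, prefetched: Array[InternalRow])
    extends LocalScan {
  override def readSchema(): StructType = appSchema
  override def rows(): Array[InternalRow] = prefetched
}
```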

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test

Closes #32678 from gengliangwang/LocalScan.

Lead-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-27 19:31:56 +09:00
gengjiaan 3e190807bc [SPARK-35057][SQL] Group exception messages in hive/thriftserver
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32646 from beliefer/SPARK-35057.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-27 07:31:14 +00:00
Cheng Su 5cc17ba0c7 [SPARK-35351][SQL][FOLLOWUP] Avoid using loaded variable for LEFT ANTI SMJ code-gen
### What changes were proposed in this pull request?

This is a followup from https://github.com/apache/spark/pull/32547#discussion_r639916474: for LEFT ANTI join, we do not need to depend on the `loaded` variable, as in `codegenAnti` we load `streamedAfter` no more than once (i.e. assign the streamed-row column values that are not used in the join condition).

### Why are the changes needed?

Avoid unnecessary processing in code-gen (though it's just `boolean $loaded = false;`, and `if (!$loaded) { $loaded = true; }`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests in `ExistenceJoinSuite`.

Closes #32681 from c21/join-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-27 04:59:54 +00:00
Xinrong Meng 79a2a46cdb [SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases
### What changes were proposed in this pull request?

Re-enable some pandas-on-Spark test cases.

### Why are the changes needed?

The pandas version in GitHub Actions has been upgraded, so we can re-enable some pandas-on-Spark test cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32682 from xinrong-databricks/enable_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-27 12:33:30 +09:00
Takuya UESHIN d6d3209c2f [SPARK-35537][PYTHON] Introduce a util function spark_column_equals
### What changes were proposed in this pull request?

Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not.

### Why are the changes needed?

In pandas on Spark, there are some places that check whether the underlying expressions of columns are the same, but it's done one-by-one.
We should introduce a util function for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The existing tests.

Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-27 12:14:43 +09:00
Xinrong Meng 8cc7232ffa [SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType
### What changes were proposed in this pull request?

BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. This PR introduces BinaryOps for it.

### Why are the changes needed?

The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its own special form of addition, namely concatenation.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.

```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0    [49, 49]
1    [50, 50]
2    [51, 51]
dtype: object
>>> psser + b'1'
0    [49, 49]
1    [50, 49]
2    [51, 49]
dtype: object
```

### How was this patch tested?

Unit tests.

Closes #32665 from xinrong-databricks/datatypeops_binary.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 14:30:24 -07:00
Xinrong Meng 266608d50e [SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps
### What changes were proposed in this pull request?

This PR proposes to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.

### Why are the changes needed?

StructType, ArrayType, and MapType are not accepted by DataTypeOps now.

We should handle these complex types. Among them:

- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.

- StructOps will be helpful to make to/from pandas conversion data-type-based.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0    [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0    [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```

### How was this patch tested?

Unit tests.

Closes #32626 from xinrong-databricks/datatypeop_complex.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 10:40:01 -07:00
ulysses-you dc7b5a99f0 [SPARK-35282][SQL] Support AQE side shuffled hash join formula using rule
### What changes were proposed in this pull request?

The main code change is:
* Change the rule `DemoteBroadcastHashJoin` to `DynamicJoinSelection` and add shuffle hash join selection code.
* Specify a join strategy hint `SHUFFLE_HASH` if AQE thinks a join can be converted to SHJ.
* Skip the `preferSortMerge` config check on the AQE side if a join can be converted to SHJ.

### Why are the changes needed?

Use AQE runtime statistics to decide if we can use shuffled hash join instead of sort merge join. Currently, the formula for shuffled hash join selection does not work due to the dynamic shuffle partition number.

Add a new config `spark.sql.adaptive.shuffledHashJoinLocalMapThreshold` to decide if a join can be converted to shuffled hash join safely.
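A hedged usage sketch (the threshold value below is an arbitrary example):

```scala
// With runtime map-output statistics under the threshold, AQE can hint
// SHUFFLE_HASH instead of falling back to sort merge join.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffledHashJoinLocalMapThreshold", "64MB")
```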

### Does this PR introduce _any_ user-facing change?

Yes, add a new config.

### How was this patch tested?

Add test.

Closes #32550 from ulysses-you/SPARK-35282-2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-26 14:16:04 +00:00
Cheng Su dd677770d8 [SPARK-35529][SQL] Add fallback metrics for hash aggregate
### What changes were proposed in this pull request?

Add metrics to record how many tasks fall back to sort-based aggregation for hash aggregation. This will help developers and users debug and optimize queries. Object hash aggregation already has similar metrics.

### Why are the changes needed?

Help developers and users to debug and optimize query with hash aggregation.

### Does this PR introduce _any_ user-facing change?

Yes, the added metrics will show up in Spark web UI.
Example:
<img width="604" alt="Screen Shot 2021-05-26 at 12 17 08 AM" src="https://user-images.githubusercontent.com/4629931/119618437-bf3c5880-bdb7-11eb-89bb-5b88db78639f.png">

### How was this patch tested?

Changed unit test in `SQLMetricsSuite.scala`.

Closes #32671 from c21/agg-metrics.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-26 11:28:12 +00:00
Kousuke Saruta 116a97e153 [SPARK-35501][SQL][TESTS] Add a feature for removing pulled container image for docker integration tests
### What changes were proposed in this pull request?

This PR adds a feature for removing the pulled container image after every docker integration test finishes.
This feature is enabled by the new property `spark.test.docker.removePulledImage`.

### Why are the changes needed?

For idempotency.
I'm trying to add docker integration tests to GA in SPARK-35483 (#32631), but I noticed that `jdbc.OracleIntegrationSuite` consistently fails (https://github.com/sarutak/spark/runs/2646707235?check_suite_focus=true).
I investigated the reason and found that the host on GA is short of storage capacity:
```
 ORACLE PASSWORD FOR SYS AND SYSTEM: oracle
The location '/opt/oracle' specified for database files has insufficient space.
Database creation needs at least '4.5GB' disk space.
Specify a different database file destination that has enough space in the configuration file '/etc/sysconfig/oracle-xe-18c.conf'.
mv: cannot stat '/opt/oracle/product/18c/dbhomeXE/dbs/spfileXE.ora': No such file or directory
mv: cannot stat '/opt/oracle/product/18c/dbhomeXE/dbs/orapwXE': No such file or directory
ORACLE_HOME = [/home/oracle] ? ORACLE_BASE environment variable is not being set since this
information is not available for the current user ID .
You can set ORACLE_BASE manually if it is required.
Resetting ORACLE_BASE to its previous value or ORACLE_HOME
The Oracle base remains unchanged with value /opt/oracle
#####################################
########### E R R O R ###############
DATABASE SETUP WAS NOT SUCCESSFUL!
Please check output for further info!
########### E R R O R ###############
#####################################
The following output is now a tail of the alert.log:
tail: cannot open '/opt/oracle/diag/rdbms/*/*/trace/alert*.log' for reading: No such file or directory
tail: no files remaining
```

With this feature, the pulled container image is removed to keep enough capacity for `jdbc.OracleIntegrationSuite` in GA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the following things.

* A container image which is absent from the local repository is removed after the test finishes if `spark.test.container.removePulledImage` is `true`.
* A container image which is present in the local repository is not removed after the test finishes even if `spark.test.container.removePulledImage` is `true`.
* A container image is not removed, regardless of its presence in the local repository, if `spark.test.container.removePulledImage` is `false`.

Closes #32652 from sarutak/docker-image-rm.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 17:24:29 +09:00
Kousuke Saruta 50fefc6447 [SPARK-35527][SQL][TESTS] Fix HiveExternalCatalogVersionsSuite to pass with Java 11
### What changes were proposed in this pull request?

This PR fixes `HiveExternalCatalogVersionsSuite`.
With this change, only the <major>.<minor> version is set to `spark.sql.hive.metastore.version`.
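The truncation is roughly:

```scala
// Sketch: keep only <major>.<minor> of the builtin Hive version,
// e.g. "2.3.8" -> "2.3", which the older branches accept.
val builtinHiveVersion = "2.3.8"
val metastoreVersion = builtinHiveVersion.split('.').take(2).mkString(".")
println(metastoreVersion) // 2.3
```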

### Why are the changes needed?

I'm personally checking whether all the tests pass with Java 11 for the current `master`, and I found that `HiveExternalCatalogVersionsSuite` fails.
The reason is that Spark 3.0.2 and 3.1.1 don't accept `2.3.8` as a hive metastore version.

`HiveExternalCatalogVersionsSuite` downloads Spark releases from https://dist.apache.org/repos/dist/release/spark/ and runs tests for each release. The Spark releases are `3.0.2` and `3.1.1` for the current `master` for now.
e47e615c0e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (L239-L259)

With Java 11, the suite runs with a hive metastore version which corresponds to the builtin Hive version, and that's `2.3.8` for the current `master`.
20750a3f9e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (L62-L66)

But `branch-3.0` and `branch-3.1` don't accept `2.3.8`, so the suite fails with Java 11.

Another solution would be backporting SPARK-34271 (#31371), but after [a discussion](https://github.com/apache/spark/pull/32668#issuecomment-848435170), we prefer to fix the test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests with CI.

Closes #32670 from sarutak/fix-version-suite-for-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 17:20:51 +09:00
itholic 79a6b0cc8a [SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes to move text data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for text data source options has been separated into different pages for each language API. This makes managing the many options inconvenient, so it is more efficient to manage all options in a single page and link to that page from each language's API documentation.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "Text Files" page
<img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png">

- Python
<img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png">

- Scala
<img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png">

- Java
<img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32660 from itholic/SPARK-35509.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 17:12:49 +09:00
Vinod KC e3c6907c99 [SPARK-35490][BUILD] Update json4s to 3.7.0-M11
### What changes were proposed in this pull request?
This PR aims to upgrade json4s from 3.7.0-M5 to 3.7.0-M11.

Note: json4s versions greater than 3.7.0-M11 are not binary compatible with Spark's third-party jars.

### Why are the changes needed?
Multiple defect fixes and improvements  like

https://github.com/json4s/json4s/issues/750
https://github.com/json4s/json4s/issues/554
https://github.com/json4s/json4s/issues/715

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran with the existing UTs

Closes #32636 from vinodkc/br_build_upgrade_json4s.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-26 11:10:14 +03:00
Linhong Liu af1dba7ca5 [SPARK-35440][SQL] Add function type to ExpressionInfo for UDF
### What changes were proposed in this pull request?
Add the function type, such as "scala_udf", "python_udf", "java_udf", "hive", "built-in" to the `ExpressionInfo` for UDF.

### Why are the changes needed?
Make the `ExpressionInfo` of UDF more meaningful

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing and newly added UT

Closes #32587 from linhongliu-db/udf-language.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-26 04:40:53 +00:00
Hyukjin Kwon 20750a3f9e [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception
### What changes were proposed in this pull request?

This PR proposes to use proper built-in exceptions instead of the plain `Exception` in Python.

While I am here, I fixed another minor issue in `DataFrame.schema` as well:

```diff
- except AttributeError as e:
-     raise Exception(
-         "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+     raise ValueError(
+         "Unable to parse datatype from schema. %s" % e) from e
```

Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.

### Why are the changes needed?

For users to expect the proper exceptions.

### Does this PR introduce _any_ user-facing change?

Yeah, the exception classes became different, but they should be compatible because the previous exception was the plain `Exception`, which the other exceptions inherit from.

### How was this patch tested?

Existing unittests should cover this.

Closes #31238

Closes #32650 from HyukjinKwon/SPARK-32194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 11:54:40 +09:00
Hyukjin Kwon e47e615c0e [SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions
### What changes were proposed in this pull request?

This PR enables GitHub Actions to test PySpark with Python 3.9.

### Why are the changes needed?

To verify the support of Python 3.9.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Existing tests should cover.

Closes #32657 from HyukjinKwon/SPARK-35506.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 09:25:51 +09:00
Vinod KC 4ba1db91f0 [SPARK-35513][BUILD] Update joda-time to 2.10.10
### What changes were proposed in this pull request?
This PR aims to upgrade joda-time from 2.10.5 to 2.10.10

### Why are the changes needed?
Improvement and bug fixes in joda-time
https://www.joda.org/joda-time/changes-report.html#a2.10.10

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran with the existing UTs

Closes #32661 from vinodkc/br_build_upgrade_joda_time.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-25 11:29:03 -07:00