Commit graph

30393 commits

Author SHA1 Message Date
Felix Cheung 1530876615 [SPARK-35495][R] Change SparkR maintainer for CRAN
### What changes were proposed in this pull request?

As discussed, update SparkR maintainer for future release.

### Why are the changes needed?

Shivaram will not be able to work with this in the future, so we would like to migrate off the maintainer contact email.

shivaram

Closes #32642 from felixcheung/sparkr-maintainer.

Authored-by: Felix Cheung <felixcheung@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 19:08:54 -07:00
Takuya UESHIN 1b75c2494c [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the files:

- `python/pyspark/pandas/spark/accessors.py`
- `python/pyspark/pandas/typedef/typehints.py`
- `python/pyspark/pandas/utils.py`

and fixes the mypy check failures.

### Why are the changes needed?

We should enable more `disallow_untyped_defs` mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32627 from ueshin/issues/SPARK-35467_35468_35477/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-24 09:31:00 +09:00
Liang-Chi Hsieh 9e1b204bcc [SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression
### What changes were proposed in this pull request?

This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination.

For example, previously we find common expressions among conditions of `CaseWhen`, but children expressions are also counted into. We should not count these children expressions as common expressions.

### Why are the changes needed?

If the redundant children expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #32559 from viirya/SPARK-35410.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 08:24:44 -07:00
Vinod KC d5868ebc39 [SPARK-35492][BUILD] Upgrade httpcore to 4.4.14
### What changes were proposed in this pull request?
This PR aims to upgrade Apache HttpCore from 4.4.12 to 4.4.14.

### Why are the changes needed?
Stability improvements in httpcore 4.4.14

- Bug fix: Non-blocking TLSv1.3 connections can end up in an infinite event spin when closed concurrently by the local and the remote endpoints.
- HTTPCORE-647: Non-blocking connection terminated due to 'java.io.IOException: Broken pipe' can enter an infinite loop flushing buffered output data.
- PR #201, HTTPCORE-634: Fix race condition in AbstractConnPool that can cause internal state
-   corruption
- HTTPCORE-612: DefaultConnectionReuseStrategy incorrectly used int to represent Content-Length value
-   instead of long

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
 With Jenkins Tests

Closes #32638 from vinodkc/br_build_upgrade_httpcore.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 08:16:50 -07:00
Kent Yao 96b0548ab6 [SPARK-35493][K8S] make spark.blockManager.port fallback for spark.driver.blockManager.port as same as other cluster managers
### What changes were proposed in this pull request?

`spark.blockManager.port` does not work for k8s driver pods now, we should make it work as other cluster managers.

### Why are the changes needed?

`spark.blockManager.port` should be able to work for spark driver pod

### Does this PR introduce _any_ user-facing change?

yes, `spark.blockManager.port` will be respect iff it is present  && `spark.driver.blockManager.port` is absent

### How was this patch tested?

new tests

Closes #32639 from yaooqinn/SPARK-35493.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 08:07:57 -07:00
Kousuke Saruta 1a43415d8d [SPARK-35226][SQL][FOLLOWUP] Fix test added in SPARK-35226 for DB2KrbIntegrationSuite
### What changes were proposed in this pull request?

This PR fixes an test added in SPARK-35226 (#32344).

### Why are the changes needed?

`SELECT 1` seems non-valid query for DB2.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

DB2KrbIntegrationSuite passes on my laptop.

I also confirmed all the KrbIntegrationSuites pass with the following command.
```
build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```

Closes #32632 from sarutak/followup-SPARK-35226.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-22 22:31:43 -07:00
Kousuke Saruta a59a214610 [MINOR][FOLLOWUP] Update SHA for the oracle docker image
### What changes were proposed in this pull request?

This PR updates SHA for the oracle docker image in the comment in `OracleIntegrationSuite` and `v2.OracleIntegrationSuite`.
The SHA for the latest image is `3f422c4a35b423dfcdbcc57a84f01db6c82eb6c1`

### Why are the changes needed?

The script name for creating the oracle docker image is changed in #32629, following the latest image so we also need to update the corresponding SHA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The value is from `git log`.

Closes #32630 from sarutak/followup-oracle-script-name.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-22 19:52:24 -07:00
Liang-Chi Hsieh 594ffd2db2 [SPARK-35463][BUILD][FOLLOWUP] Redirect output for skipping checksum check
### What changes were proposed in this pull request?

This patch is a followup of SPARK-35463. In SPARK-35463, we output a message to stdout and now we redirect it to stderr.

### Why are the changes needed?

All `echo` statements in `build/mvn` should redirect to stderr if it is not followed by `exit`. It is because we use `build/mvn` to get stdout output by other scripts. If we don't redirect it, we can get invalid output, e.g. got "Skipping checksum because shasum is not installed." as `commons-cli` version.

### Does this PR introduce _any_ user-facing change?

No. Dev only.

### How was this patch tested?

Manually test on internal system.

Closes #32637 from viirya/fix-build.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-22 19:13:33 -07:00
Hyukjin Kwon 1d9f09decb [SPARK-35480][SQL] Make percentile_approx work with pivot
### What changes were proposed in this pull request?

This PR proposes to avoid wrapping if-else to the constant literals for `percentage` and `accuracy` in `percentile_approx`. They are expected to be literals (or foldable expressions).

Pivot works by two phrase aggregations, and it works with manipulating the input to `null` for non-matched values (pivot column and value).

Note that pivot supports an optimized version without such logic with changing input to `null` for some types (non-nested types basically). So the issue fixed by this PR is only for complex types.

```scala
val df = Seq(
  ("a", -1.0), ("a", 5.5), ("a", 2.5), ("b", 3.0), ("b", 5.2)).toDF("type", "value")
  .groupBy().pivot("type", Seq("a", "b")).agg(
    percentile_approx(col("value"), array(lit(0.5)), lit(10000)))
df.show()
```

**Before:**

```
org.apache.spark.sql.AnalysisException: cannot resolve 'percentile_approx((IF((type <=> CAST('a' AS STRING)), value, CAST(NULL AS DOUBLE))), (IF((type <=> CAST('a' AS STRING)), array(0.5D), NULL)), (IF((type <=> CAST('a' AS STRING)), 10000, CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage provided must be a constant literal;
'Aggregate [percentile_approx(if ((type#7 <=> cast(a as string))) value#8 else cast(null as double), if ((type#7 <=> cast(a as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS a#16, percentile_approx(if ((type#7 <=> cast(b as string))) value#8 else cast(null as double), if ((type#7 <=> cast(b as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(b as string))) 10000 else cast(null as int), 0, 0) AS b#18]
+- Project [_1#2 AS type#7, _2#3 AS value#8]
   +- LocalRelation [_1#2, _2#3]
```

**After:**

```
+-----+-----+
|    a|    b|
+-----+-----+
|[2.5]|[3.0]|
+-----+-----+
```

### Why are the changes needed?

To make percentile_approx work with pivot as expected

### Does this PR introduce _any_ user-facing change?

Yes. It threw an exception but now it returns a correct result as shown above.

### How was this patch tested?

Manually tested and unit test was added.

Closes #32619 from HyukjinKwon/SPARK-35480.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-23 07:35:43 +09:00
Dongjoon Hyun fa424ac2b8 [SPARK-35489][BUILD] Upgrade ORC to 1.6.8
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 1.6.8.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://orc.apache.org/news/2021/05/21/ORC-1.6.8/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the existing CIs.

Closes #32635 from dongjoon-hyun/SPARK-35489.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-22 10:35:40 -07:00
Vinod KC 003294ce1d [SPARK-35488][BUILD] Upgrade ASM to 7.3.1
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 7.3.1.

- https://issues.apache.org/jira/browse/XBEAN-323
- https://asm.ow2.io/versions.html

### Why are the changes needed?
ASM 7.3.1 bring following changes

- new V15 constant
- experimental support for PermittedSubtypes and RecordComponent
- bug fixes
- - 317885: SKIP_DEBUG now skips MethodParameters attributes

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran with the existing UTs

Closes #32634 from vinodkc/br_build_upgrade_asm.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-05-23 02:33:15 +09:00
Kousuke Saruta 0549caf07a [MINOR][SQL] Change the script name for creating oracle docker image
### What changes were proposed in this pull request?

This PR changes the name in the comment in `jdbc.OracleIntegrationSuite` and `v2.OracleIntegrationSuite`.
The script is for creating oracle docker image.

### Why are the changes needed?

The name of the script is `buildContainerImage`, not `buildDockerImage` now.
- d918f5a4c6 (diff-be303ab32e74192aca829e5ea259a0aec07aac23a6049120fb337ec4efa601b0)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that I can build the image with `./buildContainerImage.sh -v 18.4.0 -x`.

Closes #32629 from sarutak/change-oracle-container-script-name.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 22:58:44 -07:00
Kousuke Saruta 6bd6e46aec [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
### What changes were proposed in this pull request?

This PR upgrades Dropwizard metrics to 4.2.0.
I also modified the corresponding links in `docs/monitoring.md`.

### Why are the changes needed?

The latest version was released last week and it contains some improvements.
https://github.com/dropwizard/metrics/releases/tag/v4.2.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build succeeds and all the modified links are reachable.

Closes #32628 from sarutak/upgrade-dropwizard-4.2.0.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 22:53:32 -07:00
Wenchen Fan b624b7e93f [SPARK-28551][SQL][FOLLOWUP] Use the corrected hadoop conf
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/32411, to fix a mistake and use `sparkSession.sessionState.newHadoopConf` which includes SQL configs instead of `sparkSession.sparkContext.hadoopConfiguration` .

### Why are the changes needed?

fix mistake

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #32618 from cloud-fan/follow1.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-05-22 10:33:57 +08:00
Takuya UESHIN 2616d5cc1d [SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module
### What changes were proposed in this pull request?

Sets up the `mypy` configuration to enable `disallow_untyped_defs` check for pandas APIs on Spark module.

### Why are the changes needed?

Currently many functions in the main codes in pandas APIs on Spark module are still missing type annotations and disabled `mypy` check `disallow_untyped_defs`.

We should add more type annotations and enable the mypy check.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32614 from ueshin/issues/SPARK-35465/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-21 11:03:35 -07:00
Liang-Chi Hsieh 066944c1bd [SPARK-35439][SQL] Children subexpr should come first than parent subexpr
### What changes were proposed in this pull request?

This patch sorts equivalent expressions based on their child-parent relation.

### Why are the changes needed?

`EquivalentExpressions` maintains a map of equivalent expressions. It is `HashMap` now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.

For example, we have two different expressions `Add(Literal(1), Literal(2))` and `Add(Literal(3), add)`.

Case 1: child subexpr comes first.
```scala
addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))
```

Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.
```
addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))
```

As we are going to sort equivalent expressions at all, we don't need `LinkedHashMap` but just do sorting.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #32586 from viirya/use-listhashmap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-21 10:49:35 -07:00
Dongjoon Hyun cc05daa884 [SPARK-34558][SQL][TESTS][FOLLOWUP] Fix a test to use a unknown filesystem
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/32622 to fix a test case.

### Why are the changes needed?

Fix a wrong test case name and fix the test case to cause the expected error correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32623 from dongjoon-hyun/SPARK-34558.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 10:13:49 -07:00
Wenchen Fan 7274e3a4d2 [SPARK-34558][SQL][FOLLOWUP] Do not fail Spark startup with a broken FileSystem
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31671

https://github.com/apache/spark/pull/31671 qualifies the warehouse at the beginning, which may fail Spark startup if something goes wrong, like the underlying FileSystem can't be initialized.

This PR falls back to the old behavior and leave the warehouse path unqualified if qualifying fails.

### Why are the changes needed?

Fix a regression. It's important to be always able to start Spark app (e.g. spark-shell), so that we can debug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

a new test case

Closes #32622 from cloud-fan/follow2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 08:51:10 -07:00
Kent Yao d957426351 [SPARK-35482][K8S] Use spark.blockManager.port not the wrong spark.blockmanager.port in BasicExecutorFeatureStep
### What changes were proposed in this pull request?

most spark conf keys are case sensitive, including `spark.blockManager.port`, we can not get the correct port number with `spark.blockmanager.port`.

This PR changes the wrong key to `spark.blockManager.port` in `BasicExecutorFeatureStep`.

This PR also ensures a fast fail when the port value is invalid for executor containers. When 0 is specified(it is valid as random port, but invalid as a k8s request), it should not be put in the `containerPort` field of executor pod desc. We do not expect executor pods to continuously fail to create because of invalid requests.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #32621 from yaooqinn/SPARK-35482.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 08:27:49 -07:00
itholic d2bdd6595e [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move Parquet data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for Parquet data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "Parquet Files" page
![Screen Shot 2021-05-21 at 1 35 08 PM](https://user-images.githubusercontent.com/44108233/119082866-e7375f00-ba39-11eb-9ade-a931a5957b34.png)

- Python
![Screen Shot 2021-05-21 at 1 38 27 PM](https://user-images.githubusercontent.com/44108233/119082879-eef70380-ba39-11eb-9e8e-ee50eed98dbe.png)

- Scala
![Screen Shot 2021-05-21 at 1 36 52 PM](https://user-images.githubusercontent.com/44108233/119082884-f1595d80-ba39-11eb-98d5-966657df65f7.png)

- Java
![Screen Shot 2021-05-21 at 1 37 19 PM](https://user-images.githubusercontent.com/44108233/119082888-f4544e00-ba39-11eb-8bf8-47ce78ec0b01.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32161 from itholic/SPARK-34491.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:05:49 +09:00
itholic 419ddcb2a4 [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move JSON data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for JSON data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "JSON Files" page
<img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png">

- Python
<img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png">

- Scala
<img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png">

- Java
<img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32204 from itholic/SPARK-35081.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:05:13 +09:00
itholic 0fe65b5365 [SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes move ORC data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for ORC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "ORC Files" page
![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png)

- Python
![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png)

- Scala
![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png)

- Java
![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32546 from itholic/SPARK-35395.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 18:03:57 +09:00
yi.wu 34284c0649 [SPARK-35454][SQL] One LogicalPlan can match multiple dataset ids
### What changes were proposed in this pull request?

Change the type of `DATASET_ID_TAG` from `Long` to `HashSet[Long]` to allow the logical plan to match multiple datasets.

### Why are the changes needed?

During the transformation from one Dataset to another Dataset, the DATASET_ID_TAG of logical plan won't change if the plan itself doesn't change:

b5241c97b1/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (L234-L237)

However, dataset id always changes even if the logical plan doesn't change:
b5241c97b1/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (L207-L208)

And this can lead to the mismatch between dataset's id and col's __dataset_id. E.g.,

```scala
  test("SPARK-28344: fail ambiguous self join - Dataset.colRegex as column ref") {
    // The test can fail if we change it to:
    // val df1 = spark.range(3).toDF()
    // val df2 = df1.filter($"id" > 0).toDF()
    val df1 = spark.range(3)
    val df2 = df1.filter($"id" > 0)

    withSQLConf(
      SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
      SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
      assertAmbiguousSelfJoin(df1.join(df2, df1.colRegex("id") > df2.colRegex("id")))
    }
  }
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #32616 from Ngone51/fix-ambiguous-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-21 16:15:17 +08:00
ulysses-you 83737852f0 [SPARK-35063][SQL][FOLLOWUP] Fix scala 2.13 error
### What changes were proposed in this pull request?

### Why are the changes needed?

Fix scala compile error.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #32617 from ulysses-you/scala2-13.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-21 07:59:26 +00:00
Max Gekk 2b08070e79 [SPARK-35427][SQL][TESTS] Check the EXCEPTION rebase mode for Avro/Parquet
### What changes were proposed in this pull request?
Add tests to check the `EXCEPTION` rebase mode explicitly in the datasources:
- Parquet: `DATE` type and `TIMESTAMP`: `INT96`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS`
- Avro: `DATE` type and `TIMESTAMP`: `timestamp-millis` and `timestamp-micros`.

### Why are the changes needed?
1. To improve test coverage
2. The `EXCEPTION` rebase mode should be checked independently from the default settings.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *AvroV2Suite"
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #32574 from MaxGekk/test-rebase-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-21 06:18:06 +00:00
gengjiaan c740c097e0 [SPARK-35063][SQL] Group exception messages in sql/catalyst
### What changes were proposed in this pull request?
This PR group exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32478 from beliefer/SPARK-35063.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-21 06:15:26 +00:00
Takeshi Yamamuro 1a923f5319 [SPARK-35479][SQL] Format PartitionFilters IN strings in scan nodes
### What changes were proposed in this pull request?

This PR proposes to format strings correctly for `PushedFilters`. For example, `explain()` for a query below prints `v in (array('a'))` as `PushedFilters: [In(v, [WrappedArray(a)])]`;

```
scala> sql("create table t (v array<string>) using parquet")
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [WrappedArray(a),null])], ReadSchema: struct<v:array<string>>
```
This PR makes `explain()` print it as `PushedFilters: [In(v, [[a]])]`;
```
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [[a],null])], ReadSchema: struct<v:array<string>>
```
NOTE: This PR includes a bugfix caused by #32577 (See the cloud-fan comment: https://github.com/apache/spark/pull/32577/files#r636108150).

### Why are the changes needed?

To improve explain strings.

### Does this PR introduce _any_ user-facing change?

Yes, this PR improves the explain strings for pushed-down filters.

### How was this patch tested?

Added tests in `SQLQueryTestSuite`.

Closes #32615 from maropu/ExplainPartitionFilters.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-21 05:45:45 +00:00
yi.wu e1296eab5f [SPARK-35445][SQL] Reduce the execution time of DeduplicateRelations
### What changes were proposed in this pull request?

This PR reduces the execution time of `DeduplicateRelations` by:

1) use `Set` instead `Seq` to check duplicate relations

2) avoid plan output traverse and attribute rewrites when there are no changes in the children plan

### Why are the changes needed?

Rule `DeduplicateRelations` is slow.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `TPCDSQuerySuite` and checked the run time of `DeduplicateRelations`. The time has been reduced by 77.9% after this PR.

Closes #32590 from Ngone51/improve-dedup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-21 13:25:37 +08:00
Kent Yao 2e9936db93 [SPARK-35456][CORE] Print the invalid value in config validation error message
### What changes were proposed in this pull request?

Print the invalid value in config validation error message for `checkValue` just like `checkValues`

### Why are the changes needed?

Invalid configuration values may come in many ways, this PR can help different kinds of users or developers to identify what the config the error is related to

### Does this PR introduce _any_ user-facing change?

yes, but only error msg
### How was this patch tested?

yes, modified tests

Closes #32600 from yaooqinn/SPARK-35456.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-21 14:22:29 +09:00
itholic 6b912e4179 [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
### What changes were proposed in this pull request?

There are still naming related to Koalas in test and function name. This PR addressed them to fit pandas-on-spark.
- kdf -> psdf
- kser -> psser
- kidx -> psidx
- kmidx -> psmidx
- to_koalas() -> to_pandas_on_spark()

### Why are the changes needed?

This is because the name Koalas is no longer used in PySpark.

### Does this PR introduce _any_ user-facing change?

`to_koalas()` function is renamed to `to_pandas_on_spark()`

### How was this patch tested?

Tested in local manually.
After changing the related naming, I checked them one by one.

Closes #32516 from itholic/SPARK-35364.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-20 15:08:30 -07:00
Dongjoon Hyun 8e13b8c3d2 [SPARK-35463][BUILD] Skip checking checksum on a system without shasum
### What changes were proposed in this pull request?

Not every build system has `shasum`. This PR aims to skip checksum checks on a system without `shasum`.

### Why are the changes needed?

**PREPARE**
```
$ docker run -it --rm -v $PWD:/spark openjdk:11-slim /bin/bash
roota0e001a6e50f:/# cd /spark/
roota0e001a6e50f:/spark# apt-get update
roota0e001a6e50f:/spark# apt-get install curl
roota0e001a6e50f:/spark# build/mvn clean
```

**BEFORE (Failure due to `command not found`)**
```
roota0e001a6e50f:/spark# build/mvn clean
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download
exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
Veryfing checksum from /spark/build/apache-maven-3.6.3-bin.tar.gz.sha512
build/mvn: line 81: shasum: command not found
Bad checksum from https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
```

**AFTER**
```
roota0e001a6e50f:/spark# build/mvn clean
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
Skipping checksum because shasum is not installed.
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download
exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
Skipping checksum because shasum is not installed.
Using `mvn` from path: /spark/build/apache-maven-3.6.3/bin/mvn
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the build.

### How was this patch tested?

Manually with the above process.

Closes #32613 from dongjoon-hyun/SPARK-35463.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-20 14:36:50 -07:00
Dongjoon Hyun 3757c1803d [SPARK-35462][BUILD][K8S] Upgrade Kubernetes-client to 5.4.0 to support K8s 1.21 models
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` from 5.3.1 to 5.4.0 to support K8s 1.21 models officially.

### Why are the changes needed?

`kubernetes-client` 5.4.0 has `Kubernetes Model v1.21.0`
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.0

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the CIs including Jenkins K8s IT.
- https://github.com/apache/spark/pull/32612#issuecomment-845456039

I tested K8s IT with the following versions.
- minikube version: v1.20.0
- K8s Client Version: v1.21.0
- Server Version: v1.21.0

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 18 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32612 from dongjoon-hyun/SPARK-35462.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-20 14:34:58 -07:00
Yikun Jiang 38fbc0b4f7 [SPARK-35458][BUILD] Use > /dev/null to replace -q in shasum
## What changes were proposed in this pull request?
Use ` > /dev/null` to replace `-q` in shasum validation.

### Why are the changes needed?
In PR https://github.com/apache/spark/pull/32505 , added the shasum check on maven. The `shasum -a 512 -q -c xxx.sha` is used to validate checksum, the `-q` args is for "don't print OK for each successfully verified file", but `-q` arg is introduce in shasum 6.x version.

So we got the `Unknown option: q`.

```
➜  ~ uname -a
Darwin MacBook.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64 x86_64
➜  ~ shasum -v
5.84
➜  ~ shasum -q
Unknown option: q
Type shasum -h for help
```

it makes ARM CI failed:
[1] https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`shasum -a 512 -c wrong.sha  > /dev/null` return code 1 without print
`shasum -a 512 -c right.sha  > /dev/null` return code 0 without print
e2e test:
```
rm -f build/apache-maven-3.6.3-bin.tar.gz
rm -r build/apache-maven-3.6.3-bin
mvn -v
```

Closes #32604 from Yikun/patch-5.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-05-20 15:59:13 -05:00
Yikun Jiang 3c3533d845 [SPARK-35373][BUILD][FOLLOWUP] Fix "binary operator expected" error on build/mvn
### What changes were proposed in this pull request?
change $(command -v curl) to "$(command -v curl)"

### Why are the changes needed?
We need change $(command -v curl) to "$(command -v curl)" to make sure it work when `curl` or `wget` is uninstall. othewise raised:
`build/mvn: line 56: [: /root/spark/build/apache-maven-3.6.3-bin.tar.gz: binary operator expected`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
apt remove curl
rm -f build/apache-maven-3.6.3-bin.tar.gz
rm -r build/apache-maven-3.6.3-bin
mvn -v
```

Closes #32608 from Yikun/patch-6.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-05-20 12:25:23 -05:00
Max Gekk 2bd32548f5 [SPARK-35459][SQL][TESTS] Move AvroRowReaderSuite to a separate file
### What changes were proposed in this pull request?
Move `AvroRowReaderSuite` out from `AvroSuite.scala` and place it to `AvroRowReaderSuite.scala`.

### Why are the changes needed?
To improve code maintenance. Usually, independent test suites are placed to separate files.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *AvroRowReaderSuite"
$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"
```

Closes #32607 from MaxGekk/move-AvroRowReaderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 20:04:10 +09:00
weixiuli 4869e437da [SPARK-35424][SHUFFLE] Remove some useless code in the ExternalBlockHandler
### What changes were proposed in this pull request?

Remove some useless code in the ExternalBlockHandler.

### Why are the changes needed?
There is some useless code in the ExternalBlockHandler, so we may remove it.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Exist unittests.

Closes #32571 from weixiuli/SPARK-35424.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 19:03:14 +09:00
Bo Zhang e170e63955 [SPARK-35457][BUILD] Bump ANTLR runtime version to 4.8
### What changes were proposed in this pull request?
This PR changes the antlr4-runtime version from 4.8-1 to 4.8.

### Why are the changes needed?
Version 4.8 is the official release version, with a proper release note (see https://github.com/antlr/antlr4/releases) and artifiacts listed in https://www.antlr.org/download/index.html.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Will rely on tests in the PR.

Closes #32603 from bozhang2820/antlr-4.8.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-20 17:24:40 +09:00
Vinod KC bdd8e1dbb1 [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory
### What changes were proposed in this pull request?

CTAS with location clause acts as an insert overwrite. This can cause problems when there are subdirectories within a location directory.
This causes some users to accidentally wipe out directories with very important data. We should not allow CTAS with location to a non-empty directory.

### Why are the changes needed?

Hive already handled this scenario: HIVE-11319

Steps to reproduce:

```scala
sql("""create external table  `demo_CTAS`( `comment` string) PARTITIONED BY (`col1` string, `col2` string) STORED AS parquet location '/tmp/u1/demo_CTAS'""")
sql("""INSERT OVERWRITE TABLE demo_CTAS partition (col1='1',col2='1') VALUES ('abc')""")
sql("select* from demo_CTAS").show
sql("""create table ctas1 location '/tmp/u2/ctas1' as select * from demo_CTAS""")
sql("select* from ctas1").show
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```

Before the fix: Both create table operations will succeed. But values in table ctas1 will be replaced by ctas2 accidentally.

After the fix: `create table ctas2...` will throw `AnalysisException`:

```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```

### Does this PR introduce _any_ user-facing change?
Yes, if the location directory is not empty, CTAS with location will throw AnalysisException

```
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```

`CREATE TABLE AS SELECT` with non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, need to  set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `true`. , default value is `false`.
Updated SQL migration guide.

### How was this patch tested?
Test case added in SQLQuerySuite.scala

Closes #32411 from vinodkc/br_fixCTAS_nonempty_dir.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-20 06:13:18 +00:00
yi.wu 00b63c8dc2 [SPARK-27991][CORE] Defer the fetch request on Netty OOM
### What changes were proposed in this pull request?

This PR proposes a workaround to address the Netty OOM issue (SPARK-24989, SPARK-27991):

Basically, `ShuffleBlockFetcherIterator` would catch the `OutOfDirectMemoryError` from Netty and then set a global flag for the shuffle module. Any pending fetch requests would be deferred if there're in-flight requests until the flag is unset. And the flag will be unset when there's a fetch request succeed.

Note that catching the Netty OOM rather than abort the application is feasible because Netty manage its own memory region (offheap by default) separately. So Netty OOM doesn't mean the memory shortage of Spark.

### Why are the changes needed?

The Netty OOM issue is a very corner case. It usually happens in the large-scale cluster, where a reduce task could fetch shuffle blocks from hundreds of nodes concurrently in a short time. Internally, we found a cluster that has created 260+ clients within 6s before throwing Netty OOM.

Although Spark has configurations, e.g., `spark.reducer.maxReqsInFlight` to tune the number of concurrent requests, it's usually not a easy decision for the user to set a reasonable value regarding the workloads, machine resources, etc. But with this fix, Spark would heal the Netty memory issue itself without any specific configurations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #32287 from Ngone51/SPARK-27991.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-20 04:26:56 +00:00
Ashray Jain de59e01aa4 [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable
Kubernetes supports marking secrets and config maps as immutable to gain performance.

https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable
https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable

For K8s clusters that run many thousands of Spark applications, this can yield significant reduction in load on the kube-apiserver.

From the K8s docs:

> For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages:
> - protects you from accidental (or unwanted) updates that could cause applications outages
> - improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable.

For any secrets and config maps we create in Spark that are immutable, we could mark them as immutable by including the following when building the secret/config map
```
.withImmutable(true)
```
This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21

### What changes were proposed in this pull request?
All K8s secrets and config maps created by Spark are marked "immutable".

### Why are the changes needed?
See description above.

### Does this PR introduce _any_ user-facing change?
Don't think so

### How was this patch tested?
Augmented existing unit tests.

Closes #32588 from ashrayjain/patch-1.

Authored-by: Ashray Jain <ashrayjain@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-19 21:25:33 -07:00
Xinrong Meng a970f8505d [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32596 from xinrong-databricks/datatypeop_arith_fix.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-19 19:47:00 -07:00
Hyukjin Kwon 7eaabf4df5 [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format
### What changes were proposed in this pull request?

This PR avoids using f-string format that's a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, this breaks unofficial support of Python 3.5.

This specific f-string format looks something unnecessary, and doesn't look worth enough to remove such unofficial support because of one string format in an error message.

**NOTE** that this PR doesn't mean that we're maintaining Python 3.5 since we dropped. It just looks like too much to remove that unofficial support only because of one string format and error message.

### Why are the changes needed?

To keep unofficial Python 3.5 support

### Does this PR introduce _any_ user-facing change?

Officially nope.

### How was this patch tested?

Ran the linters.

Closes #32598 from HyukjinKwon/SPARK-35408=followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 10:47:31 +09:00
Takuya UESHIN c06480519e [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows
### What changes were proposed in this pull request?

Follows checkout-merge way to use the latest commit for linter, or other workflows.

### Why are the changes needed?

For linter or other workflows besides build-and-tests, we should follow checkout-merge way to use the latest commit; otherwise, those could work on the old settings.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Existing tests.

Closes #32597 from ueshin/issues/SPARK-35450/infra.

Lead-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-20 10:07:28 +09:00
Takuya UESHIN d44e6c7f10 Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures"
This reverts commit d1b24d8aba.
2021-05-19 16:49:47 -07:00
Cheng Su 586caae3cc [SPARK-35438][SQL][DOCS] Minor documentation fix for window physical operator
### What changes were proposed in this pull request?

As title. Fixed two places where the documentation for window operator has some error.

### Why are the changes needed?

Help people read code for window operator more easily in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32585 from c21/minor-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-20 08:47:19 +09:00
Xinrong Meng d1b24d8aba [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32469 from xinrong-databricks/datatypeop_arith.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-19 15:05:32 -07:00
Andy Grove 52e3cf9ff5 [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use
### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.

This PR simply changes the key used to lookup cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than using the canonicalized form of the original exchange.

### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange and then Spark replaces it with a cached columnar exchange, which is not compatible, and this causes queries to fail.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this but it would involve implementing a mock columnar exchange in the tests so would be quite a bit of work. If anyone has ideas on other ways to test this I am happy to hear them.

Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-05-19 07:45:26 -05:00
shahid 12142130cd [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
### What changes were proposed in this pull request?
Updating column stats for Union operator stats estimation
### Why are the changes needed?
This is a followup PR to update the null count also in the Union stats operator estimation. https://github.com/apache/spark/pull/30334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated UTs, manual testing

Closes #32494 from shahidki31/shahid/updateNullCountForUnion.

Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-19 21:23:19 +09:00
Kousuke Saruta 9283bebbbd [SPARK-35418][SQL] Add sentences function to functions.{scala,py}
### What changes were proposed in this pull request?

This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.

### Why are the changes needed?

This function can be only used from SQL for now.
It's good if we can use this function from Scala/Python code as well as SQL.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use this function from Scala and Python.

### How was this patch tested?

New test.

Closes #32566 from sarutak/sentences-function.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-05-19 20:07:28 +09:00
shahid 46f7d780d3 [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
### What changes were proposed in this pull request?
Update histogram statistics for RANGE operator stats estimation
### Why are the changes needed?
If histogram optimization is enabled, this statistics can be used in various cost based optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs. Manual test.

Closes #32498 from shahidki31/shahid/histogram.

Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-19 16:49:32 +09:00