Commit graph

31343 commits

Xinrong Meng 7aee29016a [SPARK-36880][PYTHON] Inline type hints for python/pyspark/sql/functions.py
### What changes were proposed in this pull request?
Inline type hints from `python/pyspark/sql/functions.pyi` to `python/pyspark/sql/functions.py`.

### Why are the changes needed?
Currently, the type hint stub file `python/pyspark/sql/functions.pyi` shows the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
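
As an illustration, a minimal sketch of the difference (the function and annotations below are just an example of the shape, not the actual Spark source):

```python
from typing import Union

from pyspark.sql.column import Column


# Before: functions.py defined `def upper(col): ...` and functions.pyi carried the
# hint `def upper(col: Union[Column, str]) -> Column: ...` in a separate stub.
# After inlining (sketch), the annotation lives on the implementation itself, so
# static checkers can also look inside the function body:
def upper(col: Union[Column, str]) -> Column:
    """Example signature only; the real function delegates to the JVM."""
    ...
```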

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test.

Closes #34130 from xinrong-databricks/inline_functions.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-05 17:13:16 +09:00
dchvn nguyen d6786e036d [SPARK-36711][PYTHON] Support multi-index in new syntax
### What changes were proposed in this pull request?
Support multi-index in the new syntax to specify index data types.

### Why are the changes needed?
Support multi-index in the new syntax to specify index data types.

https://issues.apache.org/jira/browse/SPARK-36707

### Does this PR introduce _any_ user-facing change?
After this PR, users can use:

``` python
>>> ps.DataFrame[[int, int],[int, int]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]

>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
>>> pdf = pd.DataFrame([[1,2,3],[2,3,4],[4,5,6]], index=idx, columns=["a", "b", "c"])
>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]

>>> ps.DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]

>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]

```

### How was this patch tested?
Existing tests.

Closes #34176 from dchvn/SPARK-36711.

Authored-by: dchvn nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-05 12:45:16 +09:00
Kousuke Saruta fa1805db48 [SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join
### What changes were proposed in this pull request?

This PR fixes an issue where an ambiguous self join can't be detected if the left and right DataFrames are swapped.
Here is an example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause seems to be that an inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34172 from sarutak/fix-deduplication-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-10-05 11:16:55 +08:00
Max Gekk 65eb4a2129 [SPARK-36920][SQL] Support ANSI intervals by ABS()
### What changes were proposed in this pull request?
In the PR, I propose to handle ANSI interval types by the `Abs` expression, and the `abs()` function as a consequence of that:
- for positive and zero intervals, `ABS()` returns the input value unchanged,
- for the minimal supported values (`Int.MinValue` months for the year-month interval and `Long.MinValue` microseconds for the day-time interval), `ABS()` throws an arithmetic overflow exception,
- for other supported negative intervals, `ABS()` negates its input and returns a positive interval.

For example:
```sql
spark-sql> SELECT ABS(INTERVAL -'10-8' YEAR TO MONTH);
10-8
spark-sql> SELECT ABS(INTERVAL '-10 01:02:03.123456' DAY TO SECOND);
10 01:02:03.123456000
```

### Why are the changes needed?
To improve user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No, this PR just extends `ABS()` by supporting new types.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ArithmeticExpressionSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```

Closes #34169 from MaxGekk/abs-ansi-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-05 10:43:28 +09:00
Xinrong Meng b30e214483 [SPARK-36881][PYTHON] Inline type hints for python/pyspark/sql/catalog.py
### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/catalog.py.

### Why are the changes needed?
Currently, the type hint stub file python/pyspark/sql/catalog.pyi is used. We may leverage static type checking by inlining the type hints.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test.

Closes #34133 from xinrong-databricks/inline_catalog.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-05 10:39:34 +09:00
Xinrong Meng 81812606cc [SPARK-36906][PYTHON] Inline type hints for conf.py and observation.py in python/pyspark/sql
### What changes were proposed in this pull request?
Inline type hints for conf.py and observation.py in python/pyspark/sql.

### Why are the changes needed?
Currently, there are type hint stub files (*.pyi) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.

### Does this PR introduce _any_ user-facing change?
No.

It has a DOC typo fix:
`Metrics are aggregation expressions, which are applied to the DataFrame while **is** is being`
is changed to
`Metrics are aggregation expressions, which are applied to the DataFrame while **it** is being`

### How was this patch tested?
Existing test.

Closes #34159 from xinrong-databricks/inline_conf_observation.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-10-04 14:32:25 -07:00
Huaxin Gao 167664896d [MINOR][SQL] Use SQLConf.resolver for caseSensitiveResolution/caseInsensitiveResolution
### What changes were proposed in this pull request?
Use `SQLConf.resolver` for `caseSensitiveResolution`/`caseInsensitiveResolution` instead of having a new method.

### Why are the changes needed?
remove redundant code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing code

Closes #34171 from huaxingao/minor.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-10-04 12:57:48 -07:00
Takuya UESHIN 38d39812c1 [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut
### What changes were proposed in this pull request?

Fix `DataFrameGroupBy.apply` without shortcut.

Pandas' `DataFrameGroupBy.apply` sometimes behaves inconsistently when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:

```py
>>> pdf = pd.DataFrame(
...      {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...      columns=["a", "b", "c"],
... )

>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a  0  1
b
1  1  2
```

If there is only one group, it returns a "wide" `DataFrame` instead of `Series`.

In our non-shortcut path, there is always only one group because it runs via `groupby-applyInPandas`, so we will get a `DataFrame`; we should then convert it to a `Series` ourselves.
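
A rough pandas-only sketch of the needed conversion (an illustration of the idea, not the exact code in this PR):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [1, 1]}, columns=["a", "b"])

# With a single group, apply() returns a "wide" DataFrame...
wide = pdf.groupby("b").apply(lambda x: x["a"])
# ...which can be stacked back into a Series keyed by (group key, original index),
# matching the shape of the multi-group output.
series = wide.stack()
print(series)
```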

### Why are the changes needed?

`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.

```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```

### Does this PR introduce _any_ user-facing change?

The error above will be gone:

```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
```

### How was this patch tested?

Added tests.

Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-03 12:25:19 +09:00
Chao Sun 14d4ceeb73 [SPARK-36891][SQL] Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding
### What changes were proposed in this pull request?

Add a new test suite `ParquetVectorizedSuite` to provide more coverage on vectorized Parquet decoding logic, with different combinations on column index, dictionary, batch size, page size, etc.

To facilitate the test, this also refactored `SpecificParquetRecordReaderBase` and makes the Parquet row group reader pluggable.

### Why are the changes needed?

Currently `ParquetIOSuite` and `ParquetColumnIndexSuite` only test the high-level API, which is insufficient, especially after the introduction of column index support, for which we want to cover various combinations involving row ranges, first row index, batch size, page size, etc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test suite.

Closes #34149 from sunchao/SPARK-36891-parquet-test.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-10-01 23:35:23 -07:00
Dmitriy Fishman 25db6b45c7 [MINOR][DOCS] Typo fix in cloud-integration.md
### What changes were proposed in this pull request?
Typo fix

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #34129 from fishmandev/patch-1.

Authored-by: Dmitriy Fishman <fishman.code@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-10-01 11:26:21 -05:00
Venkata krishnan Sowrirajan 73747ecb97 [SPARK-36038][CORE] Speculation metrics summary at stage level
### What changes were proposed in this pull request?

Currently there are no speculation metrics available in Spark at the application/job/stage level. This PR adds some basic speculation metrics for a stage when speculative execution is enabled.

This is similar to the existing stage-level metrics, tracking numTotal (total number of speculated tasks), numCompleted (successful speculated tasks), numFailed (failed speculated tasks), numKilled (killed speculated tasks), etc.

This new set of metrics helps in understanding the speculative execution feature in the context of the application and in further tuning the speculative execution config knobs.

Screenshot of Spark UI with speculation summary:
![Screen Shot 2021-09-22 at 12 12 20 PM](https://user-images.githubusercontent.com/8871522/135321311-db7699ad-f1ae-4729-afea-d1e2c4e86103.png)

Screenshot of Spark UI with API output:
![Screen Shot 2021-09-22 at 12 10 37 PM](https://user-images.githubusercontent.com/8871522/135321486-4dbb7a67-5580-47f8-bccf-81c758c2e988.png)
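
For context, a minimal sketch of enabling speculative execution, which is the precondition for these metrics to be populated (standard core configs; the values are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")                    # illustrative app name
    .config("spark.speculation", "true")            # enable speculative execution
    .config("spark.speculation.multiplier", "1.5")  # how much slower a task must be to be speculated
    .getOrCreate()
)
```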

### Why are the changes needed?

Additional metrics for speculative execution.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests added and also deployed in our internal platform for quite some time now.

Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Ron Hu <rhu@linkedin.com>
Co-authored-by: Thejdeep Gudivada <tgudivada@linkedin.com>

Closes #33253 from venkata91/speculation-metrics.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-10-01 16:59:29 +09:00
itholic 13ddc91668 [SPARK-36435][PYTHON] Implement MultiIndex.equal_levels
### What changes were proposed in this pull request?
This PR proposes implementing `MultiIndex.equal_levels`.

```python
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
>>> psmidx1.equal_levels(psmidx2)
True

>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
>>> psmidx1.equal_levels(psmidx2)
True
```

This was originally proposed in https://github.com/databricks/koalas/pull/1789, and all reviews in the original PR have been resolved.

### Why are the changes needed?

We should support the pandas API as much as possible for pandas-on-Spark module.

### Does this PR introduce _any_ user-facing change?

Yes, the `MultiIndex.equal_levels` API is available.

### How was this patch tested?

Unittests

Closes #34113 from itholic/SPARK-36435.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-01 14:07:55 +09:00
Karen Feng 4aeddb81d1 [SPARK-36870][SQL] Introduce INTERNAL_ERROR error class
### What changes were proposed in this pull request?

Adds the `INTERNAL_ERROR` error class and the `isInternalError` API to `SparkThrowable`.
Removes existing error classes that are internal-only and replaces them with `INTERNAL_ERROR`.

### Why are the changes needed?

Makes it easy for end-users to diagnose whether an error is an internal error. If an end-user encounters an internal error, it should be reported immediately. This also limits the number of error classes, making it easy to audit. We do not need high-quality error messages for internal errors, as they should not be exposed to the end-user.

### Does this PR introduce _any_ user-facing change?

Yes; this changes the error class in master.

### How was this patch tested?

Unit tests

Closes #34123 from karenfeng/internal-error-class.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-01 13:46:45 +09:00
Kousuke Saruta 7c155806ed [SPARK-36830][SQL] Support reading and writing ANSI intervals from/to JSON datasources
### What changes were proposed in this pull request?

This PR aims to support reading and writing ANSI intervals from/to JSON datasources.
With this change, interval data is written in a literal form like `{"col":"INTERVAL '1-2' YEAR TO MONTH"}`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").json(...)
```
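
An equivalent round trip via the Python API (a sketch; the output path is illustrative):

```python
# Write ANSI intervals to JSON, then read them back with an explicit schema.
df = spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH AS col")
df.write.mode("overwrite").json("/tmp/interval_json")  # illustrative path
readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").json("/tmp/interval_json")
readDF.show(truncate=False)
```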

### Why are the changes needed?

For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to JSON datasources.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test. It covers both V1 and V2 sources.

Closes #34155 from sarutak/ansi-interval-json-source.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-30 20:22:06 +03:00
Kousuke Saruta 5a32e41e9c [SPARK-36865][PYTHON][DOCS] Add PySpark API document of session_window
### What changes were proposed in this pull request?

This PR adds the PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.

### Why are the changes needed?

To provide PySpark users with the API document of the newly added function.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`make html` in `python/docs` and get the following docs.

[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)

[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)

Closes #34118 from sarutak/python-session-window-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-30 16:51:12 +09:00
Leona Yoda 17e3ca6df5 [SPARK-36899][R] Support ILIKE API on R
### What changes were proposed in this pull request?

Support ILIKE (case insensitive LIKE) API on R.

### Why are the changes needed?

ILIKE statement on SQL interface is supported by SPARK-36674.
This PR will support R API for it.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call ilike from R.

### How was this patch tested?

Unit tests.

Closes #34152 from yoda-mon/r-ilike.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-30 14:43:09 +09:00
Xinrong Meng ad5a53511e [SPARK-36896][PYTHON] Return boolean for dropTempView and dropGlobalTempView
### What changes were proposed in this pull request?
Currently `dropTempView` and `dropGlobalTempView` don't have a return value, which conflicts with their docstring:
`Returns true if this view is dropped successfully, false otherwise.` That's also not consistent with the same API in other languages.

The PR proposes a fix for that.

### Why are the changes needed?
Be consistent with API in other languages.

### Does this PR introduce _any_ user-facing change?
Yes.
#### From
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
>>> spark.catalog.dropTempView("my_table")

# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
>>> spark.catalog.dropGlobalTempView("my_table")
```

#### To
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
True
>>> spark.catalog.dropTempView("my_table")
False

# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
True
>>> spark.catalog.dropGlobalTempView("my_table")
False
```

### How was this patch tested?
Existing tests.

Closes #34147 from xinrong-databricks/fix_return.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-30 14:27:00 +09:00
Warren Zhu b60e576eaf [SPARK-36893][BUILD][MESOS] Upgrade mesos into 1.4.3
### What changes were proposed in this pull request?
Upgrade Mesos to 1.4.3.

### Why are the changes needed?
Fix CVE-2018-11793

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually

Closes #34144 from warrenzhu25/mesos.

Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-29 21:49:22 -07:00
Richard Chen 8357572af5 [SPARK-36888][SQL] Add test cases for sha2 function

### What changes were proposed in this pull request?

Adding tests to `HashExpressionsSuite` to test the `sha2` function for bit lengths of 0 and 512. Also add comments to clarify an existing ambiguous comment.

### Why are the changes needed?
Currently, the `sha2` function with bit lengths of `0` and `512` was not tested. This PR adds those tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Ran the sha2 test in `HashExpressionsSuite` to ensure added tests pass.

Closes #34145 from richardc-db/add_sha_tests.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-30 09:52:40 +09:00
Dongjoon Hyun aa9064ad96 [SPARK-36883][INFRA] Upgrade R version to 4.1.1 in CI images
### What changes were proposed in this pull request?

This PR aims to upgrade the GitHub Action CI image to recover from CRAN installation failures.

### Why are the changes needed?

Sometimes, the GitHub Action linter job fails:
- https://github.com/apache/spark/runs/3739748809

The new image has R 4.1.1 and will recover from the failure.
```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210928 R --version
R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass `GitHub Action`.

Closes #34138 from dongjoon-hyun/SPARK-36883.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-29 11:39:01 -07:00
Kousuke Saruta a8734e3f16 [SPARK-36831][SQL] Support reading and writing ANSI intervals from/to CSV datasources
### What changes were proposed in this pull request?

This PR aims to support reading and writing ANSI intervals from/to CSV datasources.
With this change, interval data is written in a literal form like `INTERVAL '1-2' YEAR TO MONTH`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").csv(...)
```

### Why are the changes needed?

For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to CSV datasources.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test. It covers both V1 and V2 sources.

Closes #34142 from sarutak/ansi-interval-csv-source.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-29 21:22:34 +03:00
Angerszhuuuu dcada3d48c [SPARK-36624][YARN] In yarn client mode, when ApplicationMaster failed with KILLED/FAILED, driver should exit with code not 0
### What changes were proposed in this pull request?
In the current code for yarn client mode, even when users use `yarn application -kill` to kill the application, the driver side still exits with code 0. This behavior prevents the job scheduler from knowing that the job did not succeed, and the user doesn't know either.

In this case we should exit the program with a non-zero code.

### Why are the changes needed?
Make the application's status clearer to the scheduler/user.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #33873 from AngersZhuuuu/SPDI-36624.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-09-29 11:12:01 -05:00
sychen d003db34c4 [SPARK-36550][SQL] Propagate the cause when UDF reflection fails
### What changes were proposed in this pull request?
When the exception is an InvocationTargetException, get its cause and stack trace.

### Why are the changes needed?
Now when UDF reflection fails, InvocationTargetException is thrown, but it is not a specific exception.
```
Error in query: No handler for Hive UDF 'XXX': java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
manual test

Closes #33796 from cxzl25/SPARK-36550.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-29 08:30:50 -05:00
Max Gekk bbd1318b7b [SPARK-36889][SQL] Respect spark.sql.parquet.filterPushdown by v2 parquet scan builder
### What changes were proposed in this pull request?
In the PR, I propose to take into account the SQL config `spark.sql.parquet.filterPushdown` while building the array of pushed down filters in v2 `ParquetScanBuilder`.

### Why are the changes needed?
Before the changes, `explain()` outputs some pushed filters even when filter pushdown to Parquet is disabled:
```
== Physical Plan ==
*(1) Filter (isnotnull(c0#7) AND (c0#7 = 1))
+- *(1) ColumnarToRow
   +- BatchScan[c0#7] ParquetScan DataFilters: [isnotnull(c0#7), (c0#7 = 1)], ... PushedFilters: [IsNotNull(c0), EqualTo(c0,1)] RuntimeFilters: []
```
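
A minimal way to check this from Python (a sketch; the path is illustrative and the first config just forces the DSv2 Parquet reader):

```python
spark.conf.set("spark.sql.sources.useV1SourceList", "")      # use the v2 Parquet scan
spark.conf.set("spark.sql.parquet.filterPushdown", "false")  # disable pushdown

spark.range(10).withColumnRenamed("id", "c0").write.mode("overwrite").parquet("/tmp/t")
# After the fix, the plan should not list any PushedFilters for this scan.
spark.read.parquet("/tmp/t").filter("c0 = 1").explain()
```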

### Does this PR introduce _any_ user-facing change?
If users parse `explain()`'s output, this PR can affect them. But in general, it shouldn't. Also, note that the PR affects DSv2 only.

### How was this patch tested?
By running new test in `ParquetV2FilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```

Closes #34140 from MaxGekk/fix-parquet-v2-pushed-filters.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 19:47:21 +09:00
ulysses-you 0e23bd7b7c [SPARK-34980][SQL] Support coalesce partition through union in AQE
### What changes were proposed in this pull request?

- Split the plan into several groups, where every child of a union is a new group
- Coalesce partitions for every group

### Why are the changes needed?

#### First Issue
The rule `CoalesceShufflePartitions` can only coalesce partitions if
* the leaf node is a ShuffleQueryStage
* all shuffles have the same partition number

With `Union`, it might break the assumption. Let's say we have such plan
```
Union
   HashAggregate
      ShuffleQueryStage
   FileScan
```
`CoalesceShufflePartitions` cannot optimize it, and the resulting partition number would be `shuffle partition + FileScan partition`, which can be quite large.

It's better to support partial optimization with `Union`.
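
A minimal reproduction sketch in Python (assuming AQE and partition coalescing are enabled; the data and grouping are illustrative):

```python
from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# One shuffled child (an aggregate) unioned with a non-shuffled child (a plain range scan).
agg = spark.range(0, 10000).groupBy((F.col("id") % 10).alias("k")).count()
scan = spark.range(0, 10).select(F.col("id").alias("k"), F.lit(1).cast("long").alias("count"))
agg.union(scan).collect()
```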

#### Second Issue
The coalesce-partition formula uses the **sum value** as the total input size, which is not friendly for union; see
```
// ShufflePartitionsUtil.coalescePartitions
val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
```

So for such case:
```
Union
   HashAggregate
      ShuffleQueryStage
   HashAggregate
      ShuffleQueryStage
```
The `CoalesceShufflePartitions` rule will return an unexpected partition number.

### Does this PR introduce _any_ user-facing change?

Probably yes, the resulting partition number might change.

### How was this patch tested?

Add test.

Closes #32084 from ulysses-you/SPARK-34980.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <ulyssesyou18@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-29 16:31:52 +08:00
Leona Yoda ca1c09d88c [SPARK-36882][PYTHON] Support ILIKE API on Python
### What changes were proposed in this pull request?

Support ILIKE (case insensitive LIKE) API on Python.

### Why are the changes needed?

ILIKE statement on SQL interface is supported by SPARK-36674.
This PR will support Python API for it.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call `ilike` from Python.
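
A quick usage sketch (illustrative data):

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("Spark",), ("spark",), ("flink",)], ["name"])
df.filter(F.col("name").ilike("spark%")).show()  # matches both "Spark" and "spark"
```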

### How was this patch tested?

Unit tests.

Closes #34135 from yoda-mon/python-ilike.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-29 15:04:03 +09:00
Hyukjin Kwon 13c2b711e4 [MINOR][DOCS] Mention other Python dependency tools in documentation
### What changes were proposed in this pull request?

Self-contained.

### Why are the changes needed?

For user's more information on available Python dependency management in PySpark.

### Does this PR introduce _any_ user-facing change?
Yes, documentation change.

### How was this patch tested?
Manually built the docs and checked the results:
<img width="918" alt="Screen Shot 2021-09-29 at 10 11 56 AM" src="https://user-images.githubusercontent.com/6477701/135186536-2f271378-d06b-4c6b-a4be-691ce395db9f.png">
<img width="976" alt="Screen Shot 2021-09-29 at 10 12 22 AM" src="https://user-images.githubusercontent.com/6477701/135186541-0f4c5615-bc49-48e2-affd-dc2f5c0334bf.png">
<img width="920" alt="Screen Shot 2021-09-29 at 10 12 42 AM" src="https://user-images.githubusercontent.com/6477701/135186551-0b613096-7c86-4562-b345-ddd60208367b.png">

Closes #34134 from HyukjinKwon/minor-docs-py-deps.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 14:45:58 +09:00
Huaxin Gao d2bb359338 [SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface
### What changes were proposed in this pull request?
Indexes are database objects created on one or more columns of a table. Indexes are used to improve query performance. A detailed explanation of database indexes is here: https://en.wikipedia.org/wiki/Database_index

This PR adds a `supportsIndex` interface that provides APIs to work with indexes.

### Why are the changes needed?
Many data sources support indexes to improve query performance. In order to take advantage of the index support in a data source, this `supportsIndex` interface is added to let users create/drop an index, list indexes, etc.

### Does this PR introduce _any_ user-facing change?
yes, the following new APIs are added:

- createIndex
- dropIndex
- indexExists
- listIndexes

New SQL syntax:
```

CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]

    column_index_property_list: column_name [OPTIONS(indexPropertyList)]  [ ,  . . . ]
    indexPropertyList: index_property_name = index_property_value [ ,  . . . ]

DROP INDEX index_name

```
### How was this patch tested?
Only the interface is added for now. Tests will be added along with the implementation.

Closes #33754 from huaxingao/index_interface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-29 10:49:29 +08:00
Liang-Chi Hsieh bfcc596398 [MINOR][SS][TEST] Remove unsupportedOperationCheck setting for TextSocketStreamSuite
### What changes were proposed in this pull request?

This patch simply removes a few `unsupportedOperationCheck` in `TextSocketStreamSuite`.

### Why are the changes needed?

`unsupportedOperationCheck` is used to disable the check for unsupported operations. Since we are not testing unsupported operations, it was unnecessarily set in `TextSocketStreamSuite` and could cause unexpected errors due to the missing check.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #34132 from viirya/minor-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-28 17:45:22 -07:00
Takuya UESHIN 05c0fa5738 [SPARK-36813][SQL][PYTHON] Propose an infrastructure of as-of join and implement ps.merge_asof
### What changes were proposed in this pull request?

Proposes an infrastructure for as-of join and implements `ps.merge_asof` here.

1. Introduce `AsOfJoin` logical plan
2. Rewrite the plan in the optimize phase:

- From something like (SQL syntax is not determined):

```sql
SELECT * FROM left ASOF JOIN right ON (condition, as_of on(left.t, right.t), tolerance)
```

- To

```sql
SELECT left.*, __right__.*
FROM (
     SELECT
          left.*,
          (
               SELECT MIN_BY(STRUCT(right.*), left.t - right.t) AS __nearest_right__
               FROM right
               WHERE condition AND left.t >= right.t AND right.t >= left.t - tolerance
          ) as __right__
     FROM left
     )
WHERE __right__ IS NOT NULL
```

3. The rewritten scalar-subquery will be handled by the existing decorrelation framework.

Note: APIs on SQL DataFrames and SQL syntax are TBD (e.g., [SPARK-22947](https://issues.apache.org/jira/browse/SPARK-22947)), although there are temporary APIs added here.

### Why are the changes needed?

Pandas' `merge_asof` or as-of join for SQL/DataFrame is useful for time series analysis.

### Does this PR introduce _any_ user-facing change?

Yes. `ps.merge_asof` can be used.

```py
>>> quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

>>> trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

>>> ps.merge_asof(
...    trades, quotes, on="time", by="ticker"
... ).sort_values(["time", "ticker", "price"]).reset_index(drop=True)
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
4 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93

>>> ps.merge_asof(
...     trades,
...     quotes,
...     on="time",
...     by="ticker",
...     tolerance=F.expr("INTERVAL 2 MILLISECONDS")  # pd.Timedelta("2ms")
... ).sort_values(["time", "ticker", "price"]).reset_index(drop=True)
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
4 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
```

Note: As `IntervalType` literal is not supported yet, we have to specify the `IntervalType` value with `F.expr` as a workaround.

### How was this patch tested?

Added tests.

Closes #34053 from ueshin/issues/SPARK-36813/merge_asof.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 09:27:38 +09:00
Takuya UESHIN a9b4c27f12 [SPARK-36846][PYTHON] Inline most of type hint files under pyspark/sql/pandas folder
### What changes were proposed in this pull request?

Inlines type hint files under `pyspark/sql/pandas` folder, except for `pyspark/sql/pandas/functions.pyi` and files under `pyspark/sql/pandas/_typing`.

- Since the file contains a lot of overloads, we should revisit and manage it separately.
- We can't inline files under `pyspark/sql/pandas/_typing` because it includes new syntax for type hints.

### Why are the changes needed?

Currently there are type hint stub files (`*.pyi`) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #34101 from ueshin/issues/SPARK-36846/inline_typehints.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 09:25:18 +09:00
Wenchen Fan d233977d18 [SPARK-36848][SQL][FOLLOWUP] Simplify SHOW CURRENT NAMESPACE command
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/34104

`SHOW CURRENT NAMESPACE` is a very simple command that does not involve the v2 catalog API, does not need analysis, and does not have children. We can simply use `RunnableCommand` to avoid defining a physical plan.
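
Usage stays the same; a quick sketch from Python:

```python
# Still returns the current catalog and namespace.
spark.sql("SHOW CURRENT NAMESPACE").show()
```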

### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #34128 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-28 10:31:59 -07:00
Max Gekk 8c51fd85d6 [SPARK-36866][SQL] Pushdown filters with ANSI interval values to parquet
### What changes were proposed in this pull request?
In the PR, I propose to pushdown filters with ANSI intervals as filter values to the parquet datasource. After the changes, filter values are pushed down with the following values via Filter API:
- `java.time.Period` for year-month filters
- `java.time.Duration` for day-time filters.

At the Parquet filter level, we don't have info about Catalyst's types (`YearMonthIntervalType` and `DayTimeIntervalType`) but only about the primitive Parquet types `INT32` and `INT64`. As a consequence, Spark has to convert filter values "dynamically" to `Int`/`Long` while building Parquet filters.
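
A small Python sketch of the scenario (the path is illustrative; the filter value is an ANSI day-time interval, so it can now be pushed down as an `INT64` predicate):

```python
# Write a day-time interval column to Parquet, then filter on it.
df = spark.sql("SELECT id, make_dt_interval(0, CAST(id AS INT)) AS d FROM range(10)")
df.write.mode("overwrite").parquet("/tmp/intervals")  # illustrative path
spark.read.parquet("/tmp/intervals") \
    .filter("d > INTERVAL '0 05:00:00' DAY TO SECOND") \
    .explain()
```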

### Why are the changes needed?
The PR https://github.com/apache/spark/pull/34057 supported ANSI intervals in the Parquet datasource as INT32 (year-month interval) and INT64 (day-time interval), but filters with such values were not pushed down. So, compared to primitive types, ANSI intervals could suffer from worse read performance. This PR aims to solve the issue and achieve feature parity with other types.

### Does this PR introduce _any_ user-facing change?
No, the changes might only influence the performance of the Parquet datasource.

### How was this patch tested?
By running new tests in `ParquetFilterSuite`:
```
$ build/sbt clean "test:testOnly *ParquetV1FilterSuite"
$ build/sbt clean "test:testOnly *ParquetV2FilterSuite"
```

Closes #34115 from MaxGekk/interval-parquet-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-28 20:09:33 +08:00
Richard Chen 6c6291b3f6 [SPARK-36836][SQL] Fix incorrect result in sha2 expression
### What changes were proposed in this pull request?

`sha2(input, bit_length)` returns incorrect results when `bit_length == 224` for all inputs.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.

Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```

This appears to happen because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then transformed into a hex string, rather than directly being interpreted as a hex string.

### Why are the changes needed?

`sha2(input, bit_length)` with a `bit_length` input of `224` would previously return the incorrect result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test to `HashExpressionsSuite.scala`, which previously failed and now passes.

Closes #34086 from richardc-db/sha224.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 18:38:20 +08:00
Chao Sun 53f58b6e51 [SPARK-36873][BUILD][TEST-MAVEN] Add provided Guava dependency for network-yarn module
### What changes were proposed in this pull request?

Add provided Guava dependency to `network-yarn` module.

### Why are the changes needed?

In Spark 3.1 and earlier, the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212, where we moved to the shaded Hadoop client, which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so it picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now it is no longer able to locate Guava classes:

```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
  symbol:   class VisibleForTesting
  location: class org.apache.spark.network.yarn.YarnShuffleService
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested with the above `mvn` command and it's now passing.

Closes #34125 from sunchao/SPARK-36873.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 18:23:30 +08:00
Chao Sun 8d8b71058c [SPARK-36863][BUILD] Update dependency manifest files for all released Spark artifacts
### What changes were proposed in this pull request?

Update `dev/test-dependencies.sh` so that it now covers all released Spark artifacts such as `hadoop-cloud`.

### Why are the changes needed?

Currently `dev/test-dependencies.sh` doesn't cover all released Spark modules. Therefore, if there is any dependency change in those modules (e.g., `hadoop-cloud`), we won't be able to detect it early.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #34119 from sunchao/SPARK-36863-dependency.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-28 19:22:48 +09:00
Liang-Chi Hsieh 4caf71f0d5 [SPARK-36809][SQL] Remove broadcast for InSubqueryExec used in DPP
### What changes were proposed in this pull request?

This patch proposes to remove broadcast variable in `InSubqueryExec` which is used in DPP.

### Why are the changes needed?

Currently we include a broadcast variable in `InSubqueryExec`. We use it to hold the filtering-side query result of DPP. It looks weird because we don't use the result in executors but only need it in the driver during query planning. We already hold the original result, so basically we hold two copies of the query result at this moment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #34051 from viirya/dpp-no-broadcast.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-28 17:34:45 +08:00
copperybean 56c21e8e5a [SPARK-36856][BUILD] Get correct JAVA_HOME for macOS
### What changes were proposed in this pull request?
Currently, JAVA_HOME may be set to the path "/usr" improperly; now JAVA_HOME is fetched from the command "/usr/libexec/java_home" on macOS.

### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.

Closes #34111 from copperybean/work.

Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-28 17:27:02 +08:00
Cheng Su c3cdb89b9d [SPARK-36794][SQL] Ignore duplicated join keys when building relation for SEMI/ANTI hash join
### What changes were proposed in this pull request?

For LEFT SEMI and LEFT ANTI hash equi-join without extra join condition, we only need to keep one row per unique join key(s) inside hash table (`HashedRelation`) when building the hash table. This can help reduce the size of hash table of join.

This PR adds the optimization in `UnsafeHashedRelation` for broadcast hash join and shuffled hash join. The optimization for `LongHashedRelation` will be added later, because it needs more changes to the underlying hash table data structure `LongToUnsafeRowMap` to check whether a key exists in the hash table.
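
A small illustration of why duplicate join keys on the build side don't matter for these join types (illustrative data, not code from the PR):

```python
# LEFT SEMI only asks "does a matching key exist?", so duplicate build-side rows
# per key never add output rows and can be dropped from the hash relation.
left = spark.createDataFrame([(1,), (2,)], ["k"])
right = spark.createDataFrame([(1,), (1,), (1,)], ["k"])
left.join(right, "k", "left_semi").show()  # a single row with k = 1
```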

### Why are the changes needed?

Help reduce the hash table size of join for LEFT SEMI and LEFT ANTI.
This can increase the chance of broadcast join of these queries, and reduce OOM possibility of shuffled hash join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite.scala`.

Just to help ease the review: a lot of files are changed due to the unit test plan changes, but the files below contain the real code changes:

* BroadcastExchangeExec.scala
* BroadcastHashJoinExec.scala
* HashJoin.scala
* HashedRelation.scala
* ShuffledHashJoinExec.scala
* JoinSuite.scala
* DebuggingSuite.scala

All other files change are generated with followed commands:

```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite"
```

Closes #34034 from c21/join-deduplicate.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-28 15:32:46 +08:00
Yuming Wang e024bdc306 Revert "[SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type"
### What changes were proposed in this pull request?

This reverts commit aaa0d2a66b.

### Why are the changes needed?

This approach has 2 disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596

Instead, we can use bloom filter join pruning in the future.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #34116 from wangyum/revert-SPARK-32855.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-09-28 10:30:11 +08:00
Yufei Gu d03999ab88 [SPARK-36821][SQL] Make class ColumnarBatch extendable - addendum
### What changes were proposed in this pull request?
A follow up of https://github.com/apache/spark/pull/34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.

### Why are the changes needed?
A follow-up of https://github.com/apache/spark/pull/34054. Class ColumnarBatch needs to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level deletes (apache/iceberg#3141) in vectorized reads.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A new test is added

Closes #34087 from flyrain/SPARK-36821.

Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-27 18:26:59 +00:00
Huaxin Gao e4e64c7552 [SPARK-36848][SQL] Migrate ShowCurrentNamespaceStatement to v2 command framework
### What changes were proposed in this pull request?

Migrate `ShowCurrentNamespaceStatement` to v2 command framework

### Why are the changes needed?
Migrate to the standard V2 framework

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #34104 from huaxingao/namesp.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-27 10:36:55 -07:00
Angerszhuuuu ad96aee574 [SPARK-36829][SQL] Refactor NULL check for collectionOperators
### What changes were proposed in this pull request?
Refactor NULL check for collectionOperators

### Why are the changes needed?
Refactor NULL check for collectionOperators

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #34077 from AngersZhuuuu/SPARK-36829.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-27 22:14:22 +08:00
Cheng Su 978a915406 [SPARK-32712][SQL] Support writing Hive bucketed table (Hive file formats with Hive hash)
### What changes were proposed in this pull request?

This is to support writing Hive bucketed tables with Hive file formats (the code path for Hive table writes, `InsertIntoHiveTable`). The bucketed table is partitioned with the Hive hash, the same as Hive, Presto, and Trino.

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Same motivation as https://github.com/apache/spark/pull/33432.

### Does this PR introduce _any_ user-facing change?

Yes. Before this PR, writing to these Hive bucketed tables would throw an exception in Spark if the config "hive.enforce.bucketing" or "hive.enforce.sorting" is set to true. After this PR, writing to these Hive bucketed tables succeeds. The table can be read back by Presto and Trino as efficiently as other Hive bucketed tables.

### How was this patch tested?

Modified unit test in `BucketedWriteWithHiveSupportSuite.scala`, to verify bucket file names and each row in each bucket is written properly, for Hive write code path as well.

Closes #34103 from c21/hive-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-27 15:58:41 +08:00
Liang-Chi Hsieh 44070e0e5d [SPARK-36797][SQL] Union should resolve nested columns as top-level columns
### What changes were proposed in this pull request?

This patch proposes to generalize the resolving-by-position behavior to nested columns for Union.

### Why are the changes needed?

Union, by the API definition, resolves columns by position. Currently we only follow this behavior at top-level columns, but not nested columns.

As we are making nested columns first-class citizens, the top-level-only limitation and the distinction between top-level and nested columns do not make sense. We should also resolve nested columns like top-level columns for Union.

### Does this PR introduce _any_ user-facing change?

Yes. After this change, Union also resolves nested columns by position.

### How was this patch tested?

Added tests.

Closes #34038 from viirya/SPARK-36797.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-27 15:51:43 +08:00
Chao Sun f9efdeea8c [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Move spark.yarn.isHadoopProvided to parent pom
### What changes were proposed in this pull request?

Move `spark.yarn.isHadoopProvided` to Spark parent pom, so that under `resource-managers/yarn` we can make `hadoop-3.2` as the default profile.

### Why are the changes needed?

Currently under `resource-managers/yarn` there are 3 maven profiles: `hadoop-provided`, `hadoop-2.7`, and `hadoop-3.2`, of which `hadoop-3.2` is activated by default (via `activeByDefault`). The activation, however, doesn't work when there are other explicitly activated profiles. Specifically, if users build Spark with `hadoop-provided`, Maven will fail because it can't find Hadoop 3.2 related dependencies, which are defined in the `hadoop-3.2` profile section.

To fix the issue, this proposes to move the `hadoop-provided` section to the parent pom. Currently this is only used to define a property `spark.yarn.isHadoopProvided`, and it shouldn't matter where we define it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested via running the command:
```
build/mvn clean package -DskipTests -B -Pmesos -Pyarn -Pkubernetes -Pscala-2.12 -Phadoop-provided
```
which was failing before this PR but is succeeding with it.

Also checked active profiles with the command:
```
build/mvn -Pyarn -Phadoop-provided help:active-profiles
```
and it shows that `hadoop-3.2` is active for `spark-yarn` module now.

Closes #34110 from sunchao/SPARK-36835-followup2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-27 15:17:04 +08:00
Angerszhuuuu 00b986384d [SPARK-36838][SQL] Improve InSet generated code performance
### What changes were proposed in this pull request?
A Set can't check whether a NaN value is contained in it.
With codegen, only when the value set contains NaN do we need to check whether the input value is NaN; otherwise we just need to check whether the Set contains the value.

### Why are the changes needed?
Improve the generated code's performance by only checking for NaN when the Set contains NaN.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #34097 from AngersZhuuuu/SPARK-36838.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-27 12:13:47 +09:00
Dongjoon Hyun 0b1eec133c [SPARK-36859][BUILD][K8S] Upgrade kubernetes-client to 5.8.0
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` dependency from 5.7.3 to 5.8.0 for Apache Spark 3.3.

### Why are the changes needed?

This will add K8s Model v1.22.1 for the developers who are using `ExternalClusterManager` in K8s environments.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.8.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #34109 from dongjoon-hyun/SPARK-36859.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-26 16:57:39 -07:00
PengLei 0fdca1f0df [SPARK-36851][SQL] Incorrect parsing of negative ANSI typed interval literals
### What changes were proposed in this pull request?
Handle incorrect parsing of negative ANSI typed interval literals
[SPARK-36851](https://issues.apache.org/jira/browse/SPARK-36851)

### Why are the changes needed?
Incorrect result:
```
spark-sql> select interval -'1' year;
1-0
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut testcase

Closes #34107 from Peng-Lei/SPARK-36851.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-26 18:43:26 +08:00
Chao Sun 937a74e6e7 [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Fix maven issue for Hadoop 2.7 profile after enabling dependency reduced pom
### What changes were proposed in this pull request?

Fix an issue where Maven may stuck in an infinite loop when building Spark, for Hadoop 2.7 profile.

### Why are the changes needed?

After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, the Spark build stopped working for the Hadoop 2.7 profile and gets stuck in an infinite loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under the `hadoop-2.7` profile, the variables `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact` are both `hadoop-client`, which triggers the issue.

As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #34100 from sunchao/SPARK-36835-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-26 13:39:36 +08:00