Commit graph

31000 commits

Author SHA1 Message Date
Wenchen Fan 5107ad3157 [SPARK-36535][SQL] Refine the sql reference doc
### What changes were proposed in this pull request?

Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`

### Why are the changes needed?

The subitems in the sidebar are quite useless, as the menu page serves the same functionalities:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the manu page and sidebar subitems in sync (The ANSI compliance page is already out of sync).

The sub-menu-pages are only referenced by the sidebar, and duplicates the content of the menu page. As a result, the `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.

The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #33767 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4b015e8d7d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-17 12:46:49 -07:00
Cedric-Magnan e15daa31b3 [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined
### What changes were proposed in this pull request?
Suggesting to refactor the way the _builtin_table is defined in the `python/pyspark/pandas/groupby.py` module.
Pandas has recently refactored the way we import the _builtin_table and is now part of the pandas.core.common module instead of being an attribute of the pandas.core.base.SelectionMixin class.

### Why are the changes needed?
This change is not fully needed but the current implementation redefines this table within pyspark, so any changes of this table from the pandas library would need to be updated in the pyspark repository as well.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully :
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 964dfe254f)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:47:01 -07:00
Gengliang Wang 70635b4b26 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature.  Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:24:09 +08:00
Max Gekk 07c6976f79 [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType`

### Why are the changes needed?
To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance:
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 82a31508af)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:28:07 +03:00
attilapiros eb09be9e68 [SPARK-36052][K8S] Introducing a limit for pending PODs
Introducing a limit for pending PODs (newly created/requested executors included).
This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (decreased by the ones which requested to be deleted) then we can share the remaining pending POD slots among the resource profiles.

Without this PR dynamic allocation could request too many PODs and the K8S scheduler could be overloaded and scheduling of PODs will be affected by the load.

No.

With new unit tests.

Closes #33492 from attilapiros/SPARK-36052.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 1dced492fb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-16 16:06:29 -07:00
Gengliang Wang 41e5144b53 [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains interval type and string type, we can't decide which
interval type the string should be promoted as. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 26d6b952dc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:25 +03:00
Yuanjian Li 4caa43e398 [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.

### Why are the changes needed?
User guide for the new feature.

### Does this PR introduce _any_ user-facing change?
No, doc only.

### How was this patch tested?
Doc only.

Closes #33683 from xuanyuanking/SPARK-36041.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 3d57e00a7f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-16 12:32:19 -07:00
zhuqi-lucas 2fb62e0e3d [SPARK-35548][CORE][SHUFFLE] Handling new attempt has started error message in BlockPushErrorHandler in client
### What changes were proposed in this pull request?
Add a new type of error message in BlockPushErrorHandler which indicates the PushblockStream message is received after a new application attempt has started. This error message should be correctly handled in client without retrying the block push.

### Why are the changes needed?
When we get a block push failure because of the too old attempt, we will not retry pushing the block nor log the exception on the client side.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add the corresponding unit test.

Closes #33617 from zhuqi-lucas/master.

Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 05cd5f97c3)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-16 13:59:54 -05:00
Xinrong Meng cb14a32005 [SPARK-36469][PYTHON] Implement Index.map
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support hat.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 4dcd746025)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-16 11:06:23 -07:00
Kazuyuki Tanimura 9149cad57d [SPARK-32210][CORE] Fix NegativeArraySizeException in MapOutputTracker with large spark.default.parallelism
### What changes were proposed in this pull request?
The current `MapOutputTracker` class may throw `NegativeArraySizeException` with a large number of partitions. Within the serializeOutputStatuses() method, it is trying to compress an array of mapStatuses and outputting the binary data into (Apache)ByteArrayOutputStream . Inside the (Apache)ByteArrayOutputStream.toByteArray(), negative index exception happens because the index is int and overflows (2GB limit) when the output binary size is too large.

This PR proposes two high-level ideas:
  1. Use `org.apache.spark.util.io.ChunkedByteBufferOutputStream`, which has a way to output the underlying buffer as `Array[Array[Byte]]`.
  2. Change the signatures from `Array[Byte]` to `Array[Array[Byte]]` in order to handle over 2GB compressed data.

### Why are the changes needed?
This issue seems to be missed out in the earlier effort of addressing 2GB limitations [SPARK-6235](https://issues.apache.org/jira/browse/SPARK-6235)

Without this fix, `spark.default.parallelism` needs to be kept at the low number. The drawback of setting smaller spark.default.parallelism is that it requires more executor memory (more data per partition). Setting `spark.io.compression.zstd.level` to higher number (default 1) hardly helps.

That essentially means we have the data size limit that for shuffling and does not scale.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed existing tests
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite"
```
Also added a new unit test
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite  -- -z SPARK-32210"
```
Ran the benchmark using GitHub Actions and didn't not observe any performance penalties. The results are attached in this PR
```
core/benchmarks/MapStatusesSerDeserBenchmark-jdk11-results.txt
core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
```

Closes #33721 from kazuyukitanimura/SPARK-32210.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8ee464cd7a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-16 09:11:51 -07:00
Venkata krishnan Sowrirajan 233af3d239 [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
### What changes were proposed in this pull request?

Document the push-based shuffle feature with a high level overview of the feature and corresponding configuration options for both shuffle server side as well as client side. This is how the changes to the doc looks on the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png))

### Why are the changes needed?

Helps users understand the feature

### Does this PR introduce _any_ user-facing change?

Docs

### How was this patch tested?

N/A

Closes #33615 from venkata91/SPARK-36374.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-16 10:25:33 -05:00
Liang-Chi Hsieh 3aa933b162 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows.

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:16 +09:00
Sean Owen b8c1014e23 Update Spark key negotiation protocol 2021-08-14 09:08:29 -05:00
Huaxin Gao ede1d1e9a7 [SPARK-34952][SQL][FOLLOWUP] Normalize pushed down aggregate col name and group by col name
### What changes were proposed in this pull request?
Normalize pushed down aggregate col names and group by col names ...

### Why are the changes needed?
to handle case sensitive col names

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modify existing test

Closes #33739 from huaxingao/normalize.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3f8ec0dae4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-13 22:31:30 -07:00
Gengliang Wang c898a940e2 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal
### What changes were proposed in this pull request?

If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The related binary operation for Interval contains
- Add
- Subtract
- Comparisions

Note that `Interval Multiple/Divide StringLiteral` is valid as them is not binary operators(the left and right are not of the same type). This PR also add tests for them.

### Why are the changes needed?

Avoid ambiguously implicit casting string literals to interval types.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion is not released yet.

### How was this patch tested?

New tests.

Closes #33737 from gengliangwang/disallowStringAndInterval.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ecdea91602)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-14 10:50:43 +08:00
yi.wu 101f720cc9 [SPARK-32920][CORE][FOLLOW-UP] Fix string interpolator in the log
### What changes were proposed in this pull request?

fix string interpolator

### Why are the changes needed?

To log the correct stage info.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existed tests.

Closes #33738 from Ngone51/fix.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a47ceaf549)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 21:44:30 +09:00
Xingbo Jiang 09a1ddba41 [SPARK-36500][CORE] Fix temp_shuffle file leaking when a task is interrupted
### What changes were proposed in this pull request?

When a task thread is interrupted, the underlying output stream referred by `DiskBlockObjectWriter.mcs` may have been closed, then we get IOException when flushing the buffered data. This breaks the assumption that `revertPartialWritesAndClose()` should not throw exceptions.

To fix the issue, we can catch the IOException in `ManualCloseOutputStream.manualClose()`.

### Why are the changes needed?

Previously the IOException was not captured, thus `revertPartialWritesAndClose()` threw an exception. When this happens, `BypassMergeSortShuffleWriter.stop()` would stop deleting the temp_shuffle files tracked by `partitionWriters`, hens lead to temp_shuffle file leak issues.

### Does this PR introduce _any_ user-facing change?

No, this is an internal bug fix.

### How was this patch tested?

Tested by running a longevity stress test. After the fix, there is no more leaked temp_shuffle files.

Closes #33731 from jiangxb1987/temp_shuffle.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ec5f3a17e3)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 19:25:27 +09:00
Gengliang Wang eaf92bea99 [SPARK-36499][SQL][TESTS] Test Interval multiply / divide null
### What changes were proposed in this pull request?

Test the following valid operations:
```
year-month interval * null
null * year-month interval
year-month interval / null
```
and invalid operations:
```
null / interval
int / interval
```

### Why are the changes needed?

Improve test coverage

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #33729 from gengliangwang/addTest.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit eb6be7f1ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-13 11:06:09 +03:00
gengjiaan eb840578f7 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly casted to DECIMAL(8,6) under ANSI mode.

```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```

We should update the function `make_timestamp` to allow integer type 'seconds' parameter.

### Why are the changes needed?
Make `make_timestamp` could accepts integer as 'seconds' parameter.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` could accepts integer as 'seconds' parameter.

### How was this patch tested?
New tests.

Closes #33665 from beliefer/SPARK-36428.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7d82336734)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 13:13:15 +08:00
Maryann Xue ca91292cf0 [SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.

The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.

A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results.

No.

Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs.

Closes #33671 from maryannxue/spark-36447.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 29b1e394c6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:48:46 +08:00
Gengliang Wang f7017694e8 [SPARK-36497][SQL] Support Interval add/subtract NULL
### What changes were proposed in this pull request?

Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is a unexpected behavior and the result should not be of null type.
This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well.
To change it, I propose to cast the null as the same data type of the other element in the add/subtract:
```
null + interval => cast(null as interval) + interval
null - interval => cast(null as interval) - interval
interval + null=> interval + cast(null as interval)
interval - null => interval - cast(null as interval)
```

### Why are the changes needed?

Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval`

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33727 from gengliangwang/intervalTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d4466d55ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:10:45 +08:00
Tim Armstrong 99a0085790 [SPARK-36501][ML] Fix random col names in LSHModel.approxSimilarityJoin
### What changes were proposed in this pull request?
Random.nextString() can include characters that are not valid in identifiers or likely to be buggy, e.g. non-printing characters, ".", "`". Instead use a utility that will always generate valid alphanumeric identifiers

### Why are the changes needed?
To deflake BucketedRandomProjectionLSHSuite and avoid similar failures that could be encountered by users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran org.apache.spark.ml.feature.BucketedRandomProjectionLSHSuite

Closes #33730 from timarmstrong/flaky-lsb.

Authored-by: Tim Armstrong <tim.armstrong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 886dbe01cd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 12:04:54 +09:00
Min Shen 32d67d1a93 [SPARK-36483][CORE][TESTS] Fix intermittent test failures at Netty 4.1.52+
### What changes were proposed in this pull request?

Fix an intermittent test failure due to Netty dependency version bump.
Starting from Netty 4.1.52, its AbstractChannel will throw a new `StacklessClosedChannelException` for channel closed exception.
A hardcoded list of Strings to match for channel closed exception in `RPCIntegrationSuite` was not updated, thus leading to the intermittent test failure reported in #33613

### Why are the changes needed?

Fix intermittent test failure

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #33713 from Victsm/SPARK-36378-followup.

Authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit b8e2186fe1)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-12 20:15:58 -05:00
Gengliang Wang 3785738838 [SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions
### What changes were proposed in this pull request?

1. Promote more string literal in subtractions. In the ANSI type coercion rule, we already promoted
```
string - timestamp => cast(string as timestamp) - timestamp
```
This PR is to promote the following string literals:
```
string - date => cast(string as date) - date
date - string => date - cast(date as string)
timestamp - string => timestamp
```
It is very straightforward to cast the string literal as the data type of the other side in the subtraction.

2. Merge the string promotion logic from the rule `StringLiteralCoercion`:
```
date_sub(date, string) => date_sub(date, cast(string as int))
date_add(date, string) => date_add(date, cast(string as int))
```

### Why are the changes needed?

1. Promote the string literal in the subtraction as the data type of the other side. This is straightforward and consistent with PostgreSQL
2. Certerize all the string literal promotion in the ANSI type coercion rule

### Does this PR introduce _any_ user-facing change?

No, the new ANSI type coercion rules are not released yet.

### How was this patch tested?

Existing UT

Closes #33724 from gengliangwang/datetimeTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 48e333af54)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-13 01:02:54 +08:00
Wenchen Fan 89cc547afd [SPARK-35881][SQL][FOLLOWUP] Remove the AQE post stage creation extension
### What changes were proposed in this pull request?

This is a followup of #33140

It turns out that we may be able to complete the AQE and columnar execution integration without the AQE post stage creation extension. The rule `ApplyColumnarRulesAndInsertTransitions` can add to-columnar transition if the shuffle/broadcast supports columnar.

### Why are the changes needed?

remove APIs that are not needed.

### Does this PR introduce _any_ user-facing change?

No, the APIs are not released yet.

### How was this patch tested?

existing and manual tests

Closes #33701 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 124d011ee7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 21:35:44 +08:00
IonutBoicuAms 3b3eb6f8ea [SPARK-36489][SQL] Aggregate functions over no grouping keys, on tables with a single bucket, return multiple rows
### What changes were proposed in this pull request?

This PR fixes a bug in `DisableUnnecessaryBucketedScan`.
When running any aggregate function, without any grouping keys, on a table with a single bucket, multiple rows are returned.
This happens because the aggregate function satisfies the `AllTuples` distribution, no `Exchange` will be planned, and the bucketed scan will be disabled.

### Why are the changes needed?

Bug fixing. Aggregates over no grouping keys should return a single row.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test in `DisableUnnecessaryBucketedScanSuite`.

Closes #33711 from IonutBoicuAms/fix-bug-disableunnecessarybucketedscan.

Authored-by: IonutBoicuAms <ionut.boicu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2b665751d9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 15:22:50 +08:00
Wenchen Fan 3e3c33d2c9 [SPARK-36479][SQL][TEST] Improve datetime test coverage in SQL files
### What changes were proposed in this pull request?

This PR adds more datetime tests in `date.sql` and `timestamp.sql`, especially for string promotion.

### Why are the changes needed?

improve test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33707 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 00a4364f38)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-12 12:52:03 +08:00
Jungtaek Lim 9df850df9e [SPARK-36480][SS] SessionWindowStateStoreSaveExec should not filter input rows against watermark
### What changes were proposed in this pull request?

This PR proposes to remove the filter applying to input rows against watermark in SessionWindowStateStoreSaveExec, since SessionWindowStateStoreSaveExec is expected to store all inputs into state store, and apply eviction later.

### Why are the changes needed?

The code is logically not right, though I can't reproduce the actual problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. I can't come up with broken case failing on previous code, but we can review the logic instead.

Closes #33708 from HeartSaVioR/SPARK-36480.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit fac4e5eb3e)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-11 20:11:07 -07:00
William Hyun 1a371fbfa1 [SPARK-36482][BUILD] Bump orc to 1.6.10
### What changes were proposed in this pull request?
This PR aims to bump ORC to 1.6.10

### Why are the changes needed?
This will bring the latest bug fixes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #33712 from williamhyun/orc.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aff1b5594a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-11 11:32:18 -07:00
Max Gekk 8dbcbebc36 [SPARK-36468][SQL][DOCS] Update docs about ANSI interval literals
### What changes were proposed in this pull request?
In the PR, I propose to update the doc page https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal, and describe formats of ANSI interval literals.

<img width="1032" alt="Screenshot 2021-08-11 at 10 31 36" src="https://user-images.githubusercontent.com/1580697/128988454-7a6ac435-409b-4961-9b79-ebecfb141d5e.png">
<img width="1030" alt="Screenshot 2021-08-10 at 20 58 04" src="https://user-images.githubusercontent.com/1580697/128912018-a4ea3ee5-f252-49c7-a90e-5beaf7ac868f.png">

### Why are the changes needed?
To improve UX with Spark SQL, and inform users about recently added ANSI interval literals.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked the generated docs:
```
$ SKIP_API=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 SKIP_SCALADOC=1 bundle exec jekyll build
```

Closes #33693 from MaxGekk/doc-ansi-interval-literals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit bbf988bd73)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-11 13:38:52 +03:00
Gengliang Wang 293a6cb1ab [SPARK-36445][SQL] ANSI type coercion rule for date time operations
### What changes were proposed in this pull request?

Implement a new rule for the date-time operations in the ANSI type coercion system:
1. Date will be converted to Timestamp when it is in the subtraction with Timestmap.
2. Promote string literals in date_add/date_sub/time_add

### Why are the changes needed?

Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system:
1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type.
2. For date_add/date_sub/time_add, string value can be implicit cast to date/timestamp only when it is literal.

Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion rules are not releaesd.

### How was this patch tested?

New UT

Closes #33666 from gengliangwang/datetimeOp.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 3029e62a82)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-11 11:56:00 +08:00
Jungtaek Lim 161908c10d [SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window
### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.

This PR also cleans up test code via deduplicating.

### Why are the changes needed?

The semantic of "update" mode for session window based streaming aggregation is quite unclear.

For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact grouping key won't be changed.

This doesn't hold true for session window based streaming aggregation, as session range is changing.

If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's high likely possible that existing row is not updated (overwritten) and ended up with having different rows.

If end users consider the key as grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored so it won't work with event time processing which there could be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa9f1)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-11 10:46:03 +09:00
Min Shen c6b683e5a2 [SPARK-36378][SHUFFLE] Switch to using RPCResponse to communicate common block push failures to the client
We have run performance evaluations on the version of push-based shuffle committed to upstream so far, and have identified a few places for further improvements:
1. On the server side, we have noticed that the usage of `String.format`, especially when receiving a block push request, has a much higher overhead compared with string concatenation.
2. On the server side, the usage of `Throwables.getStackTraceAsString` in the `ErrorHandler.shouldRetryError` and `ErrorHandler.shouldLogError` has generated quite some overhead.

These 2 issues are related to how we are currently handling certain common block push failures.
We are communicating such failures via `RPCFailure` by transmitting the exception stack trace.
This generates the overhead on both server and client side for creating these exceptions and makes checking the type of failures fragile and inefficient with string matching of exception stack trace.
To address these, this PR also proposes to encode the common block push failure as an error code and send that back to the client with a proper RPC message.

Improve shuffle service efficiency for push-based shuffle.
Improve code robustness for handling block push failures.

No

Existing unit tests.

Closes #33613 from Victsm/SPARK-36378.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 3f09093a21)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-10 16:54:21 -05:00
Kazuyuki Tanimura c230ca95e6 [SPARK-36464][CORE] Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
### What changes were proposed in this pull request?
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`.
That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream`

This PR proposes to change the underlying `_size` variable from `Int` to `Long` at the initialization

### Why are the changes needed?
Be cause the `size` method of `ChunkedByteBufferOutputStream` incorrectly returns a negative value when over 2GB data is written.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed existing tests
```
build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite"
```
Also added a new unit test
```
build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite – -z SPARK-36464"
```

Closes #33690 from kazuyukitanimura/SPARK-36464.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c888bad6a1)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-10 10:30:07 -07:00
gengjiaan 6018d44280 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throws exception too.
This PR fix the discussion below
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.

### How was this patch tested?
Tests updated.

Closes #33684 from beliefer/SPARK-36429-new.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 186815be1c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 22:52:39 +08:00
Angerszhuuuu fb6f3792af [SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields
### What changes were proposed in this pull request?
 Support TypeCoercion of ANSI intervals with different fields

### Why are the changes needed?
 Support TypeCoercion of ANSI intervals with different fields

### Does this PR introduce _any_ user-facing change?
After this pr user can
 - use comparison function with  different fields of DayTimeIntervalType/YearMonthIntervalType such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH`
 - support different field of ansi interval type in collection function such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)`
 - support different field of ansi interval type in `coalesce` etc..

### How was this patch tested?
Added UT

Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 89d8a4eacf)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-10 14:22:47 +03:00
Cheng Pan 45acd00dd6 [SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly
### What changes were proposed in this pull request?

Retain `spark.sql.catalog.*` confs when resolving view.

### Why are the changes needed?

Currently, if a view in default catalog ref a table in another catalog (e.g. jdbc), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet.

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add UT.

Closes #33692 from pan3793/SPARK-36466.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7f56b73cad)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 17:31:36 +08:00
Terry Kim 882ef6dd73 [SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above will command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1a5d94117)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 13:20:50 +08:00
Venkata krishnan Sowrirajan 1a432fe6bb [SPARK-36460][SHUFFLE] Pull out NoOpMergedShuffleFileManager inner class outside
### What changes were proposed in this pull request?

Pull out NoOpMergedShuffleFileManager inner class outside. This is required since passing dollar sign ($) for the config (`spark.shuffle.server.mergedShuffleFileManagerImpl`) value can be an issue. Currently `spark.shuffle.server.mergedShuffleFileManagerImpl` is by default set to `org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager`. After this change the default value be set to `org.apache.spark.network.shuffle.NoOpMergedShuffleFileManager`

### Why are the changes needed?

Passing `$` for the config value can be an issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified existing unit tests.

Closes #33688 from venkata91/SPARK-36460.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit df0de83c46)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-10 10:19:41 +08:00
Venkata krishnan Sowrirajan 38dd42a50c [SPARK-36332][SHUFFLE] Cleanup RemoteBlockPushResolver log messages
### What changes were proposed in this pull request?
Cleanup `RemoteBlockPushResolver` log messages by using `AppShufflePartitionInfo#toString()` to avoid duplications. Currently this is based off of https://github.com/apache/spark/pull/33034 will remove those changes once it is merged and remove the WIP at that time.

### Why are the changes needed?
Minor cleanup to make code more readable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No tests, just changing log messages

Closes #33561 from venkata91/SPARK-36332.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit ab897109a3)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-08-10 09:54:54 +08:00
Huaxin Gao 10f7f6e62b [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
### What changes were proposed in this pull request?
not push down partition filter to `ORCScan` for DSv2

### Why are the changes needed?
Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filter for parquet in DSv2.
https://github.com/apache/spark/pull/30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit b04330cd38)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-09 10:47:23 -07:00
Jungtaek Lim 9dc8d0c3ae [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
### What changes were proposed in this pull request?

This PR proposes to add a new example of complex sessionization, which leverages flatMapGroupsWithState.

### Why are the changes needed?

We have replaced an example of sessionization from flatMapGroupsWithState to native support of session window. Given there are still use cases on sessionization which native support of session window cannot cover, it would be nice if we can demonstrate such case. It will also be used as an example of flatMapGroupsWithState.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested. Example data is given in class doc.

Closes #33681 from HeartSaVioR/SPARK-36455.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 004b87d9a1)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-09 19:40:37 +09:00
Wenchen Fan 5bddafe3e0 [SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE
### What changes were proposed in this pull request?

This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172

Before #33172 , the target size is adaptively calculated based on the default parallelism of the spark cluster. Sometimes it's very small and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs.

After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable.

This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster.

### Why are the changes needed?

fix perf regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33655 from cloud-fan/minor.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9a539d5846)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:26:11 +08:00
Angerszhuuuu 2a46bf6a3b [SPARK-36271][SQL] Unify V1 insert check field name before prepare writter
### What changes were proposed in this pull request?
Unify DataSource V1 insert schema check field name before prepare writer.
And in this PR we add check for avro V1 insert too.

### Why are the changes needed?
Unify code and add check for avro V1 insert too.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33566 from AngersZhuuuu/SPARK-36271.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f3e079b09b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:18:20 +08:00
Angerszhuuuu a5ecf2a490 [SPARK-36352][SQL] Spark should check result plan's output schema name
### What changes were proposed in this pull request?
Spark should check result plan's output schema name

### Why are the changes needed?
In current code, some optimizer rule may change plan's output schema, since in the code we always use semantic equal to check output, but it may change the plan's output schema.
For example, for SchemaPruning, if we have a plan
```
Project[a, B]
|--Scan[A, b, c]
```
the origin output schema is `a, B`, after SchemaPruning. it become
```
Project[A, b]
|--Scan[A, b]
```
It change the plan's schema. when we use CTAS, the schema is same as query plan's output.
Then since we change the schema, it not consistent with origin SQL. So we need to check final result plan's schema with origin plan's schema

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existed UT

Closes #33583 from AngersZhuuuu/SPARK-36352.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e051a540a1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:48:10 +08:00
Wenchen Fan 94dc3c77c2 [SPARK-35881][SQL][FOLLOWUP] Add a boolean flag in AdaptiveSparkPlanExec to ask for columnar output
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/33140 to propose a simpler idea for integrating columnar execution into AQE.

Instead of making the `ColumnarToRowExec` and `RowToColumnarExec` dynamic to handle `AdaptiveSparkPlanExec`, it's simpler to let the consumer decide if it needs columnar output or not, and pass a boolean flag to `AdaptiveSparkPlanExec`.

For Spark vendors, they can set the flag differently in their custom columnar parquet writing command when the input plan is `AdaptiveSparkPlanExec`.

One argument is if we need to look at the final plan of AQE and consume the data differently (either row or columnar format). I can't think of a use case and I think we can always statically know if the AQE plan should output row or columnar data.

### Why are the changes needed?

code simplification.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

manual test

Closes #33624 from cloud-fan/aqe.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8714eefe6f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:34:10 +08:00
Sajith Ariyarathna f0844f70b1 [SPARK-36432][BUILD] Upgrade Jetty version to 9.4.43
### What changes were proposed in this pull request?
This PR upgrades Jetty version to `9.4.43.v20210629`.

### Why are the changes needed?
To address vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429 which affects Jetty `9.4.42.v20210604`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

Closes #33656 from this/upgrade-jetty-9.4.43.

Lead-authored-by: Sajith Ariyarathna <sajith.janaprasad@gmail.com>
Co-authored-by: Sajith Ariyarathna <this@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5a22f9ceaf)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 10:14:16 +09:00
Min Shen 552a332dd4 [SPARK-36423][SHUFFLE] Randomize order of blocks in a push request to improve block merge ratio for push-based shuffle
### What changes were proposed in this pull request?

On the client side, we are currently randomizing the order of push requests before processing each request. In addition we can further randomize the order of blocks within each push request before pushing them.
In our benchmark, this has resulted in a 60%-70% reduction of blocks that fail to be merged due to bock collision (the existing block merge ratio is already pretty good in general, and this further improves it).

### Why are the changes needed?

Improve block merge ratio for push-based shuffle

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Straightforward small change, no additional test needed.

Closes #33649 from Victsm/SPARK-36423.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 6e729515fd)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-06 09:48:31 -05:00
Yuto Akutsu a5d0eafa32 [SPARK-595][DOCS] Add local-cluster mode option in Documentation
### What changes were proposed in this pull request?

Add local-cluster mode option to submitting-applications.md

### Why are the changes needed?

Help users to find/use this option for unit tests.

### Does this PR introduce _any_ user-facing change?

Yes, docs changed.

### How was this patch tested?

`SKIP_API=1 bundle exec jekyll build`
<img width="460" alt="docchange" src="https://user-images.githubusercontent.com/87687356/127125380-6beb4601-7cf4-4876-b2c6-459454ce2a02.png">

Closes #33537 from yutoacts/SPARK-595.

Lead-authored-by: Yuto Akutsu <yuto.akutsu@jp.nttdata.com>
Co-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(cherry picked from commit 41b011e416)
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-08-06 09:27:24 -05:00
Kousuke Saruta 586eb5d4c6 Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported"
### What changes were proposed in this pull request?

This PR reverts the change in SPARK-36429 (#33654).
See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037).

### Why are the changes needed?

To recover CIs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33670 from sarutak/revert-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit e17612d0bf)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-06 20:56:40 +09:00