Commit graph

29397 commits

Author SHA1 Message Date
Dongjoon Hyun 2e31e2c5f3 [SPARK-34503][CORE] Use zstd for spark.eventLog.compression.codec by default
### What changes were proposed in this pull request?

Apache Spark 3.0 introduced `spark.eventLog.compression.codec` configuration.
For Apache Spark 3.2, this PR aims to set `zstd` as the default value for `spark.eventLog.compression.codec` configuration.
This only affects creating a new log file.

### Why are the changes needed?

The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.

**1. Save storage resources (and money)**

In general, ZSTD is much smaller than LZ4.
For example, in case of TPCDS (Scale 200) log, ZSTD generates about 3 times smaller log files than LZ4.

| CODEC | SIZE (bytes) |
|---------|-------------|
| LZ4         | 184001434|
| ZSTD      |  64522396|

And, the plain file is 17.6 times bigger.
```
-rw-r--r--    1 dongjoon  staff  1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r--    1 dongjoon  staff    64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
```

**2. Better Usability**

We cannot decompress Spark-generated LZ4 event log files via CLI while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient to some users who want to uncompress and access them.
```
$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
```
```
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes
```

**3. Speed**
The following results are collected by running [lzbench](https://github.com/inikep/lzbench) on the above Spark event log. Note that
- This is not a direct comparison of Spark compression/decompression codec.
- `lzbench` is an in-memory benchmark. So, it doesn't show the benefit of the reduced network traffic due to the small size of ZSTD.

Here,
- To get ZSTD 1.4.8-1 result, `lzbench` `master` branch is used because Spark is using ZSTD 1.4.8.
- To get LZ4 1.7.5 result, `lzbench` `v1.7` branch is used because Spark is using LZ4 1.7.1.
```
Compressor name      Compress. Decompress. Compr. size  Ratio Filename
memcpy               7393 MB/s  7166 MB/s  1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1        1344 MB/s  3351 MB/s    56665767   4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5            1385 MB/s  4782 MB/s   127662168  11.24 spark-a1843ead29834f46b1125a03eca32679
```

### Does this PR introduce _any_ user-facing change?

- No for the apps which doesn't use `spark.eventLog.compress` because `spark.eventLog.compress` is disabled by default.
- No for the apps using `spark.eventLog.compression.codec` explicitly because this is a change of the default value.
- Yes for the apps using `spark.eventLog.compress` without setting `spark.eventLog.compression.codec`. In this case, previously `spark.io.compression.codec` value was used whose default is `lz4`.

So this JIRA issue, SPARK-34503, is labeled with `releasenotes`.

### How was this patch tested?

Pass the updated UT.

Closes #31618 from dongjoon-hyun/SPARK-34503.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 16:37:29 -08:00
Max Gekk 7f27d33a3c [SPARK-31891][SQL] Support MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]
### What changes were proposed in this pull request?

In the PR, I propose to extend the `MSCK REPAIR TABLE` command, and support new options `{ADD|DROP|SYNC} PARTITIONS`. In particular:

1. Extend the logical node `RepairTable`, and add two new flags `enableAddPartitions` and `enableDropPartitions`.
2. Add similar flags to the v1 execution node `AlterTableRecoverPartitionsCommand`
3. Add new method `dropPartitions()` to `AlterTableRecoverPartitionsCommand` which drops partitions from the catalog if their locations in the file system don't exist.
4. Updated public docs about the `MSCK REPAIR TABLE` command:
<img width="1037" alt="Screenshot 2021-02-16 at 13 46 39" src="https://user-images.githubusercontent.com/1580697/108052607-7446d280-705d-11eb-8e25-7398254787a4.png">

Closes #31097

### Why are the changes needed?
- The changes allow to recover tables with removed partitions. The example below portraits the problem:
```sql
spark-sql> create table tbl2 (col int, part int) partitioned by (part);
spark-sql> insert into tbl2 partition (part=1) select 1;
spark-sql> insert into tbl2 partition (part=0) select 0;
spark-sql> show table extended like 'tbl2' partition (part = 0);
default	tbl2	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
...
```
Remove the partition (part = 0) from the filesystem:
```
$ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```
Even after recovering, we cannot query the table:
```sql
spark-sql> msck repair table tbl2;
spark-sql> select * from tbl2;
21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```

- To have feature parity with Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, we can query recovered table:
```sql
spark-sql> msck repair table tbl2 sync partitions;
spark-sql> select * from tbl2;
1	1
spark-sql> show partitions tbl2;
part=1
```

### How was this patch tested?
- By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *PlanResolutionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParallelSuite"
```
- Added unified v1 and v2 tests for `MSCK REPAIR TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
```

Closes #31499 from MaxGekk/repair-table-drop-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:45:15 -08:00
Wenchen Fan 95e45c6257 [SPARK-34168][SQL][FOLLOWUP] Improve DynamicPartitionPruningSuiteBase
### What changes were proposed in this pull request?

A few minor improvements for `DynamicPartitionPruningSuiteBase`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31625 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:41:24 -08:00
Wenchen Fan 0d5d248bdc [SPARK-34508][SQL][TEST] Skip HiveExternalCatalogVersionsSuite if network is down
### What changes were proposed in this pull request?

It's possible that the network is down when running Spark tests, and it's annoying to see `HiveExternalCatalogVersionsSuite` keep failing.

This PR proposes to skip this test suite if we can't get the latest Spark version from the Apache website.

### Why are the changes needed?

Make the Spark tests more robust.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31627 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:35:29 -08:00
Huaxin Gao 443139b601 [SPARK-34502][SQL] Remove unused parameters in join methods
### What changes were proposed in this pull request?

Remove unused parameters in `CoalesceBucketsInJoin`, `UnsafeCartesianRDD` and `ShuffledHashJoinExec`.

### Why are the changes needed?
Clean up

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #31617 from huaxingao/join-minor.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-23 12:18:43 -08:00
Wenchen Fan 429f8af9b6 Revert "[SPARK-34380][SQL] Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES for v2 command"
This reverts commit 9a566f83a0.
2021-02-24 02:38:22 +08:00
Max Gekk 8f994cbb4a [SPARK-34475][SQL] Rename logical nodes of v2 ALTER commands
### What changes were proposed in this pull request?
In the PR, I propose to rename logical nodes of v2 commands in the form: `<verb> + <object>` like:
- AlterTableAddPartition -> AddPartition
- AlterTableSetLocation -> SetTableLocation

### Why are the changes needed?
1. For simplicity and readability of logical plans
2. For consistency with other logical nodes. For example, the logical node `RenameTable` for `ALTER TABLE .. RENAME TO` was added before `AlterTableRenamePartition`.

### Does this PR introduce _any_ user-facing change?
Should not since this is non-public APIs.

### How was this patch tested?
1. Check scala style: `./dev/scalastyle`
2. Affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31596 from MaxGekk/rename-alter-table-logic-nodes.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-23 12:04:31 +00:00
Linhong Liu be675a052c [SPARK-34490][SQL] Analysis should fail if the view refers a dropped table
### What changes were proposed in this pull request?
When resolving a view, we use the captured view name in `AnalysisContext` to
distinguish whether a relation name is a view or a table. But if the resolution failed,
other rules (e.g. `ResolveTables`) will try to resolve the relation again but without
`AnalysisContext`. So, in this case, the resolution may be incorrect. For example,
if the view refers to a dropped table while a view with the same name exists, the
dropped table will be resolved as a view rather than an unresolved exception.

### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
newly added test cases

Closes #31606 from linhongliu-db/fix-temp-view-master.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-23 15:51:02 +08:00
Kousuke Saruta 612d52315b [SPARK-34500][DOCS][EXAMPLES] Replace symbol literals with $"" in examples and documents
### What changes were proposed in this pull request?

This PR replaces all the occurrences of symbol literals (`'name`) with string interpolation (`$"name"`) in examples and documents.

### Why are the changes needed?

Symbol literals are used to represent columns in Spark SQL but the Scala community seems to remove `Symbol` completely.
As we discussed in #31569, first we should replacing symbol literals with `$"name"` in user facing examples and documents.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build docs.

Closes #31615 from sarutak/replace-symbol-literals-in-doc-and-examples.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:22:02 +09:00
HyukjinKwon b5470ae294 [MINOR][DOCS] Replace http to https when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes:
- Change http to https for better security
- Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/)

### Why are the changes needed?

For better security, and to use official link.

### Does this PR introduce _any_ user-facing change?

Yes, It exposes more secure and correct links to the PySpark end users in PySpark documentation.

### How was this patch tested?

I manually checked if each link works

Closes #31616 from HyukjinKwon/minor-https.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:18:47 +09:00
Max Gekk 7df4fed420 [MINOR][SQL] Fix the comment for CalendarIntervalType about comparability
### What changes were proposed in this pull request?
In the PR, I propose to revert https://github.com/apache/spark/pull/26659 partially regarding to comparability of interval values. The comment became incorrect after https://github.com/apache/spark/pull/27262.

### Why are the changes needed?
The comment is incorrect, and it might confuse Spark's devs/users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By checking scala coding style `./dev/scalastyle`.

Closes #31610 from MaxGekk/doc-interval-not-comparable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 21:29:14 +08:00
Dongjoon Hyun 0bccf1664f [SPARK-34496][BUILD] Upgrade ZSTD-JNI to 1.4.8-5 for better API compatibility
### What changes were proposed in this pull request?

This PR aims to upgrade ZSTD-JNI to 1.4.8-5 for better API compatibility.

### Why are the changes needed?

Previously, we upgrade for ZSTD-JNI performance improvement.
And, `Apache Spark`/`Apache Parquet`/`Apache Avro` master branches are using JZSTD-JNI 1.4.8-x.

This PR aims to upgrade a minor version for a better API compatibility.
- def1860c6f
- 188c803044

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs

Closes #31609 from dongjoon-hyun/SPARK-34496.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-22 22:27:39 +09:00
Wenchen Fan 02c784ca68 [SPARK-34473][SQL] Avoid NPE in DataFrameReader.schema(StructType)
### What changes were proposed in this pull request?

This fixes a regression in `DataFrameReader.schema(StructType)`, to avoid NPE if the given `StructType` is null. Note that, passing null to Spark public APIs leads to undefined behavior. There is no document mentioning the null behavior, and it's just an accident that `DataFrameReader.schema(StructType)` worked before. So I think this is not a 3.1 blocker.

### Why are the changes needed?

It fixes a 3.1 regression

### Does this PR introduce _any_ user-facing change?

yea, now `df.read.schema(null: StructType)` is a noop as before, while in the current branch-3.1 it throws NPE.

### How was this patch tested?

It's undefined behavior and is very obvious, so I didn't add a test. We should add tests when we clearly define and fix the null behavior for all public APIs.

Closes #31593 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 21:11:21 +08:00
Karl-WangSK a6a82c8e69 [MINOR][DOCS] Add table_identifier in sql-migration-guide for SHOW CREATE TABLE
### What changes were proposed in this pull request?
Add `table_identifier` in sql-migration-guide for SHOW CREATE TABLE.

### Why are the changes needed?
To make document more readable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test suites.

Closes #31608 from Karl-WangSK/sqldoc.

Lead-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Co-authored-by: ShiKai Wang <wskqing@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-02-22 20:15:19 +08:00
kevincmchen 9767041153 [SPARK-34432][SQL][TESTS] Add JavaSimpleWritableDataSource
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/19269

In #19269 , there is only a scala implementation of simple writable data source in `DataSourceV2Suite`.

This PR adds a java implementation of it.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing testsuites

Closes #31560 from kevincmchen/SPARK-34432.

Lead-authored-by: kevincmchen <kevincmchen@tencent.com>
Co-authored-by: Kevin Pis <68981916+kevincmchen@users.noreply.github.com>
Co-authored-by: Kevin Pis <kc4163568@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 09:38:13 +00:00
Max Gekk 23a5996a46 [SPARK-34450][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME tests
### What changes were proposed in this pull request?
1. Move parser tests from `DDLParserSuite` to `AlterTableRenameParserSuite`.
2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.AlterTableRenameBase` and to `v1.AlterTableRenameSuite`.
3. Add a test for DSv2 `ALTER TABLE .. RENAME` to `v2.AlterTableRenameSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenameSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenameParserSuite"
```

Closes #31575 from MaxGekk/unify-rename-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 08:36:16 +00:00
Dongjoon Hyun 2fb5f21b1e [SPARK-34495][TESTS] Add DedicatedJVMTest test tag
### What changes were proposed in this pull request?

This PR aims to add a test tag, `DedicatedJVMTest`, and replace `SecurityTest` with this.

### Why are the changes needed?

To have a reusable general test tag.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31607 from dongjoon-hyun/SPARK-34495.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-22 16:00:48 +09:00
Raza Jafri 38fbe560fd [SPARK-34167][SQL] Reading parquet with IntDecimal written as a LongDecimal blows up
### What changes were proposed in this pull request?
If an IntDecimal type was written as a LongDecimal in a parquet file. Spark should read it as a long from `VectorizedValuesReader` but write it to the `WritableColumnVector` as an int by down-casting it and calling the appropriate method. `readLongs` has been modified to take in a boolean flag that tells it if the number would fit in a 32-bit Decimal and subsequently downsized.

### Why are the changes needed?
If a Parquet file writes an IntDecimal as LongDecimal, which is supported by the parquet spec, Spark will not be able to read it and will throw an exception.  The reason this happens is because method `readLong` tries to write the long to a `WritableColumnVector` which has been initialized to accept only Ints which leads to a `NullPointerException`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually tested and added unit-test

Closes #31284 from razajafri/decimal_fix.

Authored-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:48:56 +00:00
Max Gekk a22d20a6ca [SPARK-34468][SQL] Rename v2 table in place if new name has single part
### What changes were proposed in this pull request?
If new table name consists of single part (no namespaces), the v2 `ALTER TABLE .. RENAME TO` command renames the table while keeping it in the same namespace. For example:
```sql
ALTER TABLE catalog_name.ns1.ns2.ns3.ns4.ns5.tbl RENAME TO new_table
```
the command should rename the source table to `catalog_name.ns1.ns2.ns3.ns4.ns5.new_table`. Before the changes, the command moves the table to the "root" name space i.e. `catalog_name.new_table`.

### Why are the changes needed?
To have the same behavior as v1 implementation of `ALTER TABLE .. RENAME TO`, and other DBMSs.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new test:
```
$ build/sbt "sql/test:testOnly *DataSourceV2SQLSuite"
```

Closes #31594 from MaxGekk/rename-table-single-part.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:43:19 +00:00
Max Gekk 6ea4b5fda7 [SPARK-34401][SQL][DOCS] Update docs about altering cached tables/views
### What changes were proposed in this pull request?
Update public docs of SQL commands about altering cached tables/views. For instance:
<img width="869" alt="Screenshot 2021-02-08 at 15 11 48" src="https://user-images.githubusercontent.com/1580697/107217940-fd3b8980-6a1f-11eb-98b9-9b2e3fe7f4ef.png">

### Why are the changes needed?
To inform users about commands behavior in altering cached tables or views.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the command below and manually checking the docs:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31524 from MaxGekk/doc-cmd-caching.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:32:09 +00:00
Dongjoon Hyun 03f4cf5845 [SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider
### What changes were proposed in this pull request?

This is a retry of #31065 . Last time, the newly add test cases passed in Jenkins and individually, but it's reverted because they fail when `GitHub Action` runs with  `SERIAL_SBT_TESTS=1`.

In this PR, `SecurityTest` tag is used to isolate `KeyProvider`.

This PR aims to add a basis for columnar encryption test framework by add `OrcEncryptionSuite` and `FakeKeyProvider`.

Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe.

### Why are the changes needed?

Apache ORC 1.6 supports columnar encryption.

### Does this PR introduce _any_ user-facing change?

No. This is for a test case.

### How was this patch tested?

Pass the newly added test suite.

Closes #31603 from dongjoon-hyun/SPARK-34486-RETRY.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 15:05:29 -08:00
Yuming Wang 94f9617cb4 [SPARK-34129][SQL] Add table name to LogicalRelation.simpleString
### What changes were proposed in this pull request?

This pr add table name to `LogicalRelation.simpleString`.

### Why are the changes needed?

Make optimized logical plan more readable.

Before this pr:
```
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = class_id#160)) AND (i_category_id#18 = category_id#161)), Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
   :     +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
      +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
         +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
            :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
            :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS class_id#160, i_category_id#18 AS category_id#161], Statistics(sizeInBytes=2.73E+21 B)
            :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), Statistics(sizeInBytes=3.83E+21 B)
            :  :     :- Project [ss_sold_date_sk#51, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
            :  :     :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), Statistics(sizeInBytes=516.5 PiB)
            :  :     :     :- Project [ss_item_sk#30, ss_sold_date_sk#51], Statistics(sizeInBytes=61.1 GiB)
            :  :     :     :  +- Filter ((isnotnull(ss_item_sk#30) AND isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), Statistics(sizeInBytes=580.6 GiB)
            :  :     :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :     :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :     :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  :     :     :     +- Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51] parquet, Statistics(sizeInBytes=580.6 GiB)
            :  :     :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
            :  :     :        +- Filter (((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
            :  :     :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :  :     +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :        +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :           +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  +- Aggregate [i_brand_id#14, i_class_id#16, i_category_id#18], [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=1414.2 EiB)
            :     +- Project [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=1414.2 EiB)
            :        +- Join Inner, (cs_sold_date_sk#113 = d_date_sk#52), Statistics(sizeInBytes=1979.9 EiB)
            :           :- Project [cs_sold_date_sk#113, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=231.1 PiB)
            :           :  +- Join Inner, (cs_item_sk#94 = i_item_sk#7), Statistics(sizeInBytes=308.2 PiB)
            :           :     :- Project [cs_item_sk#94, cs_sold_date_sk#113], Statistics(sizeInBytes=36.2 GiB)
            :           :     :  +- Filter ((isnotnull(cs_item_sk#94) AND isnotnull(cs_sold_date_sk#113)) AND dynamicpruning#169 [cs_sold_date_sk#113]), Statistics(sizeInBytes=470.5 GiB)
            :           :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :           :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :           :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :           :     :     +- Relation[cs_sold_time_sk#80,cs_ship_date_sk#81,cs_bill_customer_sk#82,cs_bill_cdemo_sk#83,cs_bill_hdemo_sk#84,cs_bill_addr_sk#85,cs_ship_customer_sk#86,cs_ship_cdemo_sk#87,cs_ship_hdemo_sk#88,cs_ship_addr_sk#89,cs_call_center_sk#90,cs_catalog_page_sk#91,cs_ship_mode_sk#92,cs_warehouse_sk#93,cs_item_sk#94,cs_promo_sk#95,cs_order_number#96L,cs_quantity#97,cs_wholesale_cost#98,cs_list_price#99,cs_sales_price#100,cs_ext_discount_amt#101,cs_ext_sales_price#102,cs_ext_wholesale_cost#103,... 10 more fields] parquet, Statistics(sizeInBytes=470.5 GiB)
            :           :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
            :           :        +- Filter isnotnull(i_item_sk#7), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :              +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :                 +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            +- Aggregate [i_brand_id#14, i_class_id#16, i_category_id#18], [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=650.5 EiB)
               +- Project [i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=650.5 EiB)
                  +- Join Inner, (ws_sold_date_sk#147 = d_date_sk#52), Statistics(sizeInBytes=910.6 EiB)
                     :- Project [ws_sold_date_sk#147, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=106.3 PiB)
                     :  +- Join Inner, (ws_item_sk#116 = i_item_sk#7), Statistics(sizeInBytes=141.7 PiB)
                     :     :- Project [ws_item_sk#116, ws_sold_date_sk#147], Statistics(sizeInBytes=16.6 GiB)
                     :     :  +- Filter ((isnotnull(ws_item_sk#116) AND isnotnull(ws_sold_date_sk#147)) AND dynamicpruning#170 [ws_sold_date_sk#147]), Statistics(sizeInBytes=216.4 GiB)
                     :     :     :  +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                     :     :     :     +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                     :     :     :        +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
                     :     :     +- Relation[ws_sold_time_sk#114,ws_ship_date_sk#115,ws_item_sk#116,ws_bill_customer_sk#117,ws_bill_cdemo_sk#118,ws_bill_hdemo_sk#119,ws_bill_addr_sk#120,ws_ship_customer_sk#121,ws_ship_cdemo_sk#122,ws_ship_hdemo_sk#123,ws_ship_addr_sk#124,ws_web_page_sk#125,ws_web_site_sk#126,ws_ship_mode_sk#127,ws_warehouse_sk#128,ws_promo_sk#129,ws_order_number#130L,ws_quantity#131,ws_wholesale_cost#132,ws_list_price#133,ws_sales_price#134,ws_ext_discount_amt#135,ws_ext_sales_price#136,ws_ext_wholesale_cost#137,... 10 more fields] parquet, Statistics(sizeInBytes=216.4 GiB)
                     :     +- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
                     :        +- Filter isnotnull(i_item_sk#7), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     :           +- Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                        +- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                           +- Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
```

After this pr:
```
== Optimized Logical Plan ==
Project [i_item_sk#9 AS ss_item_sk#3], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#16 = brand_id#0) AND (i_class_id#18 = class_id#1)) AND (i_category_id#20 = category_id#2)), Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#16) AND isnotnull(i_class_id#18)) AND isnotnull(i_category_id#20)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
   :     +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#0, class_id#1, category_id#2], [brand_id#0, class_id#1, category_id#2], Statistics(sizeInBytes=2.73E+21 B)
      +- Aggregate [brand_id#0, class_id#1, category_id#2], [brand_id#0, class_id#1, category_id#2], Statistics(sizeInBytes=2.73E+21 B)
         +- Join LeftSemi, (((brand_id#0 <=> i_brand_id#16) AND (class_id#1 <=> i_class_id#18)) AND (category_id#2 <=> i_category_id#20)), Statistics(sizeInBytes=2.73E+21 B)
            :- Join LeftSemi, (((brand_id#0 <=> i_brand_id#16) AND (class_id#1 <=> i_class_id#18)) AND (category_id#2 <=> i_category_id#20)), Statistics(sizeInBytes=2.73E+21 B)
            :  :- Project [i_brand_id#16 AS brand_id#0, i_class_id#18 AS class_id#1, i_category_id#20 AS category_id#2], Statistics(sizeInBytes=2.73E+21 B)
            :  :  +- Join Inner, (ss_sold_date_sk#53 = d_date_sk#54), Statistics(sizeInBytes=3.83E+21 B)
            :  :     :- Project [ss_sold_date_sk#53, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=387.3 PiB)
            :  :     :  +- Join Inner, (ss_item_sk#32 = i_item_sk#9), Statistics(sizeInBytes=516.5 PiB)
            :  :     :     :- Project [ss_item_sk#32, ss_sold_date_sk#53], Statistics(sizeInBytes=61.1 GiB)
            :  :     :     :  +- Filter ((isnotnull(ss_item_sk#32) AND isnotnull(ss_sold_date_sk#53)) AND dynamicpruning#150 [ss_sold_date_sk#53]), Statistics(sizeInBytes=580.6 GiB)
            :  :     :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :     :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :     :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  :     :     :     +- Relation tpcds5t.store_sales[ss_sold_time_sk#31,ss_item_sk#32,ss_customer_sk#33,ss_cdemo_sk#34,ss_hdemo_sk#35,ss_addr_sk#36,ss_store_sk#37,ss_promo_sk#38,ss_ticket_number#39L,ss_quantity#40,ss_wholesale_cost#41,ss_list_price#42,ss_sales_price#43,ss_ext_discount_amt#44,ss_ext_sales_price#45,ss_ext_wholesale_cost#46,ss_ext_list_price#47,ss_ext_tax#48,ss_coupon_amt#49,ss_net_paid#50,ss_net_paid_inc_tax#51,ss_net_profit#52,ss_sold_date_sk#53] parquet, Statistics(sizeInBytes=580.6 GiB)
            :  :     :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
            :  :     :        +- Filter (((isnotnull(i_brand_id#16) AND isnotnull(i_class_id#18)) AND isnotnull(i_category_id#20)) AND isnotnull(i_item_sk#9)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
            :  :     :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :  :     +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :  :        +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :  :           +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :  +- Aggregate [i_brand_id#16, i_class_id#18, i_category_id#20], [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=1414.2 EiB)
            :     +- Project [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=1414.2 EiB)
            :        +- Join Inner, (cs_sold_date_sk#115 = d_date_sk#54), Statistics(sizeInBytes=1979.9 EiB)
            :           :- Project [cs_sold_date_sk#115, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=231.1 PiB)
            :           :  +- Join Inner, (cs_item_sk#96 = i_item_sk#9), Statistics(sizeInBytes=308.2 PiB)
            :           :     :- Project [cs_item_sk#96, cs_sold_date_sk#115], Statistics(sizeInBytes=36.2 GiB)
            :           :     :  +- Filter ((isnotnull(cs_item_sk#96) AND isnotnull(cs_sold_date_sk#115)) AND dynamicpruning#151 [cs_sold_date_sk#115]), Statistics(sizeInBytes=470.5 GiB)
            :           :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :           :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :           :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            :           :     :     +- Relation tpcds5t.catalog_sales[cs_sold_time_sk#82,cs_ship_date_sk#83,cs_bill_customer_sk#84,cs_bill_cdemo_sk#85,cs_bill_hdemo_sk#86,cs_bill_addr_sk#87,cs_ship_customer_sk#88,cs_ship_cdemo_sk#89,cs_ship_hdemo_sk#90,cs_ship_addr_sk#91,cs_call_center_sk#92,cs_catalog_page_sk#93,cs_ship_mode_sk#94,cs_warehouse_sk#95,cs_item_sk#96,cs_promo_sk#97,cs_order_number#98L,cs_quantity#99,cs_wholesale_cost#100,cs_list_price#101,cs_sales_price#102,cs_ext_discount_amt#103,cs_ext_sales_price#104,cs_ext_wholesale_cost#105,... 10 more fields] parquet, Statistics(sizeInBytes=470.5 GiB)
            :           :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
            :           :        +- Filter isnotnull(i_item_sk#9), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
            :           +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
            :              +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
            :                 +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
            +- Aggregate [i_brand_id#16, i_class_id#18, i_category_id#20], [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=650.5 EiB)
               +- Project [i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=650.5 EiB)
                  +- Join Inner, (ws_sold_date_sk#149 = d_date_sk#54), Statistics(sizeInBytes=910.6 EiB)
                     :- Project [ws_sold_date_sk#149, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=106.3 PiB)
                     :  +- Join Inner, (ws_item_sk#118 = i_item_sk#9), Statistics(sizeInBytes=141.7 PiB)
                     :     :- Project [ws_item_sk#118, ws_sold_date_sk#149], Statistics(sizeInBytes=16.6 GiB)
                     :     :  +- Filter ((isnotnull(ws_item_sk#118) AND isnotnull(ws_sold_date_sk#149)) AND dynamicpruning#152 [ws_sold_date_sk#149]), Statistics(sizeInBytes=216.4 GiB)
                     :     :     :  +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                     :     :     :     +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                     :     :     :        +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
                     :     :     +- Relation tpcds5t.web_sales[ws_sold_time_sk#116,ws_ship_date_sk#117,ws_item_sk#118,ws_bill_customer_sk#119,ws_bill_cdemo_sk#120,ws_bill_hdemo_sk#121,ws_bill_addr_sk#122,ws_ship_customer_sk#123,ws_ship_cdemo_sk#124,ws_ship_hdemo_sk#125,ws_ship_addr_sk#126,ws_web_page_sk#127,ws_web_site_sk#128,ws_ship_mode_sk#129,ws_warehouse_sk#130,ws_promo_sk#131,ws_order_number#132L,ws_quantity#133,ws_wholesale_cost#134,ws_list_price#135,ws_sales_price#136,ws_ext_discount_amt#137,ws_ext_sales_price#138,ws_ext_wholesale_cost#139,... 10 more fields] parquet, Statistics(sizeInBytes=216.4 GiB)
                     :     +- Project [i_item_sk#9, i_brand_id#16, i_class_id#18, i_category_id#20], Statistics(sizeInBytes=8.5 MiB, rowCount=3.72E+5)
                     :        +- Filter isnotnull(i_item_sk#9), Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     :           +- Relation tpcds5t.item[i_item_sk#9,i_item_id#10,i_rec_start_date#11,i_rec_end_date#12,i_item_desc#13,i_current_price#14,i_wholesale_cost#15,i_brand_id#16,i_brand#17,i_class_id#18,i_class#19,i_category_id#20,i_category#21,i_manufact_id#22,i_manufact#23,i_size#24,i_formulation#25,i_color#26,i_units#27,i_container#28,i_manager_id#29,i_product_name#30] parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
                     +- Project [d_date_sk#54], Statistics(sizeInBytes=8.6 KiB, rowCount=731)
                        +- Filter ((((d_year#60 >= 1999) AND (d_year#60 <= 2001)) AND isnotnull(d_year#60)) AND isnotnull(d_date_sk#54)), Statistics(sizeInBytes=175.6 KiB, rowCount=731)
                           +- Relation tpcds5t.date_dim[d_date_sk#54,d_date_id#55,d_date#56,d_month_seq#57,d_week_seq#58,d_quarter_seq#59,d_year#60,d_dow#61,d_moy#62,d_dom#63,d_qoy#64,d_fy_year#65,d_fy_quarter_seq#66,d_fy_week_seq#67,d_day_name#68,d_quarter_name#69,d_holiday#70,d_weekend#71,d_following_holiday#72,d_first_dom#73,d_last_dom#74,d_same_day_ly#75,d_same_day_lq#76,d_current_day#77,... 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31196 from wangyum/SPARK-34129.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-21 12:04:49 -06:00
Dongjoon Hyun 9942548c37 [SPARK-34487][K8S][TESTS] Use the runtime Hadoop version in K8s IT
### What changes were proposed in this pull request?

This PR aims to use the runtime Hadoop version in K8s integration test.

### Why are the changes needed?

SPARK-33212 upgrades Hadoop dependency from 3.2.0 to 3.2.2 and we will upgrade to 3.3.x+.
We had better use the runtime Hadoop version instead of having a static string.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the K8s IT.

This is tested locally like the following.
```
KubernetesSuite:
...
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
...
```

Closes #31604 from dongjoon-hyun/SPARK-34487.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 08:57:02 -08:00
Dongjoon Hyun 020e84e92f [SPARK-34486][K8S] Upgrade kubernetes-client to 4.13.2
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` library from 4.12.0 to 4.13.2 for Apache Spark 3.2.0.

### Why are the changes needed?

This will bring [K8s 1.19.1](https://github.com/fabric8io/kubernetes-client/pull/2541) models officially and the latest bug fixes.

- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.1
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.2

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the K8s IT and UT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 25 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #31602 from dongjoon-hyun/SPARK-34486.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:35:38 +09:00
yi.wu 546d2eb5d4 [SPARK-34384][CORE] Add missing docs for ResourceProfile APIs
### What changes were proposed in this pull request?

This PR adds missing docs for ResourceProfile related APIs. Besides, it includes a few minor changes on API:

* ResourceProfileBuilder.build -> ResourceProfileBuilder.builder()
* Provides java specific API `allSupportedExecutorResourcesJList`
* private `ResourceAllocator` since it was mistakenly exposed previously

### Why are the changes needed?

Add missing API docs

### Does this PR introduce _any_ user-facing change?

No, as Apache Spark 3.1 hasn't officially released.

### How was this patch tested?

Updated unit tests due to the signature change of `build()`.

Closes #31496 from Ngone51/resource-profile-api-cleanup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:29:44 +09:00
Max Gekk 04c3125dcf [SPARK-34360][SQL] Support truncation of v2 tables
### What changes were proposed in this pull request?
1. Add new interface `TruncatableTable` which represents tables that allow atomic truncation.
2. Implement new method in `InMemoryTable` and in `InMemoryPartitionTable`.

### Why are the changes needed?
To support `TRUNCATE TABLE` for v2 tables.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
Added new tests to `TableCatalogSuite` that check truncation of non-partitioned and partitioned tables:
```
$ build/sbt "test:testOnly *TableCatalogSuite"
```

Closes #31475 from MaxGekk/dsv2-truncate-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:50:38 +09:00
Kent Yao 1fac706db5 [SPARK-34373][SQL] HiveThriftServer2 startWithContext may hang with a race issue
### What changes were proposed in this pull request?

fix a race issue by interrupting the thread

### Why are the changes needed?

```
21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: No underlying server socket.
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126)
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238)
at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246)
at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227)
at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221)
```
when the TServer try to `serve` after `stop`, it hangs with the log above forever
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

passing ci

Closes #31479 from yaooqinn/SPARK-34373.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 17:37:12 +09:00
Gera Shegalov fadd0f5d9b [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator
This PR is a fix for the JLS 17.5.3 violation identified in
zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA.

### What changes were proposed in this pull request?
- Use a var field to hold the state of the collection accumulator

### Why are the changes needed?
AccumulatorV2 auto-registration of accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands incompletely initialized objects are published to heartbeat thread. This leads to sporadic exceptions knocking out executors which increases the cost of the jobs. We observe such failures on a regular basis https://github.com/NVIDIA/spark-rapids/issues/1522.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
- this is a concurrency bug that is almost impossible to reproduce as a quick unit test.
- By trial and error I crafted a command https://github.com/NVIDIA/spark-rapids/pull/1688 that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these Exceptions have not shown up after running overnight for 10+ hours
- existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`

Closes #31540 from gerashegalov/SPARK-20977.

Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-20 20:57:14 -06:00
Yuchen Huo 7de49a8fc0 [SPARK-34481][SQL] Refactor dataframe reader/writer optionsWithPath logic
### What changes were proposed in this pull request?

Extract optionsWithPath logic into their own function.

### Why are the changes needed?

Reduce the code duplication and improve modularity.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Just some refactoring. Existing tests.

Closes #31599 from yuchenhuo/SPARK-34481.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-20 17:57:43 -08:00
Kousuke Saruta 82b33a3041 [SPARK-34379][SQL] Map JDBC RowID to StringType rather than LongType
### What changes were proposed in this pull request?

This PR fix an issue that `java.sql.RowId` is mapped to `LongType` and prefer `StringType`.

In the current implementation, JDBC RowID type is mapped to `LongType` except for `OracleDialect`, but there is no guarantee to be able to convert RowID to long.
`java.sql.RowId` declares `toString` and the specification of `java.sql.RowId` says

> _all methods on the RowId interface must be fully implemented if the JDBC driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)

So, we should prefer StringType to LongType.

### Why are the changes needed?

This seems to be a potential bug.

### Does this PR introduce _any_ user-facing change?

Yes. RowID is mapped to StringType rather than LongType.

### How was this patch tested?

New test and  the existing test case `SPARK-32992: map Oracle's ROWID type to StringType` in `OracleIntegrationSuite` passes.

Closes #31491 from sarutak/rowid-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-20 23:45:56 +09:00
Sean Owen f78466dca6 [SPARK-7768][CORE][SQL] Open UserDefinedType as a Developer API
### What changes were proposed in this pull request?

UserDefinedType and UDTRegistration become public Developer APIs, not package-private to Spark.

### Why are the changes needed?

This proposes to simply open up the UserDefinedType class as a developer API. It was public in 1.x, but closed in 2.x for some possible redesign that does not seem to have happened.

Other libraries have managed to define UDTs anyway by inserting shims into the Spark namespace, and this evidently has worked OK. But package isolation in Java 9+ breaks this.

The logic here is mostly: this is de facto a stable API, so can at least be open to developers with the usual caveats about developer APIs.

Open questions:

- Is there in fact some important redesign that's needed before opening it? The comment to this effect is from 2016
- Is this all that needs to be opened up? Like PythonUserDefinedType?
- Should any of this be kept package-private?

This was first proposed in https://github.com/apache/spark/pull/16478 though it was a larger change, but, the other API issues it was fixing seem to have been addressed already (e.g. no need to return internal Spark types). It was never really reviewed.

My hunch is that there isn't much downside, and some upside, to just opening this as-is now.

### Does this PR introduce _any_ user-facing change?

UserDefinedType becomes visible to developers to subclass.

### How was this patch tested?

Existing tests; there is no change to the existing logic.

Closes #31461 from srowen/SPARK-7768.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-20 07:32:06 -06:00
Bo Zhang 489d32aa9b [SPARK-34471][SS][DOCS] Document Streaming Table APIs in Structured Streaming Programming Guide
### What changes were proposed in this pull request?

This change is to document the newly added streaming table APIs in Structured Streaming Programming Guide.

### Why are the changes needed?

This will help our users when they try to use the new APIs.

### Does this PR introduce _any_ user-facing change?
Yes. Users will see the changes in the programming guide.

### How was this patch tested?
Built the HTML page and verified.

Attached is a screenshot of the section added:
![Table APIs Section - Scala](https://user-images.githubusercontent.com/44179472/108581923-1ff86700-736b-11eb-8fcd-efa04ac936de.png)

Closes #31590 from bozhang2820/table-api-doc.

Lead-authored-by: Bo Zhang <bo.zhang@databricks.com>
Co-authored-by: Bo Zhang <bozhang2820@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-20 15:54:43 +09:00
yi.wu 4dc16f2d59 [SPARK-24818][CORE] Support delay scheduling for barrier execution
### What changes were proposed in this pull request?

This PR tries to support the (non-legacy) delay scheduling for the barrier execution.

The idea is, adding a pending launch tasks list(`barrierPendingLaunchTasks`) in the barrier `TaskSetManager`. And we don't really add those pending launch tasks to the running list and post task start event to the listeners and so on until all tasks in the barrier `TaskSetManager` has been added to `barrierPendingLaunchTasks` after a single round `resourceOffers()`. If there're only partial tasks that are able to launch after a single `rousourceOffers()` round, we'll revert all the assigned resources to those tasks which were added in `barrierPendingLaunchTasks` and clear `barrierPendingLaunchTasks` and wait for the next `resourceOffers()` round. The barrier `TaskSetManager` should be launched finally since we've ensured enough slots before the scheduling.

### Why are the changes needed?

Currently, with delay scheduling enabled for the barrier execution, the application can abort immediately when there're only partial tasks can be launched. This is really bad, especially when the application already completed many stages before the barrier stage. For example, the application may do some ETL jobs before the barrier job(for ML).

After this PR, this scenario  should no longer happen.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer face the `Fail resource offers for barrier stage...` error.

### How was this patch tested?

Added/updated unit tests.

Closes #30650 from Ngone51/barrier-delay-scheduling.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-02-19 16:04:44 -06:00
Dongjoon Hyun 484a83e73e [SPARK-34469][K8S] Ignore RegisterExecutor when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to make `KubernetesClusterSchedulerBackend` ignore `RegisterExecutor` message when `SparkContext` is stopped already.

### Why are the changes needed?

If `SparkDriver` is terminated, the executors will be removed by K8s automatically.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test case.

Closes #31587 from dongjoon-hyun/SPARK-34469.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-19 09:36:07 -08:00
Zhichao Zhang 96bcb4bbe4 [SPARK-34283][SQL] Combines all adjacent 'Union' operators into a single 'Union' when using 'Dataset.union.distinct.union.distinct'
### What changes were proposed in this pull request?

Handled 'Deduplicate(Keys, Union)' operation in rule 'CombineUnions' to combine adjacent 'Union' operators  into a single 'Union' if necessary when using 'Dataset.union.distinct.union.distinct'.
Currently only handle distinct-like 'Deduplicate', where the keys == output, for example:
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF1 = df1.unionByName(df2).dropDuplicates(Seq("b", "a", "c"))
      .unionByName(df3).dropDuplicates().unionByName(df4)
      .dropDuplicates("a")
```
In this case, **all Union operators will be combined**.
but,
```
val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((6, 2, 5)).toDF("a", "b", "c")
val df3 = Seq((2, 4, 3)).toDF("c", "a", "b")
val df4 = Seq((1, 4, 5)).toDF("b", "a", "c")
val unionDF = df1.unionByName(df2).dropDuplicates(Seq("a"))
      .unionByName(df3).dropDuplicates("c").unionByName(df4)
      .dropDuplicates("b")
```
In this case, **all unions will not be combined, because the Deduplicate.keys doesn't equal to Union.output**.

### Why are the changes needed?

When using 'Dataset.union.distinct.union.distinct', the operator is  'Deduplicate(Keys, Union)', but AstBuilder transform sql-style 'Union' to operator 'Distinct(Union)', the rule 'CombineUnions' in Optimizer only handle 'Distinct(Union)' operator but not Deduplicate(Keys, Union).
Please see the detailed  description in [SPARK-34283](https://issues.apache.org/jira/browse/SPARK-34283).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31404 from zzcclp/SPARK-34283.

Authored-by: Zhichao Zhang <441586683@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 15:19:13 +00:00
gengjiaan 06df1210d4 [SPARK-28123][SQL] String Functions: support btrim
### What changes were proposed in this pull request?
Spark support `trim`/`ltrim`/`rtrim` now. The function `btrim` is an alternate form of `TRIM(BOTH <chars> FROM <expr>)`.
`btrim` removes the longest string consisting only of specified characters from the start and end of a string.

The mainstream database support this feature show below:

**Postgresql**
https://www.postgresql.org/docs/11/functions-binarystring.html

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/BTRIM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____5

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_BTRIM.html

**Druid**
https://druid.apache.org/docs/latest/querying/sql.html#string-functions

**Greenplum**
http://docs.greenplum.org/6-8/ref_guide/function-summary.html

### Why are the changes needed?
btrim is very useful.

### Does this PR introduce _any_ user-facing change?
Yes. btrim is a new function

### How was this patch tested?
Jenkins test.

Closes #31390 from beliefer/SPARK-28123-support-btrim.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 13:28:49 +00:00
Peter Toth 27abb6ab56 [SPARK-34421][SQL] Resolve temporary functions and views in views with CTEs
### What changes were proposed in this pull request?
This PR:
- Fixes a bug that prevents analysis of:
  ```
  CREATE TEMPORARY VIEW temp_view AS WITH cte AS (SELECT temp_func(0)) SELECT * FROM cte;
  SELECT * FROM temp_view
  ```
  by throwing:
  ```
  Undefined function: 'temp_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
  ```
- and doesn't report analysis error when it should:
  ```
  CREATE TEMPORARY VIEW temp_view AS SELECT 0;
  CREATE VIEW view_on_temp_view AS WITH cte AS (SELECT * FROM temp_view) SELECT * FROM cte
  ```
  by properly collecting temporary objects from VIEW definitions with CTEs.

- Minor refactor to make the affected code more readable.

### Why are the changes needed?
To fix a bug introduced with https://github.com/apache/spark/pull/30567

### Does this PR introduce _any_ user-facing change?
Yes, the query works again.

### How was this patch tested?
Added new UT + existing ones.

Closes #31550 from peter-toth/SPARK-34421-temp-functions-in-views-with-cte.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 18:14:49 +08:00
Max Gekk b26e7b510b [SPARK-34314][SQL] Fix partitions schema inference
### What changes were proposed in this pull request?
Infer the partitions schema by:
1. interring the common type over all partition part values, and
2. casting those values to the common type

Before the changes:
1. Spark creates a literal with most appropriate type for concrete partition value i.e. `part0=-0` -> `Literal(0, IntegerType)`, `part0=abc` -> `Literal(UTF8String.fromString("abc"), StringType)`.
2. Finds the common type for all literals of a partition column. For the example above, it is `StringType`.
3. Casts those literal to the desired type:
  - `Cast(Literal(0, IntegerType), StringType)` -> `UTF8String.fromString("0")`
  - `Cast(Literal(UTF8String.fromString("abc", StringType), StringType)` -> `UTF8String.fromString("abc")`

In the example, we get a partition part value "0" which is different from the original one "-0". Spark shouldn't modify partition part values of the string type because it can influence on query results.

Closes #31423

### Why are the changes needed?
The changes fix the bug demonstrated by the example:
1. There are partitioned parquet files (file format doesn't matter):
```
/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d
├── _SUCCESS
├── part=-0
│   └── part-00001-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
└── part=AA
    └── part-00000-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet
```
placed to two partitions "AA" and **"-0"**.

2. When reading them w/o specified schema:
```
val df = spark.read.parquet(path)
df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- part: string (nullable = true)
```
the inferred type of the partition column `part` is the **string** type.
3. The expected values in the column `part` are "AA" and "-0" but we get:
```
df.show(false)
+---+----+
|id |part|
+---+----+
|0  |AA  |
|1  |0   |
+---+----+
```
So, Spark returns **"0"** instead of **"-0"**.

### Does this PR introduce _any_ user-facing change?
This PR can change query results.

### How was this patch tested?
By running new test and existing test suites:
```
$ build/sbt "test:testOnly *FileIndexSuite"
$ build/sbt "test:testOnly *ParquetV1PartitionDiscoverySuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #31549 from MaxGekk/fix-partition-file-index-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 08:36:13 +00:00
Max Gekk 4a9a1d42e7 [SPARK-34466][SQL][DOCS] Improve docs for ALTER TABLE .. RENAME TO
### What changes were proposed in this pull request?
Explicitly highlight that the table rename command cannot move a table between databases.

### Why are the changes needed?
To inform users about actual behavior of the table rename command.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```sql
spark-sql> CREATE DATABASE db1;
spark-sql> CREATE DATABASE db2;
spark-sql> CREATE TABLE db1.tbl1 (c0 INT);
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db2.tbl1;
Error in query: RENAME TABLE source and destination databases do not match: 'db1' != 'db2';
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db1.tbl2;
spark-sql> SHOW TABLES IN db1 LIKE '*';
db1	tbl2	false
```

Closes #31586 from MaxGekk/doc-rename-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 04:48:16 +00:00
yzjg 26548edfa2 [MINOR][SQL][DOCS] Fix the comments in the example at window function
### What changes were proposed in this pull request?

`functions.scala` window function has an comment error in the field name. The column should be `time` per `timestamp:TimestampType`.

### Why are the changes needed?

To deliver the correct documentation and examples.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing docs.

### How was this patch tested?

CI builds in this PR should test the documentation build.

Closes #31582 from yzjg/yzjg-patch-1.

Authored-by: yzjg <785246661@qq.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-19 10:45:21 +09:00
Max Gekk cad469d47a [SPARK-34465][SQL] Rename v2 alter table exec nodes
### What changes were proposed in this pull request?
Rename the following v2 exec nodes:
- AlterTableAddPartitionExec -> AddPartitionExec
- AlterTableRenamePartitionExec -> RenamePartitionExec
- AlterTableDropPartitionExec -> DropPartitionExec

### Why are the changes needed?
- To be consistent with v2 exec node added before: ALTER TABLE .. RENAME TO` -> RenameTableExec.
- For simplicity and readability of the execution plans.

### Does this PR introduce _any_ user-facing change?
Should not since this is internal API.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31584 from MaxGekk/rename-alter-table-exec-nodes.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 14:33:26 -08:00
Dongjoon Hyun 331c6fd4ef [SPARK-34467][BUILD] Upgrade Zstd-jni to 1.4.8-4
### What changes were proposed in this pull request?

This PR aims to upgrade Zstd-JNI library to 1.4.8-4 to bring JNI side optimization.
`ZStandardBenchmark` shows that there is no regression in terms of performance and show some improvements.

### Why are the changes needed?

https://github.com/luben/zstd-jni/commits/v1.4.8-4
- be9be47fae
- be51ebade1
- 44ff8b6f95

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31585 from dongjoon-hyun/SPARK-ZSTD-1.4.8-4.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 13:35:49 -08:00
Max Gekk 8f7ec4b28e [SPARK-34454][SQL] Mark legacy SQL configs as internal
### What changes were proposed in this pull request?
1. Make the following SQL configs as internal:
    - spark.sql.legacy.allowHashOnMapType
    - spark.sql.legacy.sessionInitWithConfigDefaults
2. Add a test to check that all SQL configs from the `legacy` namespace are marked as internal configs.

### Why are the changes needed?
Assuming that legacy SQL configs shouldn't be set by users in common cases. The purpose of such configs is to allow switching to old behavior in corner cases. So, the configs should be marked as internals.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *SQLConfSuite"
```

Closes #31577 from MaxGekk/mark-legacy-configs-as-internal.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-18 10:39:51 -08:00
Chao Sun 27873280ff [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
### What changes were proposed in this pull request?

Currently in `SpecificParquetRecordReaderBase` we use deprecated APIs in a few places from Parquet, such as `readFooter`, `ParquetInputSplit`, `new ParquetFileReader`, `filterRowGroups`, etc. This replaces these with the newer APIs. In specific this:
- Replaces `ParquetInputSplit` with `FileSplit`. We never use specific things in the former such as `rowGroupOffsets` so the swap is pretty simple.
- Removes `readFooter` calls by using `ParquetFileReader.open`
- Replace deprecated `ParquetFileReader` ctor with the newer API which takes `ParquetReadOptions`.
- Removes the unnecessary handling of case when `rowGroupOffsets` is not null. It seems this never happens.

### Why are the changes needed?

The aforementioned APIs were deprecated and is going to be removed at some point in future. This is to ensure better supportability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a cleanup and relies on existing tests on the relevant code paths.

Closes #29542 from sunchao/SPARK-32703.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-18 10:18:14 -06:00
Steve Loughran ff5115c3ac [SPARK-33739][SQL] Jobs committed through the S3A Magic committer don't track bytes
BasicWriteStatsTracker to probe for a custom Xattr if the size of
the generated file is 0 bytes; if found and parseable use that as
the declared length of the output.

The matching Hadoop patch in HADOOP-17414:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
  the data in the marker file.

As a result, spark job tracking will correctly report the amount of data uploaded
and yet to materialize.

### Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro` , adding a zero byte marker file at
the end and a json file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.

But: the write tracker statictics don't show progress as they measure the length of the
created file, find the marker file and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing that if
retrieved, the real progress can be reported.

There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.

If the FS being probed doesn't support the XAttr API, the header is missing
or the value not a positive long then the size of 0 is returned.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
  most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms

All of these return Some(0) as file length.

The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has done the full hadoop+spark end
to end operation, with manual review of logs to verify that the
data was successfully collected from the attribute.

Closes #30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-02-18 08:43:18 -06:00
gengjiaan edccf96cad [SPARK-34394][SQL] Unify output of SHOW FUNCTIONS and pass output attributes properly
### What changes were proposed in this pull request?
The current implement of some DDL not unify the output and not pass the output properly to physical command.
Such as: The output attributes of `ShowFunctions` does't pass to `ShowFunctionsCommand` properly.

As the query plan, this PR pass the output attributes from `ShowFunctions` to `ShowFunctionsCommand`.

### Why are the changes needed?
This PR pass the output attributes could keep the expr ID unchanged, so that avoid bugs when we apply more operators above the command output dataframe.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31519 from beliefer/SPARK-34394.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:50:50 +00:00
gengjiaan c925e4d0fd [SPARK-34393][SQL] Unify output of SHOW VIEWS and pass output attributes properly
### What changes were proposed in this pull request?
The current implement of some DDL not unify the output and not pass the output properly to physical command.
Such as: The output attributes of `ShowViews` does't pass to `ShowViewsCommand` properly.

As the query plan, this PR pass the output attributes from `ShowViews` to `ShowViewsCommand`.

### Why are the changes needed?
This PR pass the output attributes could keep the expr ID unchanged, so that avoid bugs when we apply more operators above the command output dataframe.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #31508 from beliefer/SPARK-34393.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-18 12:48:39 +00:00
Kousuke Saruta 5167228172 [SPARK-34449][BUILD] Upgrade Jetty to fix CVE-2020-27218
### What changes were proposed in this pull request?

This PR upgrades Jetty from `9.4.34` to `9.4.36`.

### Why are the changes needed?

CVE-2020-27218 affects currently used Jetty 9.4.34.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27218

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified existing test and new test which comply with the new version of Jetty.

Closes #31574 from sarutak/upgrade-jetty-9.4.36.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 18:02:34 +09:00
Max Gekk b58f0976a9 [SPARK-34437][SQL][DOCS] Update Spark SQL guide about the rebasing DS options and SQL configs
### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529:
- datetimeRebaseMode

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 17:48:50 +09:00
Max Gekk 7b549c3e53 [SPARK-34455][SQL] Deprecate spark.sql.legacy.replaceDatabricksSparkAvro.enabled
### What changes were proposed in this pull request?
1. Put the SQL config `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` to the list of deprecated configs `deprecatedSQLConfigs`
2. Update docs for the Avro datasource
<img width="982" alt="Screenshot 2021-02-17 at 21 04 26" src="https://user-images.githubusercontent.com/1580697/108249890-abed7180-7166-11eb-8cb7-0c246d2a34fc.png">

### Why are the changes needed?
The config exists for enough time. We can deprecate it, and recommend users to use `.format("avro")` instead.

### Does this PR introduce _any_ user-facing change?
Should not except of the warning with the recommendation to use the `avro` format.

### How was this patch tested?
1. By generating docs via:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
2. Manually checking the warning:
```
scala> spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", false)
21/02/17 21:20:18 WARN SQLConf: The SQL config 'spark.sql.legacy.replaceDatabricksSparkAvro.enabled' has been deprecated in Spark v3.2 and may be removed in the future. Use `.format("avro")` in `DataFrameWriter` or `DataFrameReader` instead.
```

Closes #31578 from MaxGekk/deprecate-replaceDatabricksSparkAvro.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 21:54:20 -08:00