Commit graph

10610 commits

Author SHA1 Message Date
Kent Yao d1177b5230 [SPARK-34192][SQL] Move char padding to write side and remove length check on read side too
### What changes were proposed in this pull request?

On the read-side, the char length check and padding bring issues to CBO and predicate pushdown and other issues to the catalyst.

This PR reverts 6da5cdf1db  that added read side length check) so that we only do length check for the write side, and data sources/vendors are responsible to enforce the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read-side if the data is already dirty.

This PR also moves the char padding to the write-side, so that it 1) avoids read side issues like CBO and filter pushdown. 2) the data source can preserve char type semantic better even if it's read by systems other than Spark.

### Why are the changes needed?

fix perf regression when tables have char/varchar type columns

closes #31278
### Does this PR introduce _any_ user-facing change?

yes, spark will not raise error for oversized char/varchar values in read side
### How was this patch tested?

modified ut

the dropped read side benchmark
```
================================================================================================
Char Varchar Read Side Perf w/o Tailing Spaces
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1564           1573           9         63.9          15.6       1.0X
read char with length 20                           1532           1551          18         65.3          15.3       1.0X
read varchar with length 20                        1520           1531          13         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1573           1613          41         63.6          15.7       1.0X
read char with length 40                           1575           1577           2         63.5          15.7       1.0X
read varchar with length 40                        1568           1576          11         63.8          15.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1526           1540          23         65.5          15.3       1.0X
read char with length 60                           1514           1539          23         66.0          15.1       1.0X
read varchar with length 60                        1486           1497          10         67.3          14.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1531           1542          19         65.3          15.3       1.0X
read char with length 80                           1514           1529          15         66.0          15.1       1.0X
read varchar with length 80                        1524           1565          42         65.6          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1597           1623          25         62.6          16.0       1.0X
read char with length 100                          1499           1512          16         66.7          15.0       1.1X
read varchar with length 100                       1517           1524           8         65.9          15.2       1.1X

================================================================================================
Char Varchar Read Side Perf w/ Tailing Spaces
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1524           1526           1         65.6          15.2       1.0X
read char with length 20                           1532           1537           9         65.3          15.3       1.0X
read varchar with length 20                        1520           1532          15         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1556           1580          32         64.3          15.6       1.0X
read char with length 40                           1600           1611          17         62.5          16.0       1.0X
read varchar with length 40                        1648           1716          88         60.7          16.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1504           1524          20         66.5          15.0       1.0X
read char with length 60                           1509           1512           3         66.2          15.1       1.0X
read varchar with length 60                        1519           1535          21         65.8          15.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1640           1652          17         61.0          16.4       1.0X
read char with length 80                           1625           1666          35         61.5          16.3       1.0X
read varchar with length 80                        1590           1605          13         62.9          15.9       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1622           1628           5         61.6          16.2       1.0X
read char with length 100                          1614           1646          30         62.0          16.1       1.0X
read varchar with length 100                       1594           1606          11         62.7          15.9       1.0X
```

Closes #31281 from yaooqinn/SPARK-34192.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 02:08:35 +08:00
Max Gekk bfc0235013 [SPARK-34203][SQL] Convert null partition values to __HIVE_DEFAULT_PARTITION__ in v1 In-Memory catalog
### What changes were proposed in this pull request?
In the PR, I propose to convert `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in file system but as `null` values in memory that could cause some issues like in SPARK-34203.

### Why are the changes needed?
`InMemoryCatalog` stores partitions in the file system in the Hive compatible form, for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"` but at the same time it keeps null as is internally. That causes an issue demonstrated by the example below:
```
$ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory
```
```scala
scala> spark.conf.get("spark.sql.catalogImplementation")
res0: String = in-memory

scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default':
Map(p1 -> null)
  at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in `In-Memory` catalog:
```scala
scala> spark.table("tbl").show(false)
+----+----+
|col1|p1  |
+----+----+
|0   |null|
+----+----+

scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.table("tbl").show(false)
+----+---+
|col1|p1 |
+----+---+
+----+---+
```

### How was this patch tested?
Added new test to `AlterTableDropPartitionSuiteBase`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31322 from MaxGekk/insert-overwrite-null-part.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 15:27:20 +00:00
Dereck Li 096b15fa12 [SPARK-34607][SQL][FOLLOWUP] Change Option[LogicalRelation] to LogicalRelation
### What changes were proposed in this pull request?
optimize: change Option[LogicalRelation] to LogicalRelation

### Why are the changes needed?
simplify code

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existed unit test.

Closes #31315 from monkeyboy123/spark-34067-follow-up.

Authored-by: Dereck Li <monkeyboy.ljh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 13:34:07 +00:00
Max Gekk 6fe5a8a2ae [SPARK-34197][SQL] SessionCatalog.refreshTable() should not invalidate the relation cache for temporary views
### What changes were proposed in this pull request?
Check the name passed to `SessionCatalog.refreshTable`, and if it belongs to a temporary view, do not invalidate the relation cache.

### Why are the changes needed?
When `SessionCatalog.refreshTable` refreshes a temporary or global temporary view, it should not invalidate an entry in the relation cache associated to a table with the same name.

### Does this PR introduce _any_ user-facing change?
Should not. The change might improve performance slightly.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```

Closes #31265 from MaxGekk/fix-session-catalog-refresh-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 07:37:24 +00:00
yliou 512cacf7c6 [SPARK-33726][SQL] Fix for Duplicate field names during Aggregation
### What changes were proposed in this pull request?
The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`).

Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.

When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:

`with T as (select id as a, -id as x from range(3)),
        U as (select id as b, cast(id as string) as x from range(3))
select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x`
where the 'x' column in the left side of the join is a Long but on the right side is a String.

### Why are the changes needed?
Fixes the issue where duplicate field name aggregation has null values in the dataframe.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT, tested manually on spark shell.

Closes #30788 from yliou/SPARK-33726.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-25 06:53:26 +00:00
Peter Toth 98ec6c27e3 [SPARK-34147][SQL][TEST] Keep table partitioning in TPCDSQueryBenchmak when CBO is enabled
### What changes were proposed in this pull request?
This PR keeps partitioning of input tables in TPCDSQueryBenchmark when `--cbo` option is enabled.

https://github.com/apache/spark/pull/31011 introduced the `--cbo` option but unfortunately in that mode the table partitioning of the input data is lost. This means that the results of CBO mode is very different to non CBO mode, one example is that Dynamic Partition Pruning doesn't kick in in CBO mode.

### Why are the changes needed?
To monitor performance changed with CBO enabled.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually checked.

Closes #31218 from peter-toth/SPARK-34147-keep-partitioning-in-tpcdsquerybenchmark.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:42:47 +09:00
Yuanjian Li 59cbacaddf [SPARK-34185][DOCS] Review and fix issues in API docs
### What changes were proposed in this pull request?
Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?
Fix the issues in the Spark 3.1.1 release API docs.

### Does this PR introduce _any_ user-facing change?
Yes, API doc changes.

### How was this patch tested?
Manually test.

Closes #31271 from xuanyuanking/SPARK-34185.

Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:38:20 +09:00
Takuya UESHIN 43fdd1271e [SPARK-33489][PYSPARK] Add NullType support for Arrow executions
### What changes were proposed in this pull request?

Adds `NullType` support for Arrow executions.

### Why are the changes needed?

As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled.

### Does this PR introduce _any_ user-facing change?

Yes, if a user has a DataFrame including `NullType`, it will be able to convert with Arrow enabled.

### How was this patch tested?

Added tests.

Closes #31285 from ueshin/issues/SPARK-33489/arrow_nulltype.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:34:47 +09:00
Yuming Wang 4fce05d93f [SPARK-34155][SQL][TEST] Add partition columns for TPCDS tables
### What changes were proposed in this pull request?

This pr add partition columns for TPCDS tables. The partition column is consistent with the [TPCDSTables](https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala).

### Why are the changes needed?

Better track plan changes. For example, [this is the change](3fe1a93a40) after SPARK-34119.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31243 from wangyum/SPARK-34155.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-24 13:08:26 +09:00
Max Gekk f8bf72ed5d [SPARK-34213][SQL] Refresh cached data of v1 table in LOAD DATA
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` just refreshes metadata comparing to `CatalogImpl.refreshTable()` which refreshes cached table data as well.

### Why are the changes needed?
The example below portraits the issue:

- Create a source table:
```sql
spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0);
default	src_tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0
...
```
- Load data from the source table to a cached destination table:
```sql
spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part);
spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1;
spark-sql> CACHE TABLE dst_tbl;
spark-sql> SELECT * FROM dst_tbl;
1	1
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
1	1
```
The last query does not return new loaded data.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works correctly:
```sql
spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0);
spark-sql> SELECT * FROM dst_tbl;
0	0
1	1
```

### How was this patch tested?
Added new test to `org.apache.spark.sql.hive.CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #31304 from MaxGekk/load-data-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 15:49:10 -08:00
Max Gekk 0592503669 [SPARK-34207][SQL] Rename isTemporaryTable to isTempView in SessionCatalog
### What changes were proposed in this pull request?
Rename `SessionCatalog.isTemporaryTable()` to `SessionCatalog.isTempView()`.

### Why are the changes needed?
To improve code maintenance. Currently, there are two methods that do the same but have different names:
```scala
def isTempView(nameParts: Seq[String]): Boolean
```
and
```scala
def isTemporaryTable(name: TableIdentifier): Boolean
```

### Does this PR introduce _any_ user-facing change?
Should not since `SessionCatalog` is not public API.

### How was this patch tested?
By running the existing tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SessionCatalogSuite"
```

Closes #31295 from MaxGekk/replace-isTemporaryTable-by-isTempView.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 08:16:11 -08:00
yangjie01 e48a8ad1a2 [SPARK-34202][SQL][TEST] Add ability to fetch spark release package from internal environment in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?
`HiveExternalCatalogVersionsSuite` can't run in orgs internal environment where access to outside internet is not allowed because `HiveExternalCatalogVersionsSuite` will download spark release package from internet.

Similar to SPARK-32998, this pr add 1 environment variables `SPARK_RELEASE_MIRROR` to let user can specify an accessible download address of spark release package and run `HiveExternalCatalogVersionsSuite`  in orgs internal environment.

### Why are the changes needed?
Let `HiveExternalCatalogVersionsSuite` can run in orgs internal environment without relying on external spark release download address.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manual test  with and without env variables set in internal environment can't access internet.

execute
```
mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl  sql/hive -am -DskipTests

mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl  sql/hive -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest=none
```

**Without env**

```
HiveExternalCatalogVersionsSuite:
19:50:35.123 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
19:50:35.126 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://dist.apache.org/repos/dist/release/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed)
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
  Exception encountered when invoking run on a nested suite - Unable to download Spark 3.0.1 (HiveExternalCatalogVersionsSuite.scala:125)
Run completed in 2 seconds, 669 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
```

**With env**

```
export SPARK_RELEASE_MIRROR=${spark-release.internal.com}/dist/release/
```

```
HiveExternalCatalogVersionsSuite
- backward compatibility
Run completed in 1 minute, 32 seconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #31294 from LuciferYang/SPARK-34202.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-23 08:02:52 -08:00
Wenchen Fan b8a6906627 [SPARK-34200][SQL] Ambiguous column reference should consider attribute availability
### What changes were proposed in this pull request?

This is a long-standing bug that exists since we have the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is
```
Join(b#1 = 3)
  TableScan(t, [a#0, b#1])
  Project(a#2)
    TableScan(t, [a#2, b#3])
```
It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`.

### Why are the changes needed?

to not fail valid self-join queries.

### Does this PR introduce _any_ user-facing change?

yea as a bug fix

### How was this patch tested?

a new test

Closes #31287 from cloud-fan/self-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-22 20:11:53 +09:00
Yu Zhong 2db0a954e3 [SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE
### What changes were proposed in this pull request?
This PR is the same as https://github.com/apache/spark/pull/30998, but with a better UT.
In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryState precede others.
This partial fix only grantee the start of materialization for BroadcastQueryStage is prior to others, but because the submission of collect job for broadcasting is run in another thread, the issue is not completely solved.

### Why are the changes needed?
When enable AQE, in getFinalPhysicalPlan, spark traversal the physical plan bottom up and create query stage for materialized part by createQueryStages and materialize those new created query stages to submit map stages or broadcasting. When ShuffleQueryStage are materializing before BroadcastQueryStage, the map stage(job) and broadcast job are submitted almost at the same time, but map stage will hold all the computing resources. If the map stage runs slow (when lots of data needs to process and the resource is limited), the broadcast job cannot be started(and finished) before spark.sql.broadcastTimeout, thus cause whole job failed (introduced in SPARK-31475).
The workaround to increase spark.sql.broadcastTimeout doesn't make sense and graceful, because the data to broadcast is very small.

The order of calling materialize can guarantee that the order of task to be scheduled in normal circumstances, but, the guarantee is not strict since the submit of broadcast job and shuffle map job are in different thread.
1. for broadcast job, call doPrepare() in main thread, and then start the real materialization in "broadcast-exchange-0" thread pool: calling getByteArrayRdd().collect() to submit collect job
2. for shuffle map job, call ShuffleExchangeExec.mapOutputStatisticsFuture() which call sparkContext.submitMapStage() directly in main thread to submit map stage

1 is trigger before 2, so in normal cases, the broadcast job will be submit first.
However, we can not control how fast the two thread runs, so the "broadcast-exchange-0" thread could run a little bit slower than main thread, result in map stage submit first. So there's still risk for the shuffle map job schedule earlier before broadcast job.

Since completely fix the issue is complex and might introduce major changes, we need more time to follow up. This partial fix is better than do nothing, it resolved most cases in SPARK-33933.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Add UT

Closes #31269 from zhongyu09/aqe-broadcast-partial-fix.

Authored-by: Yu Zhong <zhongyu8@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 07:22:53 +00:00
beliefer fec82c9504 [SPARK-33245][SQL] Add built-in UDF - GETBIT
### What changes were proposed in this pull request?
`GETBIT` is a bitwise expression function given an INTEGER value, returns the value of a bit at a specified position.
`GETBIT( <integer_expr>, <bit_position> )`

Examples
select getbit(11, 100), getbit(11, 3), getbit(11, 2), getbit(11, 1), getbit(11, 0);
GETBIT(11, 3) | GETBIT(11, 2) | GETBIT(11, 1) | GETBIT(11, 0)
-- | -- | -- | --
1 | 0 | 1 | 1

The mainstream database support this feature show below:

**Teradata**
https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w

**Impala**
https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/getbit.html

**Yellowbrick**
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html

### Why are the changes needed?
GETBIT is very useful.

### Does this PR introduce _any_ user-facing change?
Yes. GETBIT is a new bitwise function.

### How was this patch tested?
Jenkins test

Closes #31198 from beliefer/SPARK-33245.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:57:39 +00:00
beliefer cde697a479 [SPARK-33541][SQL] Group exception messages in catalyst/expressions
### What changes were proposed in this pull request?
This PR group exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31228 from beliefer/SPARK-33541.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:52:05 +00:00
Kousuke Saruta 45f076336b [SPARK-33813][SQL] Fix the issue that JDBC source can't treat MS SQL Server's spatial types
### What changes were proposed in this pull request?

This PR fixes the issue that reading tables which contain spatial datatypes from MS SQL Server fails.
MS SQL server supports two non-standard spatial JDBC types, `geometry` and  `geography` but Spark SQL can't treat them

```
java.sql.SQLException: Unrecognized SQL type -157
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
 at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
```

Considering the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.

### Why are the changes needed?

To provide better support.

### Does this PR introduce _any_ user-facing change?

Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.

### How was this patch tested?

New test case added to `MsSqlServerIntegrationSuite`.

Closes #31283 from sarutak/mssql-spatial-types.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-22 04:28:22 +00:00
Kousuke Saruta 842902154a [SPARK-34180][SQL] Fix the regression brought by SPARK-33888 for PostgresDialect
### What changes were proposed in this pull request?

This PR fixes the regression bug brought by SPARK-33888 (#30902).
After that PR merged, `PostgresDIalect#getCatalystType` throws Exception for array types.
```
[info] - Type mapping for various types *** FAILED *** (551 milliseconds)
[info]   java.util.NoSuchElementException: key not found: scale
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:106)
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:104)
[info]   at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
[info]   at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
[info]   at org.apache.spark.sql.jdbc.PostgresDialect$.getCatalystType(PostgresDialect.scala:43)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
```

### Why are the changes needed?

To fix the regression bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the test case `SPARK-22291: Conversion error when transforming array types of uuid, inet and cidr to StingType in PostgreSQL` in `PostgresIntegrationSuite` passed.

I also confirmed whether all the `v2.*IntegrationSuite` pass because this PR changed them and they passed.

Closes #31262 from sarutak/fix-postgres-dialect-regression.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-22 13:03:02 +09:00
Kousuke Saruta 116f4cab6b [SPARK-34094][SQL] Extends StringTranslate to support unicode characters whose code point >= U+10000
### What changes were proposed in this pull request?

This PR extends `StringTranslate` to support unicode characters whose code point >= `U+10000`.

### Why are the changes needed?

To make it work with wide variety of characters.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use `StringTranslate` with unicode characters whose code point >= `U+10000`.

### How was this patch tested?

New assertion added to the existing test.

Closes #31164 from sarutak/extends-translate.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-21 08:15:55 -06:00
Max Gekk e79c1cde1b [SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables
### What changes were proposed in this pull request?
This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v1 table and its dependents. In more details:
1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via public API like `spark.catalog.refreshTable()`.
2. Rewritten the part in `CatalogImpl.refreshTable()` which was responsible for table meta-data refreshing because this code stopped to work properly after removing of the second `sparkSession.table(tableIdent)`.
3. Added new private method `invalidateCachedTable()` to `SessionCatalog`. Comparing to the existing `SessionCatalog.refreshTable`, it invalidates the relation cache only. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (that could lead to refreshing file index twice).

### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v1 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v1 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly.
2. To improve code maintenance.
3. To reduce the amount of calls to Hive external catalog.
4. Also this should speed up table recaching.
5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172

### Does this PR introduce _any_ user-facing change?
From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time. For example:

Before:
```scala
scala> sql("CREATE TABLE tbl (c int)")
scala> sql("CACHE TABLE tbl")
scala> sql("CREATE VIEW v AS SELECT * FROM tbl")
scala> sql("CACHE TABLE v")

scala> spark.catalog.isCached("v")
res6: Boolean = true
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = false
```

After:
```scala
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = true
```

### How was this patch tested?
1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`.
2. By running the unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
# build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31206 from MaxGekk/refreshTable-recache-by-plan.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 13:03:24 +00:00
ulysses-you da4b50f8e2 [SPARK-33901][SQL][FOLLOWUP] Add drop table in charvarchar test
### What changes were proposed in this pull request?

Add `drop table` in charvarchar sql test.

### Why are the changes needed?

1. `drop table` is also a test case, for better coverage.
2. It's more clear to drop table which created in current test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31277 from ulysses-you/SPARK-33901-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 12:41:52 +00:00
Max Gekk 31d828379c [SPARK-34099][SQL][TESTS] Check re-caching of v2 table dependents after table altering
### What changes were proposed in this pull request?
Add tests to check that v2 table dependents are re-cached after table altering via the commands:
- `ALTER TABLE .. ADD PARTITION`
- `ALTER TABLE .. DROP PARTITION`
- `ALTER TABLE .. RENAME PARTITION`

### Why are the changes needed?
To improve test coverage and prevent regressions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```

Closes #31250 from MaxGekk/check-v2-dependents-recached.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 08:42:17 +00:00
beliefer 140538ea5b [SPARK-34096][SQL] Improve performance for nth_value ignore nulls over offset window
### What changes were proposed in this pull request?
The current implement of `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame` only support `nth_value` that respect nulls. So nth_value will execute with `UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame`.
`UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame` will call `updateExpressions` of `nth_value` multiple times.

### Why are the changes needed?
Improve performance for nth_value ignore nulls over offset window

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Jenkins test

Closes #31178 from beliefer/SPARK-34096.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 07:31:36 +00:00
Kent Yao d640631e36 [SPARK-34164][SQL] Improve write side varchar check to visit only last few tailing spaces
### What changes were proposed in this pull request?

For varchar(N), we currently trim all spaces first to check whether the remained length exceeds, it not necessary to visit them all but at most to those after N.

### Why are the changes needed?

improve varchar performance for write side
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

benchmark and existing ut

Closes #31253 from yaooqinn/SPARK-34164.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 05:30:57 +00:00
Ismaël Mejía e9e81f798f [SPARK-27733][CORE] Upgrade Avro to version 1.10.1
### What changes were proposed in this pull request?

Update Avro dependency to version 1.10.1

### Why are the changes needed?

To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Since there were no API changes required we just run the tests

Closes #31232 from iemejia/SPARK-27733-avro-upgrade.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:42:27 -08:00
yangjie01 d68612a008 [SPARK-34176][BUILD] Restore the independent mvn test ability of sql/hive module in Scala 2.13
### What changes were proposed in this pull request?
There is one Java UT error when testing sql/hive module independently in Scala 2.13 after SPARK-33212,  the error message as follow:

```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 18.548 s  <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
```

This pr add a Scala-2.13 profile with dependency of `scala-parallel-collections_` to `sql/hive` module to fix the Java UT in Scala 2.13.

### Why are the changes needed?
Recover the independent mvn test ability of sql/hive module in Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Manual test

```
dev/change-scala-version.sh 2.13

mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl  sql/hive -am -DskipTests

mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl  sql/hive
```

**Before**

```
[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 18.725 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF  Time elapsed: 16.853 s  <<< ERROR!
java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)
Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport
	at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41)
	at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92)

[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:15:36.186 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:15:36.288 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:15:36.396 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.481 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   JavaDataFrameSuite.testUDAF:92->checkAnswer:41 » NoClassDefFound scala/collect...
[INFO]
[ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0
```

**After**

```
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.287 s - in org.apache.spark.sql.hive.JavaDataFrameSuite
[INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
16:12:16.697 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
16:12:17.540 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
16:12:17.654 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.58 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
```

Closes #31259 from LuciferYang/SPARK-34176.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:33:31 -08:00
yi.wu f498977222 [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance
### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

537a49fc09/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (L125)

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes #31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 13:36:14 +00:00
Chao Sun 902a08b9e6 [SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT"
### What changes were proposed in this pull request?

This passes original SQL text to `CacheTableAsSelect` command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).

### Why are the changes needed?

Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.

For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist.

On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw "Table or view not found" error in the last select query.

This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case.

This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.

### Does this PR introduce _any_ user-facing change?

Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

### How was this patch tested?

Modified/Enhanced some existing tests.

Closes #31107 from sunchao/SPARK-34052.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 02:09:39 +00:00
Max Gekk 00b444d5ed [SPARK-34056][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests
### What changes were proposed in this pull request?
1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite ` to the test suite `v2.AlterTableRecoverPartitionsSuite`.
2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31105 from MaxGekk/unify-recover-partitions-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-20 01:49:31 +00:00
Angerszhuuuu f6338a3e0b [SPARK-34121][SQL] Intersect operator missing rowCount when CBO enabled
### What changes were proposed in this pull request?

This pr add row count to `Intersect` operator when CBO enabled.

### Why are the changes needed?
Improve query performance, [JoinEstimation.estimateInnerOuterJoin](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) need the row count.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added

Closes #31240 from AngersZhuuuu/SPARK-34121.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-20 10:00:44 +09:00
Max Gekk 32dad1d5a6 [SPARK-34149][SQL] Refresh cache in v2 ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Clear table cache after adding partitions to v2 table in `AlterTableAddPartitionExec`.

### Why are the changes needed?
This PR fixes correctness issue. Without the fix, queries on cached tables modified via `ALTER TABLE .. ADD PARTITION` return incorrect results.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Added new UT to `org.apache.spark.sql.execution.command.v2.AlterTableAddPartitionSuite`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31229 from MaxGekk/v2-add-partition-recache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:42:07 +00:00
ulysses-you addbbe8339 [SPARK-33939][SQL] Make Column.named UnresolvedExtractValue use UnresolvedAlias to assign name
### What changes were proposed in this pull request?

Change `Column.named` code to let expression check  if exists `UnresolvedExtractValue` and use `UnresolvedAlias` to assign name.

### Why are the changes needed?

It's more reasonable to treat user specify expression as unresolved expression then we should assign name after analyze.

Let's say we have this code
```
spark.range(1).selectExpr("id as id1", "id as id2")
  .selectExpr("cast(struct(id1, id2).id1 as int)")
```

before this PR, the field name is `CAST(struct(id1, id2)[id1] AS INT)`.

After, the field name is `CAST(struct(id1, id2).id1 AS INT)`.

### Does this PR introduce _any_ user-facing change?

Yes, the default field name may be changed.

### How was this patch tested?

Add test.

Closes #30974 from ulysses-you/SPARK-33939-0.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:35:56 +00:00
Kent Yao 6fa2fb9eb5 [SPARK-34130][SQL] Impove preformace for char varchar padding and length check with StaticInvoke
### What changes were proposed in this pull request?

This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression.

here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generate 20K bytes, we are trying to reduce it to less than 8k
### Why are the changes needed?

performance improvement as in the PR benchmark test, the performance  w/ codegen is 2~3x better than w/o codegen.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes, it's a code reflect so the existing ut should be enough

cross-check with https://github.com/apache/spark/pull/31012 where the tpcds shall all pass

benchmark compared with master

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 09:03:06 +00:00
Yuming Wang 030639f456 [SPARK-34119][SQL] Keep necessary stats after partition pruning
### What changes were proposed in this pull request?

This pr keep necessary stats after partition pruning.

### Why are the changes needed?

Improve query performance. It will push down aggregate since SPARK-34081 because it can be planed as BroadcastHashJoin. But it lacks column statistics after [`PruneFileSourcePartitions`](d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)). Therefore, it will eventually be planned as SortMergeJoin.

Please see the log:
```
join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4)
join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test

SQL | Before this PR(Seconds) | After this PR(Seconds)
-- | -- | --
q14a | 594 | 384
q14b | 600 | 402

This change will not affect the results of `PlanStabilitySuite`, because it does not have partition column.

Closes #31205 from wangyum/SPARK-34119.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 06:09:16 +00:00
Max Gekk a98e77c113 [SPARK-34143][SQL][TESTS] Fix adding partitions to fully partitioned v2 tables
### What changes were proposed in this pull request?
While adding new partition to v2 `InMemoryAtomicPartitionTable`/`InMemoryPartitionTable`, add single row to the table content when the table is fully partitioned.

### Why are the changes needed?
The `ALTER TABLE .. ADD PARTITION` command does not change content of fully partitioned v2 table. For instance, `INSERT INTO` changes table content:
```scala
      sql(s"CREATE TABLE t (p0 INT, p1 STRING) USING _ PARTITIONED BY (p0, p1)")
      sql(s"INSERT INTO t SELECT 1, 'def'")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
|1  |def|
+---+---+
```
but `ALTER TABLE .. ADD PARTITION` doesn't change v2 table content:
```scala
      sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
+---+---+
```

### Does this PR introduce _any_ user-facing change?
No, the changes impact only on tests but for the example above in tests:
```scala
      sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')")
      sql(s"SELECT * FROM t").show(false)

+---+---+
|p0 |p1 |
+---+---+
|0  |abc|
+---+---+
```

### How was this patch tested?
By running the unified tests for `ALTER TABLE .. ADD PARTITION`.

Closes #31216 from MaxGekk/add-partition-by-all-columns.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 05:40:15 +00:00
ulysses-you 055124a048 [SPARK-34150][SQL] Strip Null literal.sql in resolve alias
### What changes were proposed in this pull request?

Change null Literal to PrettyAttribute during ResolveAlias.

### Why are the changes needed?

We will convert `Literal(null)` to target data type during analysis. Then the generated alias name will include something like `CAST(NULL AS String)` instead of `NULL`.
```
spark.sql("SELECT RAND(null)").columns

-- before
rand(CAST(NULL AS INT))

-- after
rand(NULL)
```

### Does this PR introduce _any_ user-facing change?

Yes, the default column name maybe changed.

### How was this patch tested?

Add test and pass exists test.

Closes #31233 from ulysses-you/SPARK-34150.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-19 03:35:08 +00:00
Max Gekk bea10a6274 [SPARK-34153][SQL] Remove unused getRawTable() from HiveExternalCatalog.alterPartitions()
### What changes were proposed in this pull request?
Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`.

### Why are the changes needed?
It reduces the number of calls to Hive External catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31234 from MaxGekk/remove-getRawTable-from-alterPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-19 11:42:33 +09:00
Max Gekk 514172aae7 [SPARK-34074][SQL][TESTS][FOLLOWUP] Fix table size parsing from statistics
### What changes were proposed in this pull request?
Fix table size parsing from the `Statistics` field which is formed at: c3d81fbe79/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (L573) . Before the fix, `getTableSize()` returns only the last digit. This works for Hive table in the tests because its size < 10 bytes, and accidentally works for V1 In-Memory catalog table in the tests.

### Why are the changes needed?
This makes tests more reliable. For example, the parsing can not work in `AlterTableDropPartitionSuite` when table size before partition dropping:
```
+---------+
|data_type|
+---------+
|878 bytes|
+---------+
```

After:
```
+---------+
|data_type|
+---------+
|439 bytes|
+---------+
```
at:
```scala
        val onePartSize = getTableSize(t)
        assert(0 < onePartSize && onePartSize < twoPartSize)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
```

Closes #31237 from MaxGekk/optimize-updateTableStats-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-19 09:51:20 +09:00
Liang-Chi Hsieh bba830f523 [SPARK-34148][TEST][SS] Move general StateStore tests to StateStoreSuiteBase
### What changes were proposed in this pull request?

This patch moves move general StateStore-related tests into `StateStoreSuiteBase`.

### Why are the changes needed?

There are some general StateStore tests in `StateStoreSuite` which is `HDFSBackedStateStoreProvider`-specific test suite. We should move general tests into `StateStoreSuiteBase`, so it is easier to incorporate other StateStores.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes #31219 from viirya/SPARK-34148.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-18 13:03:25 -08:00
Max Gekk dee596e3ef [SPARK-34027][SQL] Refresh cache in ALTER TABLE .. RECOVER PARTITIONS
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0	0
```

The last query must return `0	1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl recover partitions;
spark-sql> select * from tbl;
0	0
0	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #31066 from MaxGekk/recover-partitions-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-18 13:52:39 +00:00
yangjie01 163afa6fcf [SPARK-34151][SQL] Replaces java.io.File.toURL with java.io.File.toURI.toURL
### What changes were proposed in this pull request?
`java.io.FIle.toURL` method does not automatically escape characters that are illegal in URLs.

Java doc recommended that new code convert an abstract pathname into a URL by first converting it into a URI, via the `toURI` method, and then converting the URI into a URL via the `URI.toURL` method.

So this pr cleaned up the relevant cases in Spark code.

### Why are the changes needed?
Cleaning up `Deprecated` Java API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31230 from LuciferYang/SPARK-34151.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 21:39:00 +09:00
Terry Kim 78893b8dc9 [SPARK-34139][SQL] UnresolvedRelation should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect:
```
scala> sql("CACHE TABLE unknown")
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `12`.

This PR proposes to fix this issue for commands using `UnresolvedRelation`:
```
CACHE TABLE unknown
UNCACHE TABLE unknown
DELETE FROM unknown
UPDATE unknown SET name='abc'
MERGE INTO unknown1 AS target USING unknown2 AS source ON target.col = source.col WHEN MATCHED THEN DELETE
INSERT INTO TABLE unknown SELECT 1
INSERT OVERWRITE TABLE unknown VALUES (1, 'a')
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 12;
```

### How was this patch tested?

Add a new test.

Closes #31209 from imback82/unresolved_relation_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-18 05:57:25 +00:00
Yuming Wang c87b0085c9 [SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8
### What changes were proposed in this pull request?

Hive 2.3.8 changes:
HIVE-19662: Upgrade Avro to 1.8.2
HIVE-24324: Remove deprecated API usage from Avro
HIVE-23980: Shade Guava from hive-exec in Hive 2.3
HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
HIVE-24512: Exclude calcite in packaging.
HIVE-22708: Fix for HttpTransport to replace String.equals
HIVE-24551: Hive should include transitive dependencies from calcite after shading it
HIVE-24553: Exclude calcite from test-jar dependency of hive-exec

### Why are the changes needed?

Upgrade Avro and Parquet to latest version.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test add test try to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517

Closes #30657 from wangyum/SPARK-33696.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-17 21:54:35 -08:00
Terry Kim 0540392464 [SPARK-34140][SQL] Move QueryCompilationErrors.scala and QueryExecutionErrors.scala to org/apache/spark/sql/errors
### What changes were proposed in this pull request?

`QueryCompilationErrors.scala` and `QueryExecutionErrors.scala` use the `org.apache.spark.sql.errors` package, but these files are reside in `org/apache/spark/sql` directory. This PR proposes to move these files to `org/apache/spark/sql/errors`.

### Why are the changes needed?

To match the package name with the directory structure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #31211 from imback82/error_package.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-16 13:40:21 -08:00
Max Gekk c3d81fbe79 [SPARK-34060][SQL][FOLLOWUP] Preserve serializability of canonicalized CatalogTable
### What changes were proposed in this pull request?
Replace `toMap` by `map(identity).toMap` while getting canonicalized representation of `CatalogTable`. `CatalogTable` became not serializable after https://github.com/apache/spark/pull/31112 due to usage of `filterKeys`. The workaround was taken from https://github.com/scala/bug/issues/7005.

### Why are the changes needed?
This prevents the errors like:
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
[info]   Cause: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
```

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the test suite affected by https://github.com/apache/spark/pull/31112:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31197 from MaxGekk/fix-caching-hive-table-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 17:02:29 -08:00
Chao Sun b6f46ca297 [SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This:
1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x.
2. upgrade built-in version for Hadoop 3.x to Hadoop 3.2.2

Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side affect from this is we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach by switching to shaded client jars provided by Hadoop. This also has the benefits of avoid pulling other 3rd party dependencies from Hadoop side so as to avoid more potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #30701 from sunchao/test-hadoop-3.2.2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 14:06:50 -08:00
Kent Yao a235c3b254 [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally
### What changes were proposed in this pull request?
The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.

### Why are the changes needed?

rm unnecessary logic which may cause potential performance regressions

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

tpcds tests for plan

Closes #31079 from yaooqinn/SPARK-34037.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-15 10:18:58 -08:00
yangjie01 9e33d49b5b [SPARK-33346][CORE][SQL][MLLIB][DSTREAM][K8S] Change the never changed 'var' to 'val'
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this pr turn these  from `var` to  `val` except for `mockito` related cases.

### Why are the changes needed?
Use `val` instead of `var` when possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31142 from LuciferYang/SPARK-33346.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-15 08:47:02 -06:00
Wenchen Fan 6cd0092150 Revert "[SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled"
This reverts commit f1b21ba505.
2021-01-15 21:45:17 +08:00
Peter Toth 00d43b1f82 [SPARK-32864][SQL] Support ORC forced positional evolution
### What changes were proposed in this pull request?
Add support for `orc.force.positional.evolution` config that forces ORC top level column matching by position rather than by name.

This does work in Hive:
```
> set orc.force.positional.evolution;
+--------------------------------------+
|                 set                  |
+--------------------------------------+
| orc.force.positional.evolution=true  |
+--------------------------------------+
> create table t (c1 string, c2 string) stored as orc;
> insert into t values ('foo', 'bar');
> alter table t change c1 c3 string;
```
The orc file in this case contains the original `c1` and `c2` columns that doesn't match the metadata in HMS. But due to the positional evolution setting, Hive is capable to return all the data:
```
> select * from t;
+--------+--------+
| t.c3   | t.c2   |
+--------+--------+
| foo    | bar    |
+--------+--------+
```
Without this PR Spark returns `null`s for the renamed `c3` column.

After this PR Spark returns the data in `c3` column.

### Why are the changes needed?
Hive/ORC does support it.

### Does this PR introduce _any_ user-facing change?
Yes, we will support `orc.force.positional.evolution`.

### How was this patch tested?
New UT.

Closes #29737 from peter-toth/SPARK-32864-support-orc-forced-positional-evolution.

Lead-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-14 21:27:25 -08:00