Commit graph

29071 commits

Author SHA1 Message Date
Takeshi Yamamuro 414d323d6c [SPARK-33988][SQL][TEST] Add an option to enable CBO in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. This option is useful for monitoring performance changes with CBO enabled.

### Why are the changes needed?

To monitor performance changes with CBO enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:31:20 -08:00
Max Gekk fc3f22645e [SPARK-33990][SQL][TESTS] Remove partition data by v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests.

### Why are the changes needed?
This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belonging to the dropped partition. As a consequence, a `select` query returns the removed data.
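
A minimal sketch of the symptom, assuming hypothetical v2 catalog/table names (`testcat.tbl` is illustrative, not from this PR):
```scala
// Hypothetical v2 catalog/table, only to illustrate the bug being fixed.
spark.sql("CREATE TABLE testcat.tbl (id INT, part INT) USING foo PARTITIONED BY (part)")
spark.sql("INSERT INTO testcat.tbl PARTITION (part = 0) SELECT 0")
spark.sql("INSERT INTO testcat.tbl PARTITION (part = 1) SELECT 1")
spark.sql("ALTER TABLE testcat.tbl DROP PARTITION (part = 0)")
// Before the fix, the row from the dropped partition could still show up here.
spark.sql("SELECT * FROM testcat.tbl").show()
```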

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running tests suites for v1 and v2 catalogs:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31014 from MaxGekk/fix-drop-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:26:39 -08:00
HyukjinKwon 6b86aa0b52 [SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1
### What changes were proposed in this pull request?

This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements.
It includes one bug fix (4152353ac1).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes #31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:23:38 -08:00
Terry Kim ddc0d5148a [SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables
### What changes were proposed in this pull request?

This PR proposes to implement `DESCRIBE COLUMN` for v2 tables.

Note that the `isExtended` option is not implemented in this PR.

### Why are the changes needed?

Parity with v1 tables.

### Does this PR introduce _any_ user-facing change?

Yes, now, `DESCRIBE COLUMN` works for v2 tables.
```scala
sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
```
```
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|      data|
|data_type|    string|
|  comment|     hello|
+---------+----------+
```

Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.`

### How was this patch tested?

Added new test.

Closes #30881 from imback82/describe_col_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 16:14:33 +00:00
angerszhu 8583a4605f [SPARK-33844][SQL] InsertIntoHiveDir command should check col name too
### What changes were proposed in this pull request?

In hive-1.2.1, the Hive serde just splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` by comma.

When we run the following UT with Spark 2.4:
```
  test("insert overwrite directory with comma col name") {
    withTempDir { dir =>
      val path = dir.toURI.getPath

      val v1 =
        s"""
           | INSERT OVERWRITE DIRECTORY '${path}'
           | STORED AS TEXTFILE
           | SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
         """.stripMargin

      sql(v1).explain(true)

      sql(v1).show()
    }
  }
```
it fails as below, since a column name contains `,`, so the column name and column type lists end up with different sizes.
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter:  [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

```

After hive-2.3, COLUMN_NAME_DELIMITER is set to a special char when a column name contains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)

And in script transform, we parse column names to avoid this problem:
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)

So I think `InsertIntoHiveDirCommand` should do the same thing too. I have verified this method makes spark-2.4 work well.

### Why are the changes needed?
Safer use of the serde.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #30850 from AngersZhuuuu/SPARK-33844.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 09:43:15 +00:00
Dongjoon Hyun 271c4f6e00 [SPARK-33978][SQL] Support ZSTD compression in ORC data source
### What changes were proposed in this pull request?

This PR aims to support ZSTD compression in ORC data source.

### Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost.
- https://issues.apache.org/jira/browse/ORC-363

**BEFORE**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
```

**AFTER**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
```

```bash
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
```

### Does this PR introduce _any_ user-facing change?

Yes, this is a new feature.

### How was this patch tested?

Pass the newly added test case.

Closes #31002 from dongjoon-hyun/SPARK-33978.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 00:54:47 -08:00
Max Gekk 8b3fb43f40 [SPARK-33965][SQL][TESTS] Recognize spark_catalog by CACHE TABLE in Hive table names
### What changes were proposed in this pull request?
Remove the special handling of `CacheTable` in `TestHiveQueryExecution.analyzed` because it prevents supporting `spark_catalog` in Hive table names. `spark_catalog` can be handled by the few lines below:
```scala
      case UnresolvedRelation(ident, _, _) =>
        if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) {
```
added by https://github.com/apache/spark/pull/30883.
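
A minimal usage sketch of what this enables, assuming an illustrative Hive table name (not taken from the PR):
```scala
// Illustrative Hive table qualified with the session catalog name.
spark.sql("CREATE TABLE spark_catalog.default.src (id INT) USING hive")
spark.sql("CACHE TABLE spark_catalog.default.src")
spark.sql("SELECT * FROM spark_catalog.default.src").show()
```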

### Why are the changes needed?
1. To have feature parity with v1 In-Memory catalog.
2. To be able to write unified tests for In-Memory and Hive external catalogs.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30997 from MaxGekk/cache-table-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 08:28:26 +00:00
Hoa 0b647fe69c [SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
### What changes were proposed in this pull request?
The JDBC SQL TIME type is incorrectly represented as TimestampType; we change it to a physical Int in millis for now.

### Why are the changes needed?
Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark TimestampType. It should instead be represented as a physical int in millis: a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond, storing the number of milliseconds after midnight, 00:00:00.000.
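
A small sketch of the intended representation (plain Scala, not Spark code; the value is illustrative):
```scala
import java.time.LocalTime

// A TIME value stored as the number of milliseconds after midnight.
val t = LocalTime.of(13, 45, 30, 250000000) // 13:45:30.250
val millisAfterMidnight = (t.toNanoOfDay / 1000000L).toInt
// millisAfterMidnight == 49530250
```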

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Close #30902

Closes #30902 from saikocat/SPARK-33888.

Lead-authored-by: Hoa <hoameomu@gmail.com>
Co-authored-by: Hoa <saikocatz@gmail.com>
Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn>
Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 06:53:12 +00:00
angerszhu adac633f93 [SPARK-33934][SQL] Add SparkFile's root dir to env property PATH
### What changes were proposed in this pull request?
In hive we always use
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
Since in Spark we wrap the script command with `/bin/bash -c`, in this case we get a `script.py: command not found` error.

This PR adds the SparkFiles root dir path to the execution env property `PATH`, so the sub-process will find `script.py` as a program under `PATH`.
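
A minimal sketch of the idea (not the exact code change; `script.py` is the illustrative program name):
```scala
import java.io.File
import org.apache.spark.SparkFiles

// Prepend the SparkFiles root directory to the child process PATH so that
// a file added via `add file` resolves by bare name.
val builder = new ProcessBuilder("script.py")
val env = builder.environment()
env.put("PATH", SparkFiles.getRootDirectory() + File.pathSeparator + env.get("PATH"))
```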

### Why are the changes needed?
Support SQL migration from Hive to Spark.

### Does this PR introduce _any_ user-facing change?
Users can directly use the script file name as the program in script transform SQL.

```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
### How was this patch tested?
UT

Closes #30973 from AngersZhuuuu/SPARK-33934.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-04 15:46:49 +09:00
Yuming Wang 2a68ed71e4 [SPARK-33954][SQL] Some operator missing rowCount when enable CBO
### What changes were proposed in this pull request?

This PR fixes some operators missing rowCount when CBO is enabled, e.g.:
```scala
spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("set spark.sql.cbo.planStats.enabled=true")
spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```

After this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)

```

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30987 from wangyum/SPARK-33954.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:53:14 +00:00
gengjiaan b037930952 [SPARK-33951][SQL] Distinguish the error between filter and distinct
### What changes were proposed in this pull request?
The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use.

### Why are the changes needed?
increase readability and ease of use.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test

Closes #30982 from beliefer/SPARK-33951.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:44:00 +00:00
Max Gekk 67195d0d97 [SPARK-33950][SQL] Refresh cache in v1 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0	0
1	1
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
0	0
1	1
```
The last query must not return `0	0` since that row was deleted by the previous command.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
1	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30983 from MaxGekk/drop-partition-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 04:11:39 +00:00
Ruifeng Zheng 6b7527e381 [SPARK-33398] Fix loading tree models prior to Spark 3.0
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added into `NodeData`, which causes tree models trained in 2.4 to fail to load in 3.0/3.1/master.
The field `rawCount` is only used in training, and not in `transform`/`predict`/`featureImportance`, so I just set it to -1L.
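
A hedged sketch of the idea (the path and column handling are illustrative, not the actual ML loader code):
```scala
import org.apache.spark.sql.functions.lit

// If the 2.4-era node data has no `rawCount` column, default it to -1L
// instead of failing the load; the value is unused outside of training.
val raw = spark.read.parquet("/path/to/old/model/data") // hypothetical path
val nodeData =
  if (raw.columns.contains("rawCount")) raw
  else raw.withColumn("rawCount", lit(-1L))
```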

### Why are the changes needed?
To support loading old tree models in 3.0/3.1/master.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added test suites.

Closes #30889 from zhengruifeng/fix_tree_load.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-03 11:52:46 -06:00
Liang-Chi Hsieh 963c60fe49 [SPARK-33955][SS] Add latest offsets to source progress
### What changes were proposed in this pull request?

This patch proposes to add the latest offset to the source progress for streaming queries.

### Why are the changes needed?

Currently we record start and end offsets per source in the streaming progress. The latest offset is important information for a streaming process, but the progress lacks this info. We can use it to track the processing lag and adjust streaming queries. We should add the latest offset to the source progress.
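
A small usage sketch of reading the new field from query progress (the rate source is just an illustrative input):
```scala
// After this change, each source's progress should also expose the latest
// available offset, so lag can be derived from endOffset vs. latestOffset.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
val progress = query.lastProgress
if (progress != null) {
  progress.sources.foreach { s =>
    println(s"start=${s.startOffset} end=${s.endOffset} latest=${s.latestOffset}")
  }
}
```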

### Does this PR introduce _any_ user-facing change?

Yes, a new metric for the latest source offset is added to the source progress.

### How was this patch tested?

Unit test. Manually test in Spark cluster:

```
    "description" : "KafkaV2[Subscribe[page_view_events]]",
    "startOffset" : {
      "page_view_events" : {
        "2" : 582370921,
        "4" : 391910836,
        "1" : 631009201,
        "3" : 406601346,
        "0" : 195799112
      }
    },
    "endOffset" : {
      "page_view_events" : {
        "2" : 583764414,
        "4" : 392338002,
        "1" : 632183480,
        "3" : 407101489,
        "0" : 197304028
      }
    },
    "latestOffset" : {
      "page_view_events" : {
        "2" : 589852545,
        "4" : 394204277,
        "1" : 637313869,
        "3" : 409286602,
        "0" : 203878962
      }
    },
    "numInputRows" : 4999997,
    "inputRowsPerSecond" : 29287.70501405811,
```

Closes #30988 from viirya/latest-offset.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-03 01:31:38 -08:00
Liang-Chi Hsieh cfd4a08398 [SPARK-33962][SS] Fix incorrect min partition condition
### What changes were proposed in this pull request?

This patch fixes an incorrect condition when comparing offset range size and min partition config.

### Why are the changes needed?

When calculating offset ranges, we consider the `minPartitions` configuration. If `minPartitions` is not set or is less than or equal to the size of the given ranges, there are enough partitions at Kafka, so we don't need to split offsets to satisfy the min partition requirement. But the current condition is `offsetRanges.size > minPartitions.get`, which is not correct; `getRanges` currently splits offsets in unnecessary cases.

Besides, in the non-split case, we can assign a preferred executor location and reuse the `KafkaConsumer`. So unnecessarily splitting the offset range misses the chance to reuse the `KafkaConsumer`.
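
A simplified sketch of the corrected check (not the exact Spark code; names are illustrative):
```scala
// Only split offset ranges when minPartitions is set and strictly greater
// than the number of ranges Kafka already provides.
def needsSplit(numRanges: Int, minPartitions: Option[Int]): Boolean =
  minPartitions.isDefined && minPartitions.get > numRanges

assert(!needsSplit(numRanges = 4, minPartitions = Some(4))) // enough partitions already, keep ranges and consumer reuse
assert(needsSplit(numRanges = 4, minPartitions = Some(8)))  // must split to reach minPartitions
```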

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Manual test in Spark cluster with Kafka.

Closes #30994 from viirya/ss-minor4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-03 01:29:12 -08:00
Max Gekk fc7d0165d2 [SPARK-33963][SQL] Canonicalize HiveTableRelation w/o table stats
### What changes were proposed in this pull request?
Skip table stats when canonicalizing `HiveTableRelation`.

### Why are the changes needed?
The changes fix a regression compared to Spark 3.0; see SPARK-33963.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, Spark behaves as in version 3.0.1.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30995 from MaxGekk/fix-caching-hive-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 11:23:46 +09:00
Yuming Wang 6c5ba8169a [SPARK-33959][SQL] Improve the statistics estimation of the Tail
### What changes were proposed in this pull request?

This PR improves the statistics estimation of `Tail`:

```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1")
println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats)
```

Before this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=3.8 KiB)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

After this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

### Why are the changes needed?

Improve statistics estimation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30991 from wangyum/SPARK-33959.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 10:59:12 +09:00
Dongjoon Hyun 1c25bea0bb [SPARK-33961][BUILD] Upgrade SBT to 1.4.6
### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.4.6 to fix the SBT regression.

### Why are the changes needed?

[SBT 1.4.6](https://github.com/sbt/sbt/releases/tag/v1.4.6) has the following fixes
- Updates to Coursier 2.0.8, which fixes the cache directory setting on Windows
- Fixes performance regression in shell tab completion
- Fixes match error when using withDottyCompat
- Fixes thread-safety in AnalysisCallback handler

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30993 from dongjoon-hyun/SPARK-SBT-1.4.6.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-02 14:49:03 -08:00
Yuming Wang 4cd680581a [SPARK-33956][SQL] Add rowCount for Range operator
### What changes were proposed in this pull request?

This PR adds rowCount for the `Range` operator:
```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("select id from range(100)").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
```

After this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100)
```

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30989 from wangyum/SPARK-33956.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-02 08:58:48 -08:00
William Hyun bd346f4a2d [SPARK-33957][BUILD] Update commons-lang3 to 3.11
### What changes were proposed in this pull request?

This PR aims to update commons-lang3 to 3.11 to support Java 16+ better.

### Why are the changes needed?

commons-lang3 has the following bug fixes and Java 16 support.
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?
Pass the CIs.

Closes #30990 from williamhyun/Commons-lang3.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-01 19:59:17 -08:00
Baohe Zhang 45df6db906 [SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics
### What changes were proposed in this pull request?
Check if executorSummary.peakMemoryMetrics is defined before accessing it. Without the check, the UI risks getting stuck at the Executors page.

### Why are the changes needed?
The app live UI may get stuck at the Executors page without this fix.
Steps to reproduce (with master branch):
In macOS standalone mode, open a spark-shell:
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

val x = sc.makeRDD(1 to 100000, 5)
x.count()

Then open the app UI in the browser and click the Executors page; it will get stuck at this page:
![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png)

Also, the JSON returned from the API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors misses "peakMemoryMetrics" for executor objects. I attached the full JSON text in https://issues.apache.org/jira/browse/SPARK-33906.

I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test, rerun the steps of bug reproduce and see the bug is gone.

Closes #30920 from baohe-zhang/SPARK-33906.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:34:55 -08:00
Kent Yao ed9f728801 [SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options
### What changes were proposed in this pull request?

While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in SharedState, we should put the log statement inside the code block where the warehouse keys are being handled.

### Why are the changes needed?

Bug fix: remove the ambiguous log emitted when setting spark.sql.warehouse.dir via SparkSession.builder.config, which only warned about setting hive.metastore.warehouse.dir.
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

New tests.

Closes #30978 from yaooqinn/SPARK-33944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:20:31 -08:00
angerszhu 771c538620 [SPARK-33084][SQL][TESTS][FOLLOW-UP] Fix Scala 2.13 UT failure
### What changes were proposed in this pull request?
Fix UT according to  https://github.com/apache/spark/pull/29966#issuecomment-752830046

Change the StructType construction from
```
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
```
to
```
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)
```
The whole udf class is :

```
package org.apache.spark.examples.sql

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

class Spark33084 extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)

  // Data types of values in the aggregation buffer
  def bufferSchema: StructType =
    new StructType().add("sum", LongType).add("count", LongType)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
```

### Why are the changes needed?
Fix UT for scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #30980 from AngersZhuuuu/spark-33084-followup.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:18:31 -08:00
yi.wu 3fe5614a7c [SPARK-31946][CORE] Make worker/executor decommission signal configurable
### What changes were proposed in this pull request?

This PR proposed to make worker/executor decommission signal configurable.

* Added confs: `spark.worker.decommission.signal` / `spark.executor.decommission.signal`
* Rename `WorkerSigPWRReceived`/ `ExecutorSigPWRReceived` to `WorkerDecomSigReceived`/ `ExecutorDecomSigReceived`

### Why are the changes needed?

The current signal `PWR` doesn't work on macOS since it's not part of POSIX, which macOS conforms to. So developers currently can't do end-to-end decommission tests on their macOS environment.

Besides, the configuration gives users more flexibility in case the default signal (`PWR`) conflicts with their own applications/environment.
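
A minimal configuration sketch (the signal value is illustrative; any signal supported by the OS can be used):
```scala
import org.apache.spark.SparkConf

// Override the default decommission signal (PWR) with a different one.
val conf = new SparkConf()
  .set("spark.worker.decommission.signal", "TERM")
  .set("spark.executor.decommission.signal", "TERM")
```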

### Does this PR introduce _any_ user-facing change?

No (it's a new API for 3.2)

### How was this patch tested?

Manually tested.

Closes #30968 from Ngone51/configurable-decom-signal.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:13:02 -08:00
Pradyumn Agrawal (pradyumn.ag) 13e8c28409 [SPARK-33942][DOCS] Remove hiveClientCalls.count in CodeGenerator metrics docs
### What changes were proposed in this pull request?
Removed **hiveClientCalls.count** from the CodeGenerator metrics under Component instance = Executor.

### Why are the changes needed?
Wrong information regarding metrics was being displayed in the monitoring documentation. I had referred to this documentation when adding metrics logging in Graphite, but this metric was not being reported, and I had to check whether the issue was in my application, the Spark code, or the documentation. The documentation had the wrong info.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)

Closes #30976 from coderbond007/SPARK-33942.

Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-30 17:25:46 -08:00
yangjie01 85de644733 [SPARK-33804][CORE] Fix compilation warnings about 'view bounds are deprecated'
### What changes were proposed in this pull request?

There are only 3 compilation warnings related to `view bounds are deprecated` in Spark Code:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:55: view bounds are deprecated; use an implicit parameter instead.
```

This PR tries to fix these compilation warnings.
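
A hedged illustration of the kind of rewrite involved (names are illustrative, not the actual SequenceFileRDDFunctions code):
```scala
// Before (deprecated view bound):
//   def firstAsString[T <% String](xs: Seq[T]): String = xs.head
// After (equivalent implicit conversion parameter):
def firstAsString[T](xs: Seq[T])(implicit ev: T => String): String = ev(xs.head)
```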

### Why are the changes needed?
Fix compilation warnings about `view bounds are deprecated`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30924 from LuciferYang/SPARK-33804.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-30 13:57:44 -06:00
Liang-Chi Hsieh f38265ddda [SPARK-33907][SQL] Only prune columns of from_json if parsing options is empty
### What changes were proposed in this pull request?

As a follow-up task to SPARK-32958, this patch takes a safer approach and only prunes columns from JsonToStructs if the parsing options are empty. This is to avoid unexpected behavior changes regarding parsing.

This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed.
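
A hedged sketch of the guarded case (data and names are illustrative):
```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// With non-empty parsing options such as FAILFAST mode, pruning the schema
// could change which records fail to parse, so pruning only applies when the
// options map is empty.
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")
df.select(from_json($"json", schema, Map("mode" -> "FAILFAST")).getField("a")).show()
```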

### Why are the changes needed?

It is to avoid unexpected behavior change regarding parsing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30970 from viirya/SPARK-33907-3.2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-12-30 09:57:15 -08:00
gengjiaan ba974ea8e4 [SPARK-30789][SQL] Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
### What changes were proposed in this pull request?
All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS | RESPECT NULLS. For example:
```
LEAD (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
NTH_VALUE (expr, offset)
[ IGNORE NULLS | RESPECT NULLS ]
OVER
( [ PARTITION BY window_partition ]
[ ORDER BY window_ordering
 frame_clause ] )
```
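
A hedged usage sketch of the new syntax (table and column names are illustrative):
```scala
spark.sql("""
  SELECT id,
         LAST_VALUE(v) IGNORE NULLS OVER (PARTITION BY grp ORDER BY id) AS last_non_null
  FROM t
""").show()
```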

Mainstream databases and engines that support this syntax include:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html
https://docs.snowflake.com/en/sql-reference/functions/nth_value.html
https://docs.snowflake.com/en/sql-reference/functions/first_value.html
https://docs.snowflake.com/en/sql-reference/functions/last_value.html

**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

### Why are the changes needed?
Supporting `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test

Closes #30943 from beliefer/SPARK-30789.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 13:14:31 +00:00
Max Gekk 2afd1fb492 [SPARK-33904][SQL] Recognize spark_catalog in saveAsTable() and insertInto()
### What changes were proposed in this pull request?
In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names.

### Why are the changes needed?
1. To simplify writing of unified v1 and v2 tests
2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` sql command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`:
```scala
scala> sql("CREATE NAMESPACE spark_catalog.ns")
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
  ... 47 elided
scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
  ... 47 elided
```
but `INSERT INTO` succeeds:
```sql
spark-sql> create table spark_catalog.ns.tbl (c int);
spark-sql> insert into spark_catalog.ns.tbl select 0;
spark-sql> select * from spark_catalog.ns.tbl;
0
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```scala
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl")
scala> spark.table("spark_catalog.ns.tbl").show(false)
+-----+
|value|
+-----+
|0    |
|1    |
+-----+
```

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.FileFormatWriterSuite"
```

Closes #30919 from MaxGekk/insert-into-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:56:34 +00:00
Max Gekk 0eb4961ca8 [SPARK-33926][SQL] Improve the error message from resolving of v1 database name
### What changes were proposed in this pull request?
1. Replace `SessionCatalogAndNamespace` by `DatabaseInSessionCatalog` in resolving database name from v1 session catalog.
2. Throw more precise errors from `DatabaseInSessionCatalog`
3. Fix expected error messages in `v1.ShowTablesSuiteBase`

Closes #30947

### Why are the changes needed?
Current error message "multi-part identifier cannot be empty" may confuse users. And this error message is just a consequence of "incorrectly" applied an implicit class. For example, `SHOW TABLES IN spark_catalog`:

1. Spark cuts off `spark_catalog` from namespaces in `SessionCatalogAndNamespace`, so, `ns == Seq.empty` here: 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L365)
2. Then `ns.length != 1` is `true` and Spark tries to raise the exception at 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L367)
3.  ... but `ns.quoted` triggers implicit wrapping `Seq.empty` by `MultipartIdentifierHelper`, and hit to the second check `if (parts.isEmpty)` at 156704ba0d/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala (L120-L122)

So, Spark throws the exception at the third step instead of `new AnalysisException(s"The database name is not valid: $quoted")` at the second step. And even at the second step, the exception doesn't show the actual reason as it is pretty generic.

### Does this PR introduce _any_ user-facing change?
Yes, in the case of v1 DDL commands when a database is not specified or a nested database is set.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30963 from MaxGekk/database-in-session-catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:52:34 +00:00
Hyukjin Kwon 403bf55cbe [SPARK-33927][BUILD] Fix Dockerfile for Spark release to work
### What changes were proposed in this pull request?

This PR proposes to fix the `Dockerfile` for Spark release.

- Port b135db3b1a to `Dockerfile`
- Upgrade Ubuntu 18.04 -> 20.04 (because of porting b135db3)
- Remove Python 2 (because of Ubuntu upgrade)
- Use built-in Python 3.8.5 (because of Ubuntu upgrade)
- Node.js 11 -> 12 (because of Ubuntu upgrade)
- Ruby 2.5 -> 2.7 (because of Ubuntu upgrade)
- Python dependencies and Jekyll + plugins upgrade to the latest as it's used in GitHub Actions build (unrelated to the issue itself)

### Why are the changes needed?

To make a Spark release :-).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested via:

```bash
cd dev/create-release/spark-rm
docker build -t spark-rm --build-arg UID=$UID .
```

```
...
Successfully built 516d7943634f
Successfully tagged spark-rm:latest
```

Closes #30971 from HyukjinKwon/SPARK-33927.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 16:37:23 +09:00
Liang-Chi Hsieh 4a669f5830 [MINOR][SS] Call fetchEarliestOffsets when it is necessary
### What changes were proposed in this pull request?

This minor patch makes two variables that call `fetchEarliestOffsets` `lazy`, because these values are not always needed.
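
A stubbed illustration of the change (not the actual Spark code):
```scala
// Stand-in for the real consumer/admin call that issues Kafka RPCs.
def fetchEarliestOffsets(): Map[Int, Long] = {
  println("issuing Kafka RPCs...")
  Map(0 -> 0L, 1 -> 0L)
}

// Declaring the value lazy defers the RPCs until (and unless) it is first used.
lazy val earliestOffsets: Map[Int, Long] = fetchEarliestOffsets()
```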

### Why are the changes needed?

To avoid unnecessary Kafka RPC calls.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30969 from viirya/ss-minor3.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 16:15:41 +09:00
gengjiaan 687f465244 [SPARK-33890][SQL] Improve the implement of trim/trimleft/trimright
### What changes were proposed in this pull request?
The current implementation of trim/trimleft/trimright is somewhat redundant.

### Why are the changes needed?
Improve the implementation of trim/trimleft/trimright.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #30905 from beliefer/SPARK-33890.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 06:06:17 +00:00
angerszhu 49aa6ebef1 [SPARK-32684][SQL][TESTS] Add a test case to check if null value is same as Hive's '\\N' in script transformation
### What changes were proposed in this pull request?
In Hive script transform serde mode, the default NULL format is `\\N`:
```
String nullString = tbl.getProperty(
    serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
nullSequence = new Text(nullString);
```

I made a mistake here that we need to fix in Spark's code to keep the same behavior as Hive. So this adds some test cases to show the issue.

### Why are the changes needed?
Add UT.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30946 from AngersZhuuuu/SPARK-32684.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 05:28:01 +00:00
Holden Karau 448494ebcf [SPARK-33874][K8S] Handle long lived sidecars
### What changes were proposed in this pull request?

For the liveness check, when checkAllContainers is not set, we check the liveness status of the Spark container if we can find it.

### Why are the changes needed?

Some environments may deploy long-lived log-collecting sidecars which outlive the Spark application. Just because they remain alive does not mean the Spark executor should keep running.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Extended the existing pod status tests.

Closes #30892 from holdenk/SPARK-33874-handle-long-lived-sidecars.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 14:06:34 +09:00
Liang-Chi Hsieh 951afc3acc [SPARK-33932][SS] Clean up KafkaOffsetReader API document
### What changes were proposed in this pull request?

This patch cleans up KafkaOffsetReader API document.

### Why are the changes needed?

The KafkaOffsetReader API documents are duplicated between KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to centralize the doc.

This also adds missing API docs.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Doc only.

Closes #30961 from viirya/SPARK-33932.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 10:20:54 +09:00
Max Gekk 2b6836cdc2 [SPARK-33936][SQL] Add the version when connector's methods and interfaces were updated
### What changes were proposed in this pull request?
Add the `since` tag to methods and interfaces added recently.

### Why are the changes needed?
1. To follow the existing convention for Spark API.
2. To inform devs when Spark API was changed.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
`dev/scalastyle`

Closes #30966 from MaxGekk/spark-23889-interfaces-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-29 12:26:25 -08:00
Yuming Wang c42502493a [SPARK-33847][SQL][FOLLOWUP] Remove the CaseWhen should consider deterministic
### What changes were proposed in this pull request?

This PR fixes the removal of `CaseWhen` when elseValue is empty and the other outputs are null: we should also consider whether the expression is deterministic.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30960 from wangyum/SPARK-33847-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:35:01 +00:00
Max Gekk 16c594de79 [SPARK-33859][SQL][FOLLOWUP] Add version to SupportsPartitionManagement.renamePartition()
### What changes were proposed in this pull request?
Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface.

### Why are the changes needed?
To inform Spark devs when the method appears in the interface.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:30:37 +00:00
angerszhu aadda4b561 [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde
### What changes were proposed in this pull request?
For same SQL
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
hive> packet_write_wait: Connection to 10.191.58.100 port 32200: Broken pipe
```

In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep the behavior the same. Change the default ROW FORMAT FIELD DELIMIT to `\u0001`.

In Hive the default value is '1', which as a char is '\u0001':
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```

### Why are the changes needed?
Keep the same behavior as Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30958 from AngersZhuuuu/SPARK-33930.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 23:26:27 +09:00
Yuming Wang 872107f67f [SPARK-33848][SQL][FOLLOWUP] Introduce allowList for push into (if / case) branches
### What changes were proposed in this pull request?

Introduce an allowList for push into (if / case) branches to fix a potential bug.

### Why are the changes needed?

Fix a potential bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30955 from wangyum/SPARK-33848-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:34:43 +00:00
ulysses-you 3b1b209e90 [SPARK-33909][SQL] Check rand functions seed is legal at analyer side
### What changes were proposed in this pull request?

Move the check that the seed is legal to `CheckAnalysis`.

### Why are the changes needed?

It's better to check that the seed expression is legal at the analyzer side instead of at execution time, so the user gets the exception as soon as possible.
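
A hedged illustration (hypothetical query) of the intended behavior:
```scala
// An illegal seed expression should now be rejected with an AnalysisException
// at analysis time rather than failing later during execution.
spark.sql("SELECT rand('not a seed') FROM range(3)").collect()
```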

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30923 from ulysses-you/SPARK-33909.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:33:06 +00:00
Max Gekk e0d2ffec31 [SPARK-33859][SQL] Support V2 ALTER TABLE .. RENAME PARTITION
### What changes were proposed in this pull request?
1. Add `renamePartition()` to the `SupportsPartitionManagement`
2. Implement `renamePartition()` in `InMemoryPartitionTable`
3. Add v2 execution node `AlterTableRenamePartitionExec`
4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement`
5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs.
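
A hedged usage sketch of the v2 command this enables (catalog, table, and partition values are illustrative):
```scala
spark.sql("ALTER TABLE testcat.tbl PARTITION (part = 0) RENAME TO PARTITION (part = 1)")
```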

### Why are the changes needed?
To have feature parity with Datasource V1.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running the unified tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #30935 from MaxGekk/alter-table-rename-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:29:48 +00:00
yangjie01 9d6dbe0fe5 [SPARK-33775][FOLLOWUP][TEST-MAVEN][BUILD] Suppress maven compilation warnings in Scala 2.13
### What changes were proposed in this pull request?
This PR is a followup of SPARK-33775. The main change is to sync the suppression rules from `SparkBuild.scala` to `pom.xml`, so the Maven build has the same ability to suppress compilation warnings in Scala 2.13.

### Why are the changes needed?
Suppress unimportant compilation warnings in Scala 2.13 with the Maven build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Local manual test: the suppressed compilation warnings are no longer printed to the console.

Closes #30951 from LuciferYang/SPARK-33775-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 21:42:00 +09:00
Liang-Chi Hsieh f9fe742442 [SPARK-32968][SQL] Prune unnecessary columns from CsvToStructs
### What changes were proposed in this pull request?

This patch proposes to do column pruning for CsvToStructs expression if we only require some fields from it.

### Why are the changes needed?

`CsvToStructs` takes a schema parameter used to tell the CSV parser which fields need to be parsed. If `CsvToStructs` is followed by GetStructField, we can prune the schema to only parse certain fields.
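
A hedged sketch of the case this optimizes (data and names are illustrative):
```scala
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types._
import spark.implicits._

// Only field `a` of the parsed struct is used, so the CSV schema can be
// pruned down to that single field before parsing.
val schema = new StructType().add("a", IntegerType).add("b", StringType).add("c", DoubleType)
val df = Seq("1,x,2.5").toDF("csv")
df.select(from_csv($"csv", schema, Map.empty[String, String]).getField("a")).explain(true)
```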

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #30912 from viirya/SPARK-32968.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 21:37:17 +09:00
Dongjoon Hyun 2627825647 [SPARK-33931][INFRA] Recover GitHub Action build_and_test job
### What changes were proposed in this pull request?

This PR aims to recover GitHub Action `build_and_test` job.

### Why are the changes needed?

Currently, the `build_and_test` job fails to start because of the following, in master/branch-3.1 at least.
```
r-lib/actions/setup-rv1 is not allowed to be used in apache/spark.
Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace,
within a repository owned by apache or match the following:
adoptopenjdk/*, apache/*, gradle/wrapper-validation-action.
```
- https://github.com/apache/spark/actions/runs/449826457

![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png)

### Does this PR introduce _any_ user-facing change?

No. This is a test infra.

### How was this patch tested?

To check GitHub Action `build_and_test` job on this PR.

Closes #30959 from dongjoon-hyun/SPARK-33931.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 20:51:57 +09:00
yi.wu 1ef7ddd38a [SPARK-33928][SPARK-23365][TEST][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite - " Don't update target num executors when killing idle executors"
### What changes were proposed in this pull request?

Use the testing mode for the test to fix the flakiness.

### Why are the changes needed?

The test is flaky:

```scala
[info] - SPARK-23365 Don't update target num executors when killing idle executors *** FAILED *** (126 milliseconds)
[info] 1 did not equal 2 (ExecutorAllocationManagerSuite.scala:1615)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info] at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$84(ExecutorAllocationManagerSuite.scala:1617)
...
```
The root cause should be the same as https://github.com/apache/spark/pull/29773 since the test runs under non-testing mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked. The flakiness is gone after running the test hundreds of times with this fix.

Closes #30956 from Ngone51/fix-flaky-SPARK-23365.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 07:35:45 +00:00
Yuming Wang f7bdea334a [SPARK-33884][SQL] Simplify CaseWhenclauses with (true and false) and (false and true)
### What changes were proposed in this pull request?

This PR simplifies `CaseWhen` clauses with (true and false) and (false and true):

Expression | cond.nullable | After simplify
-- | -- | --
case when cond then true else false end | true | cond <=> true
case when cond then true else false end | false | cond
case when cond then false else true end | true | !(cond <=> true)
case when cond then false else true end | false | !cond
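
A hedged illustration (hypothetical query): with a non-nullable condition, `CASE WHEN cond THEN true ELSE false END` should now optimize down to just `cond`.
```scala
spark.sql("SELECT CASE WHEN id > 5 THEN true ELSE false END AS c FROM range(10)").explain(true)
```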

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30898 from wangyum/SPARK-33884.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 07:09:11 +00:00
Max Gekk 379afcd2ce [SPARK-33924][SQL][TESTS] Preserve partition metadata by INSERT INTO in v2 table catalog
### What changes were proposed in this pull request?
For `InMemoryPartitionTable` used in tests, set empty partition metadata only when a partition doesn't exist.

### Why are the changes needed?
This bug fix is needed to use `INSERT INTO .. PARTITION` in other tests.

### Does this PR introduce _any_ user-facing change?
No. It affects only the v2 table catalog used in tests.

### How was this patch tested?
Added new UT to `DataSourceV2SQLSuite`, and run the affected test suite by:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.connector.DataSourceV2SQLSuite"
```

Closes #30952 from MaxGekk/fix-insert-into-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 06:49:26 +00:00
HyukjinKwon b33fa53385 [SPARK-33925][CORE] Remove unused SecurityManager in Utils.fetchFile
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033.
The first and last usages of that `SecurityManager` argument were removed in https://github.com/apache/spark/pull/24033.
After that, we don't need to pass `SecurityManager` anymore in `Utils.fetchFile` and related code paths.

This PR proposes to remove it.

### Why are the changes needed?

For better readability of the code.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually compiled. GitHub Actions and the Jenkins build should test it out as well.

Closes #30945 from HyukjinKwon/SPARK-33925.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:58:42 -08:00