Commit graph

28721 commits

Author SHA1 Message Date
Ruifeng Zheng ebd8b9357a [SPARK-33609][ML] word2vec reduce broadcast size
### What changes were proposed in this pull request?
1. Directly use float vectors instead of converting to double vectors; this is about 2x faster than using `vec.axpy`.
2. Mark `wordList` and `wordVecNorms` lazy.
3. Avoid slicing in the computation of `wordVecNorms` (see the sketch below).
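
A rough, hedged sketch of points 2 and 3; the class and field names below are simplified assumptions, not the actual MLlib code:

```scala
// All vectors live in one flat float array; norms are derived lazily and
// without materializing per-word slices.
class Word2VecVectors(flat: Array[Float], vectorSize: Int) {
  lazy val wordVecNorms: Array[Float] = {
    val numWords = flat.length / vectorSize
    Array.tabulate(numWords) { w =>
      var sum = 0.0f
      var i = w * vectorSize
      val end = i + vectorSize
      while (i < end) { sum += flat(i) * flat(i); i += 1 }
      math.sqrt(sum).toFloat
    }
  }
}
```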

### Why are the changes needed?
Halves the broadcast size: float vectors take half the space of double vectors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #30548 from zhengruifeng/w2v_float32_transform.

Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-08 11:04:29 +08:00
Dongjoon Hyun b2a79306ef
[SPARK-33680][SQL][TESTS][FOLLOWUP] Fix more test suites to have explicit confs
### What changes were proposed in this pull request?

This is a follow-up for SPARK-33680 to remove the assumption on the default value of `spark.sql.adaptive.enabled`.
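
A minimal sketch of the pattern, assuming the standard `withSQLConf` test helper (the test body is illustrative):

```scala
import org.apache.spark.sql.internal.SQLConf

// Pin the conf explicitly so the assertion no longer depends on its default.
Seq("true", "false").foreach { aqe =>
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> aqe) {
    // run the query and check the plan/result under both settings
  }
}
```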

### Why are the changes needed?

According to the test result https://github.com/apache/spark/pull/30628#issuecomment-739866168, the [previous run](https://github.com/apache/spark/pull/30628#issuecomment-739641105) didn't run all tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30655 from dongjoon-hyun/SPARK-33680.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 18:59:15 -08:00
Fokko Driesprong e4d1c10760 [SPARK-32320][PYSPARK] Remove mutable default arguments
Using mutable default arguments is bad practice and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/

```
fokkodriesprongFan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
```

```
fokkodriesprongFan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
```

### What changes were proposed in this pull request?

Removing the mutable default arguments.

### Why are the changes needed?

Mutable default arguments are evaluated once and shared across calls, so they can leak state between calls. The fix removes them and changes the signatures to `Optional[...]`.

### Does this PR introduce _any_ user-facing change?

No 👍

### How was this patch tested?

Using the Flake8 bugbear code analysis plugin.

Closes #29122 from Fokko/SPARK-32320.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-08 09:35:36 +08:00
Anton Okolnychyi 02508b68ec
[SPARK-33621][SQL] Add a way to inject data source rewrite rules
### What changes were proposed in this pull request?

This PR adds a way to inject data source rewrite rules.

### Why are the changes needed?

Right now `SparkSessionExtensions` allows us to inject optimization rules, but they are added to the operator optimization batch. There are cases when users need to run rules after the operator optimization batch (e.g. when a rule relies on the fact that expressions have been optimized). Currently, this is not possible.
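
A hedged sketch of what such an extension could look like; the method name `injectDataSourceRewriteRule` is an assumption based on the PR title, not a confirmed API:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A rule meant to run after the operator optimization batch (no-op body here).
case object MyRewriteRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // Method name assumed from the PR title; the released API may differ.
    ext.injectDataSourceRewriteRule(_ => MyRewriteRule)
  }
}
```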

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

This PR comes with a new test.

Closes #30577 from aokolnychyi/spark-33621-v3.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 15:32:10 -08:00
Wenchen Fan c0874ba9f1
[SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of char/varchar table insertion length check, to not expose user data.

### Why are the changes needed?

It is risky to expose user data in error messages, especially string data, as it may contain sensitive information.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #30653 from cloud-fan/minor2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 13:35:37 -08:00
Wenchen Fan 6aff215077 [SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config, which only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one completely covers the other.

### Does this PR introduce _any_ user-facing change?

No, it's a deprecation, not a removal.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 10:50:31 -08:00
Josh Soref c62b84a043 [MINOR] Spelling sql not core
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `sql/catalyst`
* `sql/hive-thriftserver`
* `sql/hive`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make the content harder to read and understand.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30532 from jsoref/spelling-sql-not-core.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-07 08:40:29 -06:00
Kent Yao da72b87374 [SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results
### What changes were proposed in this pull request?

In this PR, we propose to narrow the use cases of the char/varchar data types by invalidating the uses that produce incorrect results now or may do so later.

### Why are the changes needed?
1. udf
```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))

scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
  at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
  ... 47 elided
```

2. spark.createDataFrame

```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
|                   c|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
|                    |
|<https://spark.ap...|
|                    |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
|                    |
|                    |
```

3. reader.schema

```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
|                   a|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```
4. etc

### Does this PR introduce _any_ user-facing change?

No, we intend to avoid potential breaking changes.

### How was this patch tested?

new tests

Closes #30586 from yaooqinn/SPARK-33641.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 13:40:15 +00:00
Linhong Liu d730b6bdaa [SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query
### What changes were proposed in this pull request?
The analyzer rule `PreprocessTableCreation` preprocesses table-creation-related logical plans. But for CTAS, if the sub-query can't be resolved, preprocessing it will cause "Invalid call to toAttribute on unresolved object" instead of a user-friendly error message ("table or view not found").
This PR fixes this wrong preprocessing for CTAS using the V2 catalog.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
The error message for CTAS with a non-existent table changed from:
`UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to
`AnalysisException: Table or view not found: xxx`

### How was this patch tested?
added test

Closes #30637 from linhongliu-db/fix-ctas.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 13:25:43 +00:00
Yuming Wang 1e0c006748 [SPARK-33617][SQL] Add default parallelism configuration for Spark SQL queries
### What changes were proposed in this pull request?

This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`.

### Why are the changes needed?

Avoid generating small files for INSERT INTO TABLE from VALUES, for example:
```sql
CREATE TABLE t1(id int) USING parquet;
INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8);
```

Before this pr:
```
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:54 _SUCCESS
```

After this pr and set `spark.sql.files.minPartitionNum` to 1:
```
-rw-r--r-- 1 root root 452 Dec  1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:59 _SUCCESS
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30559 from wangyum/SPARK-33617.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 21:36:52 +09:00
Max Gekk 26c0493318 [SPARK-33676][SQL] Require exact matching of partition spec to the schema in V2 ALTER TABLE .. ADD/DROP PARTITION
### What changes were proposed in this pull request?
Check that the partition specs passed to V2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match the partition schema (all partition fields from the schema are specified in the partition spec).

### Why are the changes needed?
1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error:
```sql
spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b);
spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9');
Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`';
```
2. To prevent future errors caused by not fully specified partition specs.

### Does this PR introduce _any_ user-facing change?
Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` outputs the same error as the V1 commands.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes #30624 from MaxGekk/add-partition-full-spec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 08:14:36 +00:00
Max Gekk 87c056088e
[SPARK-33671][SQL] Remove VIEW checks from V1 table commands
### What changes were proposed in this pull request?
Remove VIEW checks from the following V1 commands:
- `SHOW PARTITIONS`
- `TRUNCATE TABLE`
- `LOAD DATA`

The checks are performed earlier at:
acc211d2cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L885-L889)

### Why are the changes needed?
To improve code maintainability and remove dead code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites like `v1/ShowPartitionsSuite`.

1. LOAD DATA:
acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L176-L179)
2. TRUNCATE TABLE:
acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L180-L183)
3. SHOW PARTITIONS:
- v1/ShowPartitionsSuite

Closes #30620 from MaxGekk/show-table-check-view.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 23:22:52 -08:00
Kousuke Saruta d48ef34911
[SPARK-33684][BUILD] Upgrade httpclient from 4.5.6 to 4.5.13
### What changes were proposed in this pull request?

This PR upgrades `commons.httpclient` from `4.5.6` to `4.5.13`.
4.5.6 was released over 2 years ago, and the more stable `4.5.13` is now available.
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt

### Why are the changes needed?

To follow the more stable release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by the existing tests.

Closes #30634 from sarutak/upgrade-httpclient.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 23:02:36 -08:00
Dongjoon Hyun 73412ffb3a
[SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite not to depend on the default conf
### What changes were proposed in this pull request?

This PR updates `PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite` to set the required confs explicitly.

### Why are the changes needed?

Unit tests should not depend on the default configurations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

According to https://github.com/apache/spark/pull/30628, these seem to be the only ones.

Pass the CIs.

Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 19:34:54 -08:00
Kousuke Saruta e88f0d4a24
[SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build in GitHub Actions
### What changes were proposed in this pull request?

This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job.

In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows.
```
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile
```

However, the Scala 2.13 build uses Java 8 rather than 11, so let's remove `-Djava.version=11`.

### Why are the changes needed?

To build with a consistent configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions' workflow.

Closes #30633 from sarutak/scala-213-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 17:57:19 -08:00
Max Gekk 29096a8869 [SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED
### What changes were proposed in this pull request?
Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from the V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.

This PR is a kind of follow-up of https://github.com/apache/spark/pull/16373 and https://github.com/apache/spark/pull/15515.

### Why are the changes needed?
To output a user-friendly error with a recommendation like
**"... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`"**
instead of silently returning an empty result.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the affected test suites, in particular:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
```

Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 10:21:04 +09:00
Dongjoon Hyun e32de29bce [SPARK-33675][INFRA] Add GitHub Action job to publish snapshot
### What changes were proposed in this pull request?

This PR aims to add `GitHub Action` job to publish daily snapshot for **master** branch.
- https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.2.0-SNAPSHOT/

For the other branches, I'll make adjusted backports.
- For `branch-3.1`, we can specify the checkout `ref` to `branch-3.1`.
- For `branch-2.4` and `branch-3.0`, we can publish at every commit since the traffic is low.
  - https://github.com/apache/spark/pull/30630 (branch-3.0)
  - https://github.com/apache/spark/pull/30629 (branch-2.4 LTS)

### Why are the changes needed?

After this series of jobs, this will permanently reduce our maintenance burden on AmpLab Jenkins by removing the following completely.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/

For now, AmpLab Jenkins doesn't have a job for `branch-3.1`. We can do it by ourselves by `GitHub Action`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Snapshot publishing was tested here via the PR trigger. Since this PR adds a scheduled job, we cannot fully test it within this PR.
- https://github.com/dongjoon-hyun/spark/runs/1505792859

Apache Infra team finished the setup here.
- https://issues.apache.org/jira/browse/INFRA-21167

Closes #30623 from dongjoon-hyun/SPARK-33675.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 10:05:28 +09:00
Terry Kim 119539fd49 [SPARK-33663][SQL] Uncaching should not be called on non-existing temp views
### What changes were proposed in this pull request?

This PR proposes to fix a misleading log in the following scenario, where uncaching is called on a non-existing view:
```
scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.table("table")
df: org.apache.spark.sql.DataFrame = [2: int]

scala> df.createOrReplaceTempView("t2")
20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name
org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
'UnresolvedRelation [t2], [], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
	at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
	at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
	at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
	at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
	at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
```
Since `t2` does not exist yet, it shouldn't try to uncache.

### Why are the changes needed?

To fix misleading message.

### Does this PR introduce _any_ user-facing change?

Yes, the above message will not be displayed if the view doesn't exist yet.

### How was this patch tested?

Manually tested, since this only changes a printed log message.

Closes #30608 from imback82/fix_cache_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 09:48:16 +09:00
Xiao Li b94ecf0734 [SPARK-33674][TEST] Show Slowpoke notifications in SBT tests
### What changes were proposed in this pull request?
This PR is to show Slowpoke notifications in the log when running tests using SBT.

For example, the test case "zero sized blocks" in ExternalShuffleServiceSuite enters an infinite loop. After this change, the log file will have a notification message every 5 minutes when a test case runs longer than two minutes. Below is an example message.

```
[info] ExternalShuffleServiceSuite:
[info] - groupByKey without compression (101 milliseconds)
[info] - shuffle non-zero block size (3 seconds, 186 milliseconds)
[info] - shuffle serializer (3 seconds, 189 milliseconds)
[info] *** Test still running after 2 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 7 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 12 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 17 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
```
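
For reference, this kind of notification comes from ScalaTest's Slowpoke detection (`-W <delay> <period>`, in seconds); here is a sketch of an SBT setting matching the timings above, with the exact placement in the build an assumption:

```scala
// Warn after 120 seconds, then repeat the notification every 300 seconds.
Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-W", "120", "300")
```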

### Why are the changes needed?
When the tests/code have a bug and enter an infinite loop, it is hard to tell from the log which test cases hit issues, especially when we are running the tests in parallel. It would be nice to show the Slowpoke notifications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing in my local dev environment.

Closes #30621 from gatorsmile/addSlowpoke.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-12-06 22:36:34 +08:00
Max Gekk 48297818f3
[SPARK-33667][SQL] Respect the spark.sql.caseSensitive config while resolving partition spec in v1 SHOW PARTITIONS
### What changes were proposed in this pull request?
Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t. the case-sensitivity flag **spark.sql.caseSensitive**.

### Why are the changes needed?
V1 SHOW PARTITIONS is, in fact, case-sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
         > USING parquet
         > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
```
The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
year=2015/month=1
```

### How was this patch tested?
By running the affected test suites:
- `v1/ShowPartitionsSuite`
- `v2/ShowPartitionsSuite`

Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 02:56:08 -08:00
HyukjinKwon 5250841537
[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style
### What changes were proposed in this pull request?

This PR adds a few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We completed the migration to NumPy documentation style in SPARK-32085.

Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work.

### Why are the changes needed?

To tell developers that PySpark now follows NumPy documentation style.

### Does this PR introduce _any_ user-facing change?

No, it's a change in unreleased branches yet.

### How was this patch tested?

Manually tested via `make clean html` under `python/docs`:

![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)

Closes #30622 from HyukjinKwon/SPARK-33256.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:22:24 -08:00
Chao Sun e857e06452
[SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache
### What changes were proposed in this pull request?

This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.

### Why are the changes needed?

Currently, DSv2 delete from table doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later.

### Does this PR introduce _any_ user-facing change?

Yes. Now deleting from a v2 table also refreshes the cache.

### How was this patch tested?

Added a test case.

Closes #30597 from sunchao/SPARK-33652.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:14:22 -08:00
Prashant Sharma 6317ba29a1
[SPARK-33668][K8S][TEST] Fix flaky test "Verify logging configuration is picked from the provided
### What changes were proposed in this pull request?
Fix the flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties".
The test has been flaky, with multiple failed instances; the reason for the failure has been similar to:

```

The code passed to eventually never returned normally. Attempted 109 times over 3.0079882413999997 minutes. Last failure message: Failure executing: GET at:
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false. Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).. (KubernetesSuite.scala:402)

```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
From the above failures, it seems that the executor finishes too quickly and is removed by Spark before the test can complete. So, in order to mitigate this situation, one way is to turn on the flag
`spark.kubernetes.executor.deleteOnTermination`.

### Why are the changes needed?

Fixes a flaky test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.
A few runs of the Jenkins integration test may reveal whether the problem is resolved.

Closes #30616 from ScrapCodes/SPARK-33668/fix-flaky-k8s-integration-test.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-05 23:04:55 -08:00
Terry Kim 154f604403 [MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataConsumer.scala
### What changes were proposed in this pull request?

This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.

### Why are the changes needed?

To fix a string interpolation bug.
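
The bug class is easy to illustrate (this is a generic example, not the actual fixed lines):

```scala
val name = "t2"
// Missing the 's' prefix: the literal text "$name" ends up in the message.
println("Exception when attempting to uncache $name")
// With the interpolator, the variable is substituted as intended.
println(s"Exception when attempting to uncache $name")
```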

### Does this PR introduce _any_ user-facing change?

Yes, the string will be correctly constructed.

### How was this patch tested?

Existing tests since they were used in exception/log messages.

Closes #30609 from imback82/fix_cache_str_interporlation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-06 12:03:14 +09:00
Wenchen Fan 1b4e35d1a8
[SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data source tables
### What changes were proposed in this pull request?

This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it.

### Why are the changes needed?

It's an unnecessary behavior difference that a Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while data source tables don't allow `CREATE EXTERNAL TABLE` at all.

### Does this PR introduce _any_ user-facing change?

Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed.
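
A sketch of a now-accepted statement (schema and path are illustrative):

```scala
spark.sql("""
  CREATE EXTERNAL TABLE events (id INT, ts TIMESTAMP)
  USING parquet
  LOCATION '/tmp/events'
""")
```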

### How was this patch tested?

new tests

Closes #30595 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 16:48:31 -08:00
allisonwang-db 960d6af75d
[SPARK-33472][SQL][FOLLOW-UP] Update RemoveRedundantSorts comment
### What changes were proposed in this pull request?
This PR is a follow-up for #30373 that updates the comment for RemoveRedundantSorts in QueryExecution.

### Why are the changes needed?
To update an incorrect comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30584 from allisonwang-db/spark-33472-followup.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 15:15:19 -08:00
Dongjoon Hyun b6b45bc695
[SPARK-33141][SQL][FOLLOW-UP] Fix Scala 2.13 compilation
### What changes were proposed in this pull request?

This PR aims to fix Scala 2.13 compilation.

### Why are the changes needed?

To recover Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GitHub Action Scala 2.13 build job.

Closes #30611 from dongjoon-hyun/SPARK-33141.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 15:04:18 -08:00
Dongjoon Hyun de9818f043
[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
german d671e053e9 [SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation
### What changes were proposed in this pull request?

Update the Kafka headers documentation: the type is no longer a map but an array.

[jira](https://issues.apache.org/jira/browse/SPARK-33660)
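
A hedged sketch of reading the headers with the array type (options are illustrative):

```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
// headers: array<struct<key: string, value: binary>>, so duplicate header
// keys are preserved, unlike with a map.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
```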

### Why are the changes needed?
To help users

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

It is only documentation

Closes #30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.

Authored-by: german <germanschiavon@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-05 06:51:54 +09:00
Wenchen Fan acc211d2cf [SPARK-33141][SQL][FOLLOW-UP] Store the max nested view depth in AnalysisContext
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30289. It removes the hack in `View.effectiveSQLConf`, by putting the max nested view depth in `AnalysisContext`. Then we don't get the max nested view depth from the active SQLConf, which keeps changing during nested view resolution.

### Why are the changes needed?

remove hacks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

If I just remove the hack, `SimpleSQLViewSuite.restrict the nested level of a view` fails. With this fix, it passes again.

Closes #30575 from cloud-fan/view.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 14:01:15 +00:00
HyukjinKwon 990bee9c58 [SPARK-33615][K8S] Make 'spark.archives' work in Kubernetes
### What changes were proposed in this pull request?

This PR proposes to make the `spark.archives` configuration work in Kubernetes.
It works without a problem in a standalone cluster, but there seems to be a bug in Kubernetes:
it fails to fetch the file on the driver side, as below:

```
20/12/03 13:33:53 INFO SparkContext: Added JAR file:/tmp/spark-75004286-c83a-4369-b624-14c5d2d2a748/spark-examples_2.12-3.1.0-SNAPSHOT.jar at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar with timestamp 1607002432558
20/12/03 13:33:53 INFO SparkContext: Added archive file:///tmp/tmp4542734800151332666.txt.tar.gz#test_tar_gz at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz with timestamp 1607002432558
20/12/03 13:33:53 INFO TransportClientFactory: Successfully created connection to spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc/172.17.0.4:7078 after 83 ms (47 ms spent in bootstraps)
20/12/03 13:33:53 INFO Utils: Fetching spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz to /tmp/spark-66573e24-27a3-427c-99f4-36f06d9e9cd5/fetchFileTemp2665785666227461849.tmp
20/12/03 13:33:53 ERROR SparkContext: Error initializing SparkContext.
java.lang.RuntimeException: Stream '/files/tmp4542734800151332666.txt.tar.gz' was not found.
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:242)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
```

This is because `spark.archives` was not actually added on the driver side correctly. The changes here fix it by adding and resolving URIs correctly.

### Why are the changes needed?

The `spark.archives` feature can be leveraged for many things such as Conda support. We should make it work in Kubernetes as well.
This is a bug fix too.

### Does this PR introduce _any_ user-facing change?

No, this feature is not out yet.

### How was this patch tested?

I manually tested with Minikube 1.15.1. Due to a (possible) environment issue, I had to use a custom namespace, service account and roles. The `default` service account does not work for me and complains it doesn't have permissions to get/list pods, etc.

```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.1.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
```

Closes #30581 from HyukjinKwon/SPARK-33615.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 19:37:03 +09:00
Jungtaek Lim (HeartSaVioR) 233a8494c8 [SPARK-27237][SS] Introduce State schema validation among query restart
## What changes were proposed in this pull request?

Please refer to the description of [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) for the rationale behind this patch.

This patch proposes to introduce state schema validation by storing the key schema and value schema to a `schema` file (on the first run) and verifying that the new key and value schemas for state are compatible with the existing ones. To be clear about the definition of "compatible": state schemas are "compatible" when the number of fields is the same and the data type of each field is the same (Spark has been allowing renames of fields).

This patch will prevent running a query that has an incompatible state schema, which reduces the chance of nondeterministic behavior (renaming a field can also be a smell of semantic incompatibility, but end users may simply rename it, so we can't tell) and provides a more informative error message.
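
A minimal sketch of that compatibility definition (top-level fields only; names and placement are assumptions, not the patch's actual code):

```scala
import org.apache.spark.sql.types.StructType

// "Compatible": same number of fields and same data type per field;
// field names may differ, since renames are allowed.
def isCompatible(oldSchema: StructType, newSchema: StructType): Boolean =
  oldSchema.length == newSchema.length &&
    oldSchema.fields.zip(newSchema.fields).forall {
      case (o, n) => o.dataType == n.dataType
    }
```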

## How was this patch tested?

Added UTs.

Closes #24173 from HeartSaVioR/SPARK-27237.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 19:33:11 +09:00
Kousuke Saruta 976e897039
[SPARK-33640][TESTS] Extend connection timeout to DB server for DB2IntegrationSuite and its variants
### What changes were proposed in this pull request?

This PR extends the connection timeout to the DB server for DB2IntegrationSuite and its variants.

The container image ibmcom/db2 creates a database when it starts up.
The database creation can take over 2 minutes.

DB2IntegrationSuite and its variants use this container image, but the connection timeout is set to 2 minutes, so these suites almost always fail.

### Why are the changes needed?

To pass those suites.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the suites pass with the following commands.
```
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2IntegrationSuite"
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.v2.DB2IntegrationSuite"
$ build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.DB2KrbIntegrationSuite"
```

Closes #30583 from sarutak/extend-timeout-for-db2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 00:12:04 -08:00
Kousuke Saruta 91baab77f7
[SPARK-33656][TESTS] Add option to keep container after tests finish for DockerJDBCIntegrationSuites for debug
### What changes were proposed in this pull request?

This PR adds an option to keep the container after DockerJDBCIntegrationSuites (e.g. DB2IntegrationSuite, PostgresIntegrationSuite) finish.
We can use this option by setting the system property `spark.test.docker.keepContainer` to `true`.

### Why are the changes needed?

If some error occurs during the tests, it is useful to keep the container for debugging.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that the container is kept after the test by the following commands.
```
# With sbt
$ build/sbt -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"

# With Maven
$ build/mvn -Dspark.test.docker.keepContainer=true -Pdocker-integration-tests -Phive -Phive-thriftserver -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite test

$ docker container ls
```

I also confirmed that there are no regression for all the subclasses of `DockerJDBCIntegrationSuite` with sbt/Maven.
* MariaDBKrbIntegrationSuite
* DB2KrbIntegrationSuite
* PostgresKrbIntegrationSuite
* MySQLIntegrationSuite
* PostgresIntegrationSuite
* DB2IntegrationSuite
* MsSqlServerIntegrationSuite
* OracleIntegrationSuite
* v2.MySQLIntegrationSuite
* v2.PostgresIntegrationSuite
* v2.DB2IntegrationSuite
* v2.MsSqlServerIntegrationSuite
* v2.OracleIntegrationSuite

NOTE: `DB2IntegrationSuite`, `v2.DB2IntegrationSuite` and `DB2KrbIntegrationSuite` can fail due to the too-short connection timeout. It's a separate issue and I'll fix it in #30583.

Closes #30601 from sarutak/keepContainer.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 23:47:43 -08:00
Yuanjian Li 325abf7957 [SPARK-33577][SS] Add support for V1Table in stream writer table API and create table if not exist by default
### What changes were proposed in this pull request?
After SPARK-32896, we have a table API for the stream writer, but it only supports DataSource v2 tables. Here we add the following enhancements (see the sketch after the list):

- Create non-existing tables by default
- Support both managed and external V1Tables
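
A sketch of the resulting usage under these enhancements (source, paths and table name are illustrative):

```scala
// `df` is any streaming DataFrame, e.g.:
val df = spark.readStream.format("rate").load()

// toTable now creates the table if it does not exist, and the target can be
// a V1 (file-provider-based) table as well as a DSv2 table.
val query = df.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt") // illustrative path
  .toTable("events")
```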

### Why are the changes needed?
Make the API cover more use cases, especially for file-provider-based tables.

### Does this PR introduce _any_ user-facing change?
Yes, new features added.

### How was this patch tested?
Add new UTs.

Closes #30521 from xuanyuanking/SPARK-33577.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-04 16:45:55 +09:00
Max Gekk 94c144bdd0 [SPARK-33571][SQL][DOCS] Add a ref to INT96 config from the doc for spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read
### What changes were proposed in this pull request?
For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
1. Explicitly documenting which Parquet types those configs influence
2. Referring to the corresponding configs for `INT96`

### Why are the changes needed?
To avoid user confusion like that reported in SPARK-33571, and to make the config descriptions more precise.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30596 from MaxGekk/clarify-rebase-docs.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:26:07 +09:00
Gengliang Wang e8380665c7 [SPARK-33658][SQL] Suggest using Datetime conversion functions for invalid ANSI casting
### What changes were proposed in this pull request?

Suggest that users use the datetime conversion functions in the error message for invalid ANSI explicit casts.

### Why are the changes needed?

In ANSI mode, explicit casts between DateTime types and Numeric types are not allowed.
As of now, we have introduced the new functions `UNIX_SECONDS`/`UNIX_MILLIS`/`UNIX_MICROS`/`UNIX_DATE`/`DATE_FROM_UNIX_DATE`, so we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode.
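
For example (values are illustrative), the suggested functions make the intent explicit where an ANSI cast would be rejected:

```scala
spark.sql("SET spark.sql.ansi.enabled=true")
// CAST(ts AS BIGINT) fails in ANSI mode; use the conversion functions instead.
spark.sql("SELECT UNIX_SECONDS(TIMESTAMP('1970-01-01 00:00:01Z'))").show() // 1
spark.sql("SELECT DATE_FROM_UNIX_DATE(1)").show() // 1970-01-02
```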

### Does this PR introduce _any_ user-facing change?

Yes, better error messages

### How was this patch tested?

Unit test

Closes #30603 from gengliangwang/improveErrorMsgOfExplicitCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:24:41 +09:00
Huaxin Gao 15579ba1f8 [SPARK-33430][SQL] Support namespaces in JDBC v2 Table Catalog
### What changes were proposed in this pull request?
Add namespace support in the JDBC v2 Table Catalog by making `JDBCTableCatalog` extend `SupportsNamespaces`.

### Why are the changes needed?
To make the v2 JDBC implementation more complete.

### Does this PR introduce _any_ user-facing change?
Yes. This adds the following to `JDBCTableCatalog` (a usage sketch follows the list):

- `listNamespaces`
- `listNamespaces(String[] namespace)`
- `namespaceExists(String[] namespace)`
- `loadNamespaceMetadata(String[] namespace)`
- `createNamespace`
- `alterNamespace`
- `dropNamespace`
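
A usage sketch, assuming a JDBC catalog named `h2` has been configured via `spark.sql.catalog.h2` (all names are illustrative):

```scala
// Namespace DDL is routed to the catalog's SupportsNamespaces methods.
spark.sql("CREATE NAMESPACE h2.db1")
spark.sql("SHOW NAMESPACES IN h2").show()
spark.sql("DROP NAMESPACE h2.db1")
```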

### How was this patch tested?
Add new docker tests

Closes #30473 from huaxingao/name_space.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 07:23:35 +00:00
Linhong Liu e02324f2dd [SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view
### What changes were proposed in this pull request?
Currently in Spark, a temp view is saved as its analyzed logical plan, while a permanent view is kept in the HMS with its original SQL text. As a result, permanent and temporary views behave differently in some cases. In this PR we store the SQL text for temporary views in order to unify the behavior between permanent and temporary views.

### Why are the changes needed?
To unify the behavior between permanent and temporary views.

### Does this PR introduce _any_ user-facing change?
Yes. With this PR, a temporary view will be re-analyzed when it's referenced, so if the
underlying data source changes, the view will also be updated.
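
A hedged illustration of that behavior change (names are illustrative, not a test from this PR):

```scala
spark.sql("CREATE TABLE src (id INT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW v AS SELECT * FROM src")
spark.sql("ALTER TABLE src ADD COLUMNS (name STRING)")
// With the SQL text stored, `v` is re-analyzed on reference and picks up the
// new column; a stored analyzed plan would keep the old schema.
spark.sql("SELECT * FROM v").printSchema()
```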

### How was this patch tested?
existing and newly added test cases

Closes #30567 from linhongliu-db/SPARK-33142.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 06:48:49 +00:00
Huaxin Gao e22ddb6740 [SPARK-32405][SQL][FOLLOWUP] Remove USING _ in CREATE TABLE in JDBCTableCatalog docker tests
### What changes were proposed in this pull request?
Remove `USING _` in CREATE TABLE in the JDBCTableCatalog docker tests.

### Why are the changes needed?
Previously, the CREATE TABLE syntax forced users to specify a provider, so we had to add a `USING _`. Now that the problem is fixed, we need to remove it.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #30599 from huaxingao/remove_USING.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 05:43:05 +00:00
Gengliang Wang 29e415deac [SPARK-33649][SQL][DOC] Improve the doc of spark.sql.ansi.enabled
### What changes were proposed in this pull request?

Improve the documentation of SQL configuration `spark.sql.ansi.enabled`

### Why are the changes needed?

As there are more and more new features under the SQL configuration `spark.sql.ansi.enabled`, we should make it clearer about:
1. What exactly it is
2. Where users can find all the features of the ANSI mode
3. Whether all the features come exactly from the SQL standard

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

It's just doc change.

Closes #30593 from gengliangwang/reviseAnsiDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-12-04 10:58:41 +08:00
Max Gekk 85949588b7 [SPARK-33650][SQL] Fix the error from ALTER TABLE .. ADD/DROP PARTITION for non-supported partition management table
### What changes were proposed in this pull request?
In the PR, I propose to change the order of post-analysis checks for the `ALTER TABLE .. ADD/DROP PARTITION` command, and perform the general check (does the table support partition management at all) before specific checks.

### Why are the changes needed?
The error message for a table which doesn't support partition management can mislead users:
```java
PartitionSpecs are not resolved;;
'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
+- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable5d3ff859
```
because it says nothing about the root cause of the issue.

### Does this PR introduce _any_ user-facing change?
Yes. After the change, the error message will be:
```
Table ns1.ns2.tbl can not alter partitions
```

### How was this patch tested?
By running the affected test suite `AlterTablePartitionV2SQLSuite`.

Closes #30594 from MaxGekk/check-order-AlterTablePartition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 16:43:15 -08:00
Weichen Xu 7e759b2d95 [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator
### What changes were proposed in this pull request?
Make CrossValidator/TrainValidationSplit/OneVsRest Reader/Writer support Python-backend estimators/models.

### Why are the changes needed?
Currently, PySpark supports third-party libraries defining Python-backend estimators/evaluators, i.e., estimators that inherit `Estimator` instead of `JavaEstimator`, which can only be used in PySpark.

CrossValidator and TrainValidationSplit support tuning these Python-backend estimators,
but cannot support saving/loading, because the CrossValidator and TrainValidationSplit writer implementation uses JavaMLWriter, which requires converting the nested estimator and evaluator into Java instances.

OneVsRest saving/loading currently only supports Java-backend classifiers due to a similar issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30471 from WeichenXu123/support_pyio_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-12-04 08:35:50 +08:00
Wenchen Fan 63f9d474b9
[SPARK-33634][SQL][TESTS] Use Analyzer in PlanResolutionSuite
### What changes were proposed in this pull request?

Instead of using several analyzer rules, this PR uses the actual analyzer to run tests in `PlanResolutionSuite`.

### Why are the changes needed?

Make the test suite match reality.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

test-only

Closes #30574 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 09:22:53 -08:00
Anton Okolnychyi aa13e207c9
[SPARK-33623][SQL] Add canDeleteWhere to SupportsDelete
### What changes were proposed in this pull request?

This PR provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time.

### Why are the changes needed?

The only way to support delete statements right now is to implement `SupportsDelete`. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. deleting a complete partition of a Hive table).

This PR actually provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time instead of just getting an exception during execution. In the future, we can use this functionality to decide whether Spark should rewrite this delete and execute a distributed query or it can just pass a set of filters.

Consider an example of a partitioned Hive table. If we have a delete predicate like `part_col = '2020'`, we can just drop the matching partition to satisfy this delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this as a delete without significant effort. At the same time, if we have a delete predicate like `id = 10`, Hive tables would not be able to execute this delete using a metadata only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere` and we should use a more sophisticated row-level API to find out which records should be removed (the API is yet to be discussed, but we need this PR as a basis).
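
A hedged sketch of a source following that contract; the minimal `Table` plumbing is included so it compiles, and the details are assumptions rather than this PR's test code:

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{SupportsDelete, TableCapability}
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.types.StructType

class PartitionedTable extends SupportsDelete {
  override def name(): String = "partitioned_table"
  override def schema(): StructType = new StructType().add("part_col", "string")
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.emptySet[TableCapability]()

  // Accept only metadata-only deletes on the partition column.
  override def canDeleteWhere(filters: Array[Filter]): Boolean =
    filters.forall {
      case EqualTo("part_col", _) => true // can drop the whole partition
      case _ => false                     // would require rewriting files
    }

  override def deleteWhere(filters: Array[Filter]): Unit = {
    // drop the matching partitions here
  }
}
```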

If we decide to support subqueries and all delete use cases by simply extending the existing API, this will mean all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way, as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that Spark will execute a plan to find which records must be deleted for data sources that return `false` from `canDeleteWhere`.

### Does this PR introduce _any_ user-facing change?

Yes but it is backward compatible.

### How was this patch tested?

This PR comes with a new test.

Closes #30562 from aokolnychyi/spark-33623.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 09:12:30 -08:00
Gabor Somogyi bd711863fd [SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on driver side
### What changes were proposed in this pull request?
`spark.buffer.size` was not applied on the driver side in PySpark. In this PR I've fixed this issue.

### Why are the changes needed?
To apply the mentioned config on the driver side.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests + manually.

Added the following code temporarily:
```
def local_connect_and_auth(port, auth_secret):
...
            sock.connect(sa)
            print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition
            sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
...
```

Test:
```
#Compile Spark

echo "spark.buffer.size 10000" >> conf/spark-defaults.conf

$ ./bin/pyspark
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
Spark context Web UI available at http://192.168.0.189:4040
Spark context available as 'sc' (master = local[*], app id = local-1606999094506).
SparkSession available as 'spark'.
>>> sc.setLogLevel("TRACE")
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
...
SPARK_BUFFER_SIZE: 10000
...
[[0], [2], [3], [4], [6]]
>>>
```

Closes #30592 from gaborgsomogyi/SPARK-33629.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 01:37:44 +09:00
Wenchen Fan 0706e64c49 [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command
### What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false.
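
A hedged sketch of the switch described above (whether the legacy conf can be toggled per session is an assumption):

```scala
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
spark.sql("CREATE TABLE t1 (id INT)") // native Parquet table

spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
spark.sql("CREATE TABLE t2 (id INT)") // Hive serde table, as before
```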

This is a retry after we unified the CREATE TABLE syntax. It partially reverts d2bec5e265.

This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which was an unnecessary behavior difference from Hive tables.

### Why are the changes needed?

Changing from Hive text table to native Parquet table has many benefits:
1. be consistent with `DataFrameWriter.saveAsTable`.
2. better performance
3. better support for nested types (Hive text table doesn't work well with nested types, e.g. `insert into t values struct(null)` actually inserts a null value not `struct(null)` if `t` is a Hive text table, which leads to wrong result)
4. better interoperability as Parquet is a more popular open file format.

### Does this PR introduce _any_ user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that works for hive tables also works for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE | SERDEPROPERTIES]` and `LOAD DATA`.

char/varchar behavior has been taken care by https://github.com/apache/spark/pull/30412, and there is no behavior difference between data source and hive tables.

One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough.

Another potential issue is users may use Spark to create the table and then use Hive to add partitions with different serde. This is not allowed for Spark native tables.

### How was this patch tested?

Re-enable the tests

Closes #30554 from cloud-fan/create-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 15:24:44 +00:00
luluorta 512fb32b38 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case of codegen when casting float to Integer
### What changes were proposed in this pull request?
This is a followup of [#27151](https://github.com/apache/spark/pull/27151). It fixes the same issue for the codegen path.

### Why are the changes needed?
Without the fix, the result can be corrupted.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test.

Closes #30585 from luluorta/SPARK-26218.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:58:56 +00:00
Gengliang Wang ff13f574e6 [SPARK-20044][SQL] Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
### What changes were proposed in this pull request?

Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between Date type and Numeric types.

### Why are the changes needed?

1. Explicit conversion between Date type and Numeric types is disallowed in ANSI mode. We need to provide new functions for users to complete the conversion.

2. We have introduced new functions from BigQuery for conversion between Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense to add functions for conversion between Date type and Numeric types as well, as shown below.
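
A quick sanity check of the two new functions (epoch day 0 is 1970-01-01):

```scala
spark.sql("SELECT DATE_FROM_UNIX_DATE(1)").show()      // 1970-01-02
spark.sql("SELECT UNIX_DATE(DATE'1970-01-02')").show() // 1
```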

### Does this PR introduce _any_ user-facing change?

Yes, two new datetime functions are added.

### How was this patch tested?

Unit tests

Closes #30588 from gengliangwang/dateToNumber.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:04:08 +00:00
Liang-Chi Hsieh 3b2ff16ee6 [SPARK-33636][PYTHON][ML][FOLLOWUP] Update since tag of labelsArray in StringIndexer
### What changes were proposed in this pull request?

This is to update `labelsArray`'s since tag.

### Why are the changes needed?

The original change was backported to branch-3.0 for the 3.0.2 release, so it is better to update the since tag to reflect that.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A. Just tag change.

Closes #30582 from viirya/SPARK-33636-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 14:34:44 +09:00