### What changes were proposed in this pull request?
While building the R Docker image, if we can't fetch the key from gnupg.net, fall back to openpgp.org.
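The fallback can be pictured as trying an ordered list of key servers until one returns the key. The following is only a minimal Python sketch of that idea: `fetch_key` and the `fetch` callable are hypothetical stand-ins for the actual key-import command in the Docker build, which is not reproduced here.

```python
from typing import Callable, Sequence


def fetch_key(key_id: str,
              servers: Sequence[str],
              fetch: Callable[[str, str], str]) -> str:
    """Try each key server in order and return the first key fetched.

    `fetch(server, key_id)` is a hypothetical callable that returns the key
    or raises OSError when the server cannot be resolved or has no key.
    """
    errors = []
    for server in servers:
        try:
            return fetch(server, key_id)
        except OSError as exc:
            # Remember the failure and move on to the next server.
            errors.append(f"{server}: {exc}")
    raise OSError("all key servers failed: " + "; ".join(errors))
```

Passing `["gnupg.net", "openpgp.org"]` as `servers` keeps gnupg.net as the first choice and only uses openpgp.org when the first attempt fails.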
### Why are the changes needed?
gnupg.net key servers are flaky and sometimes fail to resolve or return keys.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tried to add key on my desktop, it failed, then tried to add key with openpgp.org and it succeeded.
Closes #30696 from holdenk/SPARK-33727-gnupg-server-is-flaky.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This upgrades snappy-java to 1.1.8.2.
### Why are the changes needed?
Minor version upgrade that includes:
- [Fixed](https://github.com/xerial/snappy-java/pull/265) an initialization issue when using a recent Mac OS X version
- Support Apple Silicon (M1, Mac-aarch64)
- Fixed the pure-java Snappy fallback logic when no native library for your platform is found.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30690 from viirya/upgrade-snappy.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.
### Why are the changes needed?
This change allows Spark to optimize delete conditions like we optimize filters.
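Why the rewrite is safe for a delete condition can be shown with a toy model: a row is deleted (or kept by a filter) only when the predicate evaluates to TRUE, so a NULL that is combined only through AND/OR behaves exactly like FALSE. The sketch below uses a hypothetical tuple representation of expressions, not Catalyst's classes.

```python
from typing import Any

NULL = None  # stands in for a SQL NULL literal


def replace_null_with_false(expr: Any) -> Any:
    """Replace NULL literals with False where this cannot change which rows match.

    Expressions are toy tuples such as ("and", l, r) or ("or", l, r); anything
    else is a leaf. Only AND/OR are traversed: under NOT or a comparison,
    NULL and FALSE behave differently, so those subtrees are left alone.
    """
    if expr is NULL:
        return False
    if isinstance(expr, tuple) and expr[0] in ("and", "or"):
        op, left, right = expr
        return (op, replace_null_with_false(left), replace_null_with_false(right))
    return expr
```

For instance, `("or", "a > 1", NULL)` becomes `("or", "a > 1", False)`, while `("not", NULL)` is returned unchanged.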
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This PR extends the existing test cases to also cover `DeleteFromTable`.
Closes #30688 from aokolnychyi/spark-33722.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.
### Why are the changes needed?
To make it easier to maintain and read.
### Does this PR introduce _any_ user-facing change?
No. This is rather a code cleanup.
### How was this patch tested?
Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.
Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance
### Why are the changes needed?
Users can know that these functions throw runtime exceptions under ANSI mode if the result is not valid.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build doc and check it in browser:
![image](https://user-images.githubusercontent.com/1097932/101608930-34a79e80-39bb-11eb-9294-9d9b8c3f6faa.png)
Closes #30683 from gengliangwang/improveDoc.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, when a client requests FETCH_PRIOR from Thriftserver, Thriftserver reiterates from the start position. Because Thriftserver caches a query result in an array when the THRIFTSERVER_INCREMENTAL_COLLECT feature is off, FETCH_PRIOR can be implemented without reiterating the result. A trait FetchIterator is added in order to separate the implementations for an iterator and an array. FetchIterator also supports moving the cursor to an absolute position, which will be useful for the implementation of FETCH_RELATIVE and FETCH_ABSOLUTE.
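To illustrate why an array-backed result makes FETCH_PRIOR cheap, here is a minimal Python sketch of a cursor over a materialized result. The class and method names are hypothetical and chosen for this illustration; they are not claimed to match the trait added by the PR.

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class ArrayFetchIterator(Generic[T]):
    """Cursor over an already-materialized result (hypothetical sketch)."""

    def __init__(self, rows: Sequence[T]) -> None:
        self._rows = rows
        self._fetch_start = 0  # position where the last fetch began
        self._position = 0     # position of the next row to return

    def fetch_next(self) -> None:
        """Start the next fetch where the previous one ended."""
        self._fetch_start = self._position

    def fetch_prior(self, offset: int) -> None:
        """Move back `offset` rows from the start of the previous fetch."""
        self.fetch_absolute(self._fetch_start - offset)

    def fetch_absolute(self, pos: int) -> None:
        """Jump to an absolute position, clamped to the valid range."""
        self._position = max(0, min(pos, len(self._rows)))
        self._fetch_start = self._position

    def take(self, n: int) -> List[T]:
        """Return up to `n` rows and advance the cursor."""
        end = min(self._position + n, len(self._rows))
        out = list(self._rows[self._position:end])
        self._position = end
        return out
```

Because the rows are indexable, moving backwards is only an index update; nothing has to be replayed from the start.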
### Why are the changes needed?
For better performance of Thriftserver.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
FetchIteratorSuite
Closes #30600 from Dooyoung-Hwang/refactor_with_fetch_iterator.
Authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change makes InterruptedIOException be treated as InterruptedException when closing YarnClientSchedulerBackend, so that no error is logged with "YARN application has exited unexpectedly xxx".
### Why are the changes needed?
In YarnClient mode, stopping YarnClientSchedulerBackend first tries to interrupt the YARN application monitor thread. MonitorThread.run() catches InterruptedException to respond gracefully to the stop request.
However, the client.monitorApplication method can also throw InterruptedIOException when a Hadoop RPC call is in progress. In this case, MonitorThread does not know it was interrupted: a failed YARN app state is returned and "Failed to contact YARN for application xxxxx; YARN application has exited unexpectedly with state xxxxx" is logged at error level, which confuses users a lot.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
very simple patch, seems no need?
Closes #30617 from sqlwindspeaker/yarn-client-interrupt-monitor.
Authored-by: suqilong <suqilong@qiyi.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Add migration guide for CHAR VARCHAR types
### Why are the changes needed?
for migration
### Does this PR introduce _any_ user-facing change?
doc change
### How was this patch tested?
passing ci
Closes #30654 from yaooqinn/SPARK-33641-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `MSCK REPAIR TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `MSCK REPAIR TABLE` is not supported for v2 tables.
### Why are the changes needed?
The PR makes the resolution behavior consistent. For example,
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("MSCK REPAIR TABLE t") // works fine
```
, but after this PR:
```
sql("MSCK REPAIR TABLE t")
org.apache.spark.sql.AnalysisException: t is a temp view. 'MSCK REPAIR TABLE' expects a table; line 1 pos 0
```
, which is the consistent behavior with other commands.
### Does this PR introduce _any_ user-facing change?
After this PR, `MSCK REPAIR TABLE t` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.
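The "temp view first" rule can be sketched as a small lookup function. This is only an illustration under simplifying assumptions: `AnalysisError`, `resolve_table`, and the set/dict catalog model are hypothetical and do not mirror Spark's analyzer.

```python
from typing import Dict, Set


class AnalysisError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException."""


def resolve_table(name: str,
                  temp_views: Set[str],
                  tables: Dict[str, str],
                  command: str) -> str:
    """Resolve `name` for a command that expects a table.

    Temp views are looked up first; a match is an error because the command
    cannot operate on a view. Otherwise fall back to the current catalog.
    """
    if name in temp_views:
        raise AnalysisError(f"{name} is a temp view. '{command}' expects a table")
    if name in tables:
        return tables[name]
    raise AnalysisError(f"Table or view not found: {name}")
```

With a temp view `t` registered, resolving `t` fails even though `spark_catalog.test.t` exists, which is the behavior shown in the example above.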
### How was this patch tested?
Updated existing tests.
Closes #30664 from imback82/repair_table_V2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve LogisticRegression test error tolerance
### Why are the changes needed?
When we switch BLAS versions, some of the tests fail because the error tolerance in the tests is too strict.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30587 from WeichenXu123/fix_lor_test.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Upgrade the jackson dependencies to 2.10.5 and jackson-databind to 2.10.5.1
### Why are the changes needed?
Jackson dependency has vulnerability CVE-2020-25649.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #30656 from n-marion/SPARK-33695_upgrade-jackson.
Authored-by: Nicholas Marion <nmarion@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently, Spark treats 0.0 and -0.0 as semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set.
The comparison expressions in Spark take care of the special floating numbers and implement the correct semantics. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
1. GROUP BY
2. join keys
3. window partition keys
This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++
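The underlying problem is easy to reproduce outside Spark: 0.0 and -0.0 compare equal but have different bit patterns, so anything that hashes the raw bits, as a distinct-count sketch does, sees two values. The Python sketch below demonstrates this with an exact set of bit patterns in place of HyperLogLog++; the helper names are hypothetical, and the NaN handling is included on the assumption that NaN is among the special floating numbers meant above.

```python
import math
import struct


def raw_bits(x: float) -> bytes:
    """IEEE 754 double encoding of x, as a bit-level hash would see it."""
    return struct.pack(">d", x)


def normalize(x: float) -> float:
    """Map -0.0 to 0.0 and every NaN to one canonical NaN."""
    if x == 0.0:          # true for both 0.0 and -0.0
        return 0.0
    if math.isnan(x):
        return math.nan
    return x


def distinct_by_bits(values, normalized: bool) -> int:
    """Count distinct values by bit pattern, optionally normalizing first."""
    return len({raw_bits(normalize(v) if normalized else v) for v in values})
```

Without normalization, `[0.0, -0.0]` counts as two distinct values; with it, as one.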
### Why are the changes needed?
Fix the query result
### Does this PR introduce _any_ user-facing change?
Yes, the result of HyperLogLog++ becomes correct now.
### How was this patch tested?
a new test case, and a few more test cases that pass before this PR to improve test coverage.
Closes #30673 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `sql/core`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30531 from jsoref/spelling-sql-core.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier.
This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated existing tests.
Closes #30636 from imback82/drop_view_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Remove old statement `ShowTableStatement`
2. Introduce new command `ShowTableExtended` for `SHOW TABLE EXTENDED`.
This PR is the first step of new V2 implementation of `SHOW TABLE EXTENDED`, see SPARK-33393.
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.
### Does this PR introduce _any_ user-facing change?
The changes should not affect V1 tables. For V2, Spark outputs the error:
```
SHOW TABLE EXTENDED is not supported for v2 tables.
```
### How was this patch tested?
By running `SHOW TABLE EXTENDED` tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```
Closes #30645 from MaxGekk/show-table-extended-statement.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`LikeSimplification` rule does not work correctly for many cases that have patterns containing escape characters, for example:
`SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
`SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`
For simplicity, this PR makes this rule just be skipped if `pattern` contains any `escapeChar`.
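The guard can be sketched as follows. This is a simplified Python model, not the Catalyst rule: `simplify_like` and its return values are hypothetical, and only the four simplest pattern shapes are handled.

```python
import re
from typing import Optional, Tuple

# Literal parts must not contain the LIKE wildcards `%` and `_`.
_SHAPES = (
    ("equals", re.compile(r"([^_%]*)")),
    ("starts_with", re.compile(r"([^_%]+)%")),
    ("ends_with", re.compile(r"%([^_%]+)")),
    ("contains", re.compile(r"%([^_%]+)%")),
)


def simplify_like(pattern: str, escape: str = "\\") -> Optional[Tuple[str, str]]:
    """Return a cheaper string operation for `pattern`, or None to keep LIKE.

    If the pattern contains the escape character at all, the rewrite is
    skipped, because an escaped `%` or `_` is a literal, not a wildcard.
    """
    if escape in pattern:
        return None
    for kind, shape in _SHAPES:
        match = shape.fullmatch(pattern)
        if match:
            return (kind, match.group(1))
    return None
```

Both patterns from the examples above, `'m%aca'` with `ESCAPE '%'` and `'maacaa'` with `ESCAPE 'a'`, return `None` here, so they would still be evaluated as a regular LIKE.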
### Why are the changes needed?
Result corruption.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added Unit test.
Closes #30625 from luluorta/SPARK-33677.
Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark **3.2.0**.
### Why are the changes needed?
By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature in various situations.
### Does this PR introduce _any_ user-facing change?
Yes, but this is an improvement and it's supposed to have no bugs.
### How was this patch tested?
Pass the CIs.
Closes #30628 from dongjoon-hyun/SPARK-33679.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This aims to add `-Pdocker-integration-tests` at GitHub Action job for Scala 2.13 compilation.
### Why are the changes needed?
We fixed Scala 2.13 compilation of this module at https://github.com/apache/spark/pull/30660 . This PR will prevent accidental regression at that module.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GitHub Action Scala 2.13 job.
Closes #30661 from dongjoon-hyun/SPARK-DOCKER-IT.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `ALTER [TABLE|VIEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated existing tests.
Closes #30610 from imback82/rename_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a build error of `OracleIntegrationSuite` with Scala 2.13.
### Why are the changes needed?
Build should pass with Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that the build passes with the following command.
```
$ build/sbt -Pdocker-integration-tests -Pscala-2.13 "docker-integration-tests/test:compile"
```
Closes #30660 from sarutak/fix-docker-integration-tests-for-scala-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Directly use float vectors instead of converting to double vectors; this is about 2x faster than using vec.axpy.
2. Mark `wordList` and `wordVecNorms` lazy.
3. Avoid slicing in computation of `wordVecNorms`.
### Why are the changes needed?
halve broadcast size
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30548 from zhengruifeng/w2v_float32_transform.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This is a follow-up for SPARK-33680 to remove the assumption on the default value of `spark.sql.adaptive.enabled` .
### Why are the changes needed?
According to the test result https://github.com/apache/spark/pull/30628#issuecomment-739866168, the [previous run](https://github.com/apache/spark/pull/30628#issuecomment-739641105) didn't run all tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30655 from dongjoon-hyun/SPARK-33680.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds a way to inject data source rewrite rules.
### Why are the changes needed?
Right now `SparkSessionExtensions` allows us to inject optimization rules, but they are added to the operator optimization batch. There are cases when users need to run rules after the operator optimization batch (e.g. cases when a rule relies on the fact that expressions have been optimized). Currently, this is not possible.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
This PR comes with a new test.
Closes #30577 from aokolnychyi/spark-33621-v3.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of char/varchar table insertion length check, to not expose user data.
### Why are the changes needed?
It is risky to expose user data in the error message, especially the string data, as it may contain sensitive data.
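The principle can be sketched in a few lines of Python: the length check reports the limit, never the offending value. `check_length` is a hypothetical helper, and the message wording is illustrative rather than a quote of Spark's actual error text.

```python
def check_length(value: str, limit: int) -> str:
    """Return `value` if it fits in `limit` characters, else raise.

    The error reports only the limit, never the offending string, so
    sensitive data cannot leak into logs through the message.
    """
    if len(value) > limit:
        raise ValueError(f"Exceeds char/varchar type length limitation: {limit}")
    return value
```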
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
updated tests
Closes #30653 from cloud-fan/minor2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30554 . Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.
### Why are the changes needed?
It's confusing to have two configs when one completely covers the other.
### Does this PR introduce _any_ user-facing change?
no, it's deprecating not removing.
### How was this patch tested?
N/A
Closes #30651 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `sql/catalyst`
* `sql/hive-thriftserver`
* `sql/hive`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30532 from jsoref/spelling-sql-not-core.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, we propose to narrow the use cases of the char/varchar data types, some of which are invalid now or will be later.
### Why are the changes needed?
1. udf
```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))
scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
... 47 elided
```
2. spark.createDataFrame
```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
| c|
+--------------------+
| # Apache Spark|
| |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
| |
|<https://spark.ap...|
| |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
| |
| |
```
3. reader.schema
```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
| a|
+--------------------+
| # Apache Spark|
| |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```
4. etc
### Does this PR introduce _any_ user-facing change?
No, we intend to avoid potential breaking changes.
### How was this patch tested?
new tests
Closes #30586 from yaooqinn/SPARK-33641.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The analyzer rule `PreprocessTableCreation` preprocesses logical plans related to table creation. But for CTAS, if the sub-query can't be resolved, preprocessing it causes "Invalid call to toAttribute on unresolved object" (instead of a user-friendly error message: "table or view not found").
This PR fixes this incorrect preprocessing for CTAS using a V2 catalog.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
The error message for CTAS with a non-existent table changed from:
`UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to
`AnalysisException: Table or view not found: xxx`
### How was this patch tested?
added test
Closes #30637 from linhongliu-db/fix-ctas.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`.
### Why are the changes needed?
Avoid generating small files for INSERT INTO TABLE from VALUES, for example:
```sql
CREATE TABLE t1(id int) USING parquet;
INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8);
```
Before this PR:
```
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec 1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 0 Dec 1 01:54 _SUCCESS
```
After this PR, with `spark.sql.files.minPartitionNum` set to 1:
```
-rw-r--r-- 1 root root 452 Dec 1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet
-rw-r--r-- 1 root root 0 Dec 1 01:59 _SUCCESS
```
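The relationship between parallelism and file count in the listings above can be sketched as follows. This is a simplified Python model with hypothetical helpers, assuming local rows are split into contiguous slices, one output file is written per non-empty slice, and the "before" run used a parallelism of 8 (as the eight part files suggest).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_rows(rows: Sequence[T], num_slices: int) -> List[List[T]]:
    """Split local rows into `num_slices` contiguous, near-equal partitions."""
    if num_slices < 1:
        raise ValueError("num_slices must be positive")
    n = len(rows)
    return [list(rows[i * n // num_slices:(i + 1) * n // num_slices])
            for i in range(num_slices)]


def output_files(rows: Sequence[T], num_slices: int) -> int:
    """Number of files written, assuming one file per non-empty partition."""
    return sum(1 for part in slice_rows(rows, num_slices) if part)
```

Under these assumptions, the eight inserted values yield eight files with eight slices and a single file with one slice.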
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30559 from wangyum/SPARK-33617.
Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Check that partition specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match the partition schema (all partition fields from the schema are specified in partition specs).
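A minimal Python sketch of such a check is shown below. `check_partition_spec` is a hypothetical helper; comparing column names case-insensitively is an assumption of this sketch, and the message is modeled on the V1 error text rather than copied from the V2 implementation.

```python
from typing import Mapping, Sequence


def check_partition_spec(spec: Mapping[str, str],
                         partition_columns: Sequence[str],
                         table: str) -> None:
    """Raise unless `spec` names exactly the table's partition columns.

    Column names are compared case-insensitively (an assumption of this sketch).
    """
    given = {name.lower() for name in spec}
    expected = {name.lower() for name in partition_columns}
    if given != expected:
        raise ValueError(
            f"Partition spec is invalid. The spec ({', '.join(sorted(given))}) "
            f"must match the partition spec ({', '.join(partition_columns)}) "
            f"defined in table '{table}'")
```

A spec such as `{"A": "9"}` against partition columns `["a", "b"]` is rejected, while `{"a": "9", "b": "1"}` passes.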
### Why are the changes needed?
1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error:
```sql
spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b);
spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9');
Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`';
```
2. To prevent future errors caused by not fully specified partition specs.
### Does this PR introduce _any_ user-facing change?
Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` now outputs the same error as the V1 commands.
### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```
Closes #30624 from MaxGekk/add-partition-full-spec.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR upgrades `commons.httpclient` from `4.5.6` to `4.5.13`.
4.5.6 was released over 2 years ago and now we can use the more stable `4.5.13`.
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
### Why are the changes needed?
To follow the more stable release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be done by the existing tests.
Closes #30634 from sarutak/upgrade-httpclient.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR updates `PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite` to set the required conf explicitly.
### Why are the changes needed?
The unit test should not depend on the default configurations.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
According to https://github.com/apache/spark/pull/30628 , these seem to be the only ones.
Pass the CIs.
Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job.
In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows.
```
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile
```
However, the Scala 2.13 build uses Java 8 rather than 11, so let's remove `-Djava.version=11`.
### Why are the changes needed?
To build with consistent configuration.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be done by GitHub Actions' workflow.
Closes #30633 from sarutak/scala-213-java11.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.
This PR is a kind of follow-up to https://github.com/apache/spark/pull/16373 and https://github.com/apache/spark/pull/15515.
### Why are the changes needed?
To output a user-friendly error with a recommendation like
**"
... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`
"**
instead of silently outputting an empty result.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running the affected test suites, in particular:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
```
Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix misleading logs in the following scenario, where uncaching is called on non-existing views:
```
scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
res0: org.apache.spark.sql.DataFrame = []
scala> val df = spark.table("table")
df: org.apache.spark.sql.DataFrame = [2: int]
scala> df.createOrReplaceTempView("t2")
20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name
org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
'UnresolvedRelation [t2], [], false
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
```
Since `t2` does not exist yet, it shouldn't try to uncache.
### Why are the changes needed?
To fix a misleading message.
### Does this PR introduce _any_ user-facing change?
Yes, the above message will not be displayed if the view doesn't exist yet.
### How was this patch tested?
Manually tested since this is a log message printed.
Closes #30608 from imback82/fix_cache_message.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is to show Slowpoke notifications in the log when running tests using SBT.
For example, the test case "zero sized blocks" in ExternalShuffleServiceSuite enters an infinite loop. After this change, once a test case has been running for longer than two minutes, the log file gets a notification message every five minutes. Below is an example message.
```
[info] ExternalShuffleServiceSuite:
[info] - groupByKey without compression (101 milliseconds)
[info] - shuffle non-zero block size (3 seconds, 186 milliseconds)
[info] - shuffle serializer (3 seconds, 189 milliseconds)
[info] *** Test still running after 2 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 7 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 12 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 17 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
```
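The timestamps in the sample output are consistent with a first notification after a two-minute delay, followed by one every five minutes. A minimal Python sketch of that schedule follows; the helper name `slowpoke_times` and this pure-Python model are illustrative assumptions, not part of the SBT or ScalaTest configuration:

```python
def slowpoke_times(delay_s, period_s, elapsed_s):
    """Return the elapsed times (in seconds) at which a notification is due.

    delay_s:   how long a test may run before the first notification
    period_s:  interval between subsequent notifications
    elapsed_s: how long the test has been running so far
    """
    times = []
    t = delay_s
    while t <= elapsed_s:
        times.append(t)
        t += period_s
    return times


# A test stuck for 18 minutes, with a 2-minute delay and a 5-minute period.
print([t // 60 for t in slowpoke_times(120, 300, 18 * 60)])
# → [2, 7, 12, 17]
```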
### Why are the changes needed?
When tests or code have a bug and enter an infinite loop, it is hard to tell from the log which test cases are affected, especially when the tests run in parallel. Showing the Slowpoke notifications makes the stuck test visible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual testing in my local dev environment.
Closes #30621 from gatorsmile/addSlowpoke.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag **spark.sql.caseSensitive**.
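The normalization described above can be sketched in plain Python. This is only an illustration of the idea: the function `normalize_partition_spec` and its parameters are hypothetical and do not reproduce Spark's actual Scala implementation.

```python
def normalize_partition_spec(spec, partition_columns, case_sensitive=False):
    """Map user-supplied partition keys onto the table's partition column names."""
    normalized = {}
    for key, value in spec.items():
        if case_sensitive:
            matches = [c for c in partition_columns if c == key]
        else:
            matches = [c for c in partition_columns if c.lower() == key.lower()]
        if not matches:
            raise ValueError(
                f"{key} is not a partition column; expected one of {partition_columns}"
            )
        # Use the column name as declared on the table, not as typed by the user.
        normalized[matches[0]] = value
    return normalized


spec = {"YEAR": 2015, "Month": 1}
print(normalize_partition_spec(spec, ["year", "month"]))
# → {'year': 2015, 'month': 1}
```

With `case_sensitive=True`, the same spec raises `ValueError`, which mirrors the pre-fix behavior of rejecting `YEAR` and `Month` as non-partitioning columns.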
### Why are the changes needed?
V1 SHOW PARTITIONS is in fact always case sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
> USING parquet
> PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
```
The `SHOW PARTITIONS` command should show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
year=2015/month=1
```
### How was this patch tested?
By running the affected test suites:
- `v1/ShowPartitionsSuite`
- `v2/ShowPartitionsSuite`
Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds a few lines about docstring style to document that PySpark follows the [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). The migration to the NumPy documentation style was completed in SPARK-32085.
Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html, but I would like to leave that as future work.
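For readers unfamiliar with the convention, a short sketch of a NumPy-style docstring follows. The function `clip_values` is a made-up example and is not taken from the PySpark code base:

```python
def clip_values(values, lower, upper):
    """
    Clip each value to the closed interval [lower, upper].

    Parameters
    ----------
    values : list of float
        Input values.
    lower : float
        Smallest value allowed in the result.
    upper : float
        Largest value allowed in the result.

    Returns
    -------
    list of float
        A new list with every element limited to the given range.

    Examples
    --------
    >>> clip_values([-1.0, 0.5, 2.0], 0.0, 1.0)
    [0.0, 0.5, 1.0]
    """
    return [min(max(v, lower), upper) for v in values]
```

The section names (`Parameters`, `Returns`, `Examples`) and their dashed underlines are what the numpydoc format prescribes.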
### Why are the changes needed?
To tell developers that PySpark now follows NumPy documentation style.
### Does this PR introduce _any_ user-facing change?
No, the change only affects unreleased branches.
### How was this patch tested?
Manually tested via `make clean html` under `python/docs`:
![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)
Closes #30622 from HyukjinKwon/SPARK-33256.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.
### Why are the changes needed?
Currently, DSv2 delete from table doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later.
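The stale-cache problem, and the callback-based fix, can be illustrated with a toy model in Python. All names here (`ToyTable`, `ToyCache`, `delete_where`) are hypothetical stand-ins and are not Spark classes:

```python
class ToyTable:
    """A stand-in for a table whose rows can be deleted."""

    def __init__(self, rows):
        self.rows = list(rows)


class ToyCache:
    """Holds a snapshot of a table's rows until refreshed."""

    def __init__(self, table):
        self.table = table
        self.snapshot = list(table.rows)

    def refresh(self):
        self.snapshot = list(self.table.rows)


def delete_where(table, predicate, refresh_cache=None):
    """Delete matching rows, then invoke the optional refresh callback."""
    table.rows = [r for r in table.rows if not predicate(r)]
    if refresh_cache is not None:
        refresh_cache()


table = ToyTable([1, 2, 3])
cache = ToyCache(table)

# Without a callback the cache still returns the deleted row.
delete_where(table, lambda r: r == 2)
stale = list(cache.snapshot)

# With the callback the cache is rebuilt after the delete.
delete_where(table, lambda r: r == 3, refresh_cache=cache.refresh)
fresh = list(cache.snapshot)

print(stale, fresh)
# → [1, 2, 3] [1]
```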
### Does this PR introduce _any_ user-facing change?
Yes. Delete from table in v2 now also refreshes the cache.
### How was this patch tested?
Added a test case.
Closes #30597 from sunchao/SPARK-33652.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."
The test has flaked multiple times, and the failures have been similar to:
```
The code passed to eventually never returned normally. Attempted 109 times over 3.0079882413999997 minutes. Last failure message: Failure executing: GET at:
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false. Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).. (KubernetesSuite.scala:402)
```
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
From the above failures, it seems that the executor finishes too quickly and is removed by Spark before the test can complete.
One way to mitigate this is to set the flag `spark.kubernetes.executor.deleteOnTermination`.
### Why are the changes needed?
Fixes a flaky test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
A few runs of the Jenkins integration tests may reveal whether the problem is resolved.
Closes #30616 from ScrapCodes/SPARK-33668/fix-flaky-k8s-integration-test.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
### Why are the changes needed?
To fix a string interpolation bug.
### Does this PR introduce _any_ user-facing change?
Yes, the string will be correctly constructed.
### How was this patch tested?
Existing tests since they were used in exception/log messages.
Closes #30609 from imback82/fix_cache_str_interporlation.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that the v2 session catalog can override it.
### Why are the changes needed?
It's an unnecessary behavior difference: a Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while a data source table doesn't allow `CREATE EXTERNAL TABLE` at all.
### Does this PR introduce _any_ user-facing change?
Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed.
### How was this patch tested?
New tests.
Closes #30595 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up for #30373 that updates the comment for RemoveRedundantSorts in QueryExecution.
### Why are the changes needed?
To update an incorrect comment.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30584 from allisonwang-db/spark-33472-followup.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix Scala 2.13 compilation.
### Why are the changes needed?
To recover Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GitHub Action Scala 2.13 build job.
Closes #30611 from dongjoon-hyun/SPARK-33141.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes #30606 from dongjoon-hyun/SPARK-3.2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Update the Kafka headers documentation: the type is no longer a map but an array.
[jira](https://issues.apache.org/jira/browse/SPARK-33660)
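One practical difference between the two shapes: Kafka record headers may repeat a key, which an array of key/value pairs can hold but a map cannot. A small sketch in plain Python, with made-up header names and values (this models the data shape only and does not use the Spark or Kafka APIs):

```python
# Headers modeled as an ordered list of (key, value) pairs.
headers = [("trace-id", b"abc"), ("tag", b"blue"), ("tag", b"green")]

# Converting to a map keeps only one value per key.
as_map = dict(headers)

# The array form keeps every value for a repeated key, in order.
tags = [value for key, value in headers if key == "tag"]

print(len(headers), len(as_map), tags)
# → 3 2 [b'blue', b'green']
```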
### Why are the changes needed?
To help users
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
It is only documentation
Closes #30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.
Authored-by: german <germanschiavon@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>