Commit graph

29451 commits

Chao Sun ce13dcc689 [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
### What changes were proposed in this pull request?

Currently, `SpecificParquetRecordReaderBase` uses deprecated Parquet APIs in a few places, such as `readFooter`, `ParquetInputSplit`, the `ParquetFileReader` constructor, `filterRowGroups`, etc. This replaces them with the newer APIs (a sketch follows the list). Specifically, this:
- Replaces `ParquetInputSplit` with `FileSplit`. We never use anything specific to the former, such as `rowGroupOffsets`, so the swap is straightforward.
- Removes `readFooter` calls by using `ParquetFileReader.open`.
- Replaces the deprecated `ParquetFileReader` constructor with the newer API which takes `ParquetReadOptions`.
- Removes the unnecessary handling of the case when `rowGroupOffsets` is not null. It seems this never happens.
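
For illustration, a minimal sketch of the non-deprecated read path, assuming the usual parquet-hadoop classes (`HadoopInputFile`, `HadoopReadOptions`); exact signatures depend on the Parquet version, and this is not the actual `SpecificParquetRecordReaderBase` code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hypothetical helper: open a Parquet file via InputFile + ParquetReadOptions
// instead of the deprecated readFooter / ParquetFileReader constructor.
def openParquet(path: String, conf: Configuration): ParquetFileReader = {
  val inputFile = HadoopInputFile.fromPath(new Path(path), conf)
  val options = HadoopReadOptions.builder(conf).build()
  val reader = ParquetFileReader.open(inputFile, options)
  // Footer metadata (schema, row groups) is available from the reader itself.
  println(s"row groups: ${reader.getFooter.getBlocks.size()}")
  reader
}
```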

### Why are the changes needed?

The aforementioned APIs are deprecated and are going to be removed at some point in the future. This change ensures better supportability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a cleanup and relies on existing tests on the relevant code paths.

Closes #31667 from sunchao/SPARK-32703.

Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:41 +09:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is much more concise than an expression of the form `exp(sum(log(...)))`, avoids awkward edge cases associated with some values being zero or negative, and is less computationally costly.

### Does this PR introduce _any_ user-facing change?
No; it only adds a new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (Scala) `sql.functions.product` function. The latter and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested and may need separate validation (I'm not an R user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
Gabriele Nizzoli b13a4b85d4 [SPARK-34573][SQL] Avoid global locking in SQLConf object for sqlConfEntries map
### What changes were proposed in this pull request?
In the `SQLConf` object, the `sqlConfEntries` map is globally synchronized (it is a Java `Collections.synchronizedMap`): any operation, including a get, will need to acquire the lock.

An example of this is calling the `DataType.sameType` method. This triggers a check on `SQLConf.get.caseSensitiveAnalysis`, so every time we compare two data types with `sameType`, we hit the lock.

To avoid having multiple tasks block on this lock, a better approach is to use a map that does not lock on reads, like a `ConcurrentHashMap`. Such a map does not lock on read, and on write it locks only part of the map; contention happens only when writing to the same key.
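
An illustrative, standalone sketch of the difference (not the actual `SQLConf` code; key names and thread counts are arbitrary): `Collections.synchronizedMap` serializes every access including gets, while `ConcurrentHashMap` reads do not block:

```scala
import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

object MapContentionSketch {
  // Every access, including get(), takes the single global lock.
  val synchronizedEntries: java.util.Map[String, String] =
    Collections.synchronizedMap(new java.util.HashMap[String, String]())

  // Reads never block; writes lock only the affected bin, so contention is per key.
  val concurrentEntries = new ConcurrentHashMap[String, String]()

  def main(args: Array[String]): Unit = {
    synchronizedEntries.put("spark.sql.caseSensitive", "false")
    concurrentEntries.put("spark.sql.caseSensitive", "false")

    // Many "tasks" doing config lookups in parallel: only the second map
    // scales without lock contention.
    val readers = (1 to 8).map { _ =>
      new Thread(() => {
        var i = 0
        while (i < 1000000) { concurrentEntries.get("spark.sql.caseSensitive"); i += 1 }
      })
    }
    readers.foreach(_.start())
    readers.foreach(_.join())
  }
}
```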

### Why are the changes needed?
Any operation that directly or indirectly triggers a lookup in the `SQLConf.sqlConfEntries` map requires acquiring a global lock on that map. Something as simple as calling `DataType.sameType(...)` blocks on the global `sqlConfEntries` lock of the `Collections.synchronizedMap`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No functionality change. Existing unit tests run normally.

Closes #31689 from gabrielenizzoli/SPARK-34573.

Authored-by: Gabriele Nizzoli <1545350+gabrielenizzoli@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-02 15:36:51 +09:00
Dongjoon Hyun 499cc79344 [SPARK-34503][DOCS][FOLLOWUP] Document available codecs for event log compression
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/31618 to document the available codecs for event log compression.

### Why are the changes needed?

Documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #31695 from dongjoon-hyun/SPARK-34503-DOC.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-01 15:42:10 -08:00
Max Gekk 70f6267de6 [SPARK-34560][SQL] Generate unique output attributes in the SHOW TABLES logical node
### What changes were proposed in this pull request?
In the PR, I propose to generate unique attributes in the logical nodes of the `SHOW TABLES` command.

Also, this PR fixes similar issues in other logical nodes:
- ShowTableExtended
- ShowViews
- ShowTableProperties
- ShowFunctions
- ShowColumns
- ShowPartitions
- ShowNamespaces

### Why are the changes needed?
This fixes the issue which is demonstrated by the example below:
```scala
scala> val show1 = sql("SHOW TABLES IN ns1")
show1: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]

scala> val show2 = sql("SHOW TABLES IN ns2")
show2: org.apache.spark.sql.DataFrame = [namespace: string, tableName: string ... 1 more field]

scala> show1.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns1|     tbl1|      false|
+---------+---------+-----------+

scala> show2.show
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|      ns2|     tbl2|      false|
+---------+---------+-----------+

scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
org.apache.spark.sql.AnalysisException: Column tableName#17 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
  at org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```scala
scala> show1.join(show2).where(show1("tableName") =!= show2("tableName")).show
+---------+---------+-----------+---------+---------+-----------+
|namespace|tableName|isTemporary|namespace|tableName|isTemporary|
+---------+---------+-----------+---------+---------+-----------+
|      ns1|     tbl1|      false|      ns2|     tbl2|      false|
+---------+---------+-----------+---------+---------+-----------+
```

### How was this patch tested?
By running the new test:
```
$  build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #31675 from MaxGekk/fix-output-attrs.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-01 18:32:32 +00:00
Yikun Jiang 85b50d4258 [SPARK-34539][BUILD][INFRA] Remove stand-alone version Zinc server
### What changes were proposed in this pull request?
Clean up all Zinc standalone server code and related configuration.

### Why are the changes needed?
![image](https://user-images.githubusercontent.com/1736354/109154790-c1d3e580-77a9-11eb-8cde-835deed6e10e.png)
- Zinc is an incremental compiler that speeds up builds.
- The scala-maven-plugin is the Maven plugin used by Spark; one of its functions is to integrate Zinc to enable incremental compilation.
- Since Spark v3.0.0 ([SPARK-28759](https://issues.apache.org/jira/browse/SPARK-28759)), the scala-maven-plugin has been upgraded to v4.x, which means the Zinc v0.3.13 standalone server is no longer needed.

However, we still download, install, and start the standalone Zinc server. We should remove all Zinc standalone server code and all related configuration.

See more in [SPARK-34539](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-34539) or the doc [Zinc standalone server is useless after scala-maven-plugin 4.x](https://docs.google.com/document/d/1u4kCHDx7KjVlHGerfmbcKSB0cZo6AD4cBdHSse-SBsM).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run any Maven build:
`./build/mvn -DskipTests clean package -pl core`
You can see that incremental compilation still works in the "scala-maven-plugin:4.3.0:compile (scala-compile-first)" stage, which prints incremental compilation info like:
```
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first)  spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /root/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] Compiling 303 Scala sources and 27 Java sources to /root/spark/core/target/scala-2.12/test-classes ...
```

Closes #31647 from Yikun/cleanup-zinc.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-01 08:39:38 -06:00
Max Gekk 984ff396a2 [SPARK-34561][SQL] Fix drop/add columns from/to a dataset of v2 DESCRIBE TABLE
### What changes were proposed in this pull request?
In the PR, I propose to generate "stable" output attributes per the logical node of the `DESCRIBE TABLE` command.

### Why are the changes needed?
This fixes the issue demonstrated by the example:
```scala
val tbl = "testcat.ns1.ns2.tbl"
sql(s"CREATE TABLE $tbl (c0 INT) USING _")
val description = sql(s"DESCRIBE TABLE $tbl")
description.drop("comment")
```
The `drop()` method fails with the error:
```
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col_name#102,data_type#103 missing from col_name#29,data_type#30,comment#31 in operator !Project [col_name#102, data_type#103]. Attribute(s) with the same name appear in the operation: col_name,data_type. Please check if the right attribute(s) are used.;
!Project [col_name#102, data_type#103]
+- LocalRelation [col_name#29, data_type#30, comment#31]

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:51)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:50)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `drop()`/`add()` works as expected:
```scala
description.drop("comment").show()
+---------------+---------+
|       col_name|data_type|
+---------------+---------+
|             c0|      int|
|               |         |
| # Partitioning|         |
|Not partitioned|         |
+---------------+---------+
```

### How was this patch tested?
1. Run new test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DataSourceV2SQLSuite"
```
2. Run existing test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31676 from MaxGekk/describe-table-drop-column.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-01 22:20:28 +08:00
Kousuke Saruta a6cc5e625f [SPARK-34574][DOCS] Jekyll fails to generate Scala API docs for Scala 2.13
### What changes were proposed in this pull request?

This PR fixes an issue where the `bundler exec jekyll` build fails to generate Scala API docs even after `dev/change-scala-version.sh 2.13` has been run.

### Why are the changes needed?

The reason for this issue is that `build/sbt` in `copy_api_dirs.rb` runs without `-Pscala-2.13`.
So, it's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I tested the following patterns manually.

* `dev/change-scala-version 2.13` and then `bundler exec jekyll build`
* `dev/change-scala-version 2.12` to change back to Scala 2.12 and then `bundler exec jekyll build`
* `dev/change-scala-version 2.13` two times to confirm the idempotency and then `bundler exec jekyll build`
* `dev/change-scala-version 2.12` two times to confirm the idempotency and then `bundler exec jekyll build`

Closes #31690 from sarutak/jekyll-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-01 15:36:33 +09:00
Shixiong Zhu 62737e140c [SPARK-34556][SQL] Checking duplicate static partition columns should respect case sensitive conf
### What changes were proposed in this pull request?

This PR makes partition spec parsing respect case sensitive conf.

### Why are the changes needed?

When parsing the partition spec, Spark will call `org.apache.spark.sql.catalyst.parser.ParserUtils.checkDuplicateKeys` to check if there are duplicate partition column names in the list. But this method is always case sensitive and doesn't detect duplicate partition column names when using different cases.
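
A minimal, hypothetical sketch of the intended behavior (illustrative only, not the actual `ParserUtils` code): when case sensitivity is off, duplicate detection should compare lower-cased names:

```scala
// Hypothetical illustration of case-aware duplicate detection in a partition spec.
def duplicateKeys(keys: Seq[String], caseSensitive: Boolean): Seq[String] = {
  val normalized = if (caseSensitive) keys else keys.map(_.toLowerCase)
  normalized.groupBy(identity).collect { case (k, vs) if vs.size > 1 => k }.toSeq
}

duplicateKeys(Seq("c", "C"), caseSensitive = false)  // Seq("c") -> should be rejected
duplicateKeys(Seq("c", "C"), caseSensitive = true)   // Seq()    -> allowed
```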

### Does this PR introduce _any_ user-facing change?

Yep. This prevents users from writing incorrect queries such as `INSERT OVERWRITE t PARTITION (c='2', C='3') VALUES (1)` when they don't enable case sensitive conf.

### How was this patch tested?

The new added test will fail without this change.

Closes #31669 from zsxwing/SPARK-34556.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-01 13:55:35 +09:00
HyukjinKwon 3d0ee9604e [SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31636. There was one place missed in `GangliaSink`, and we should also remove `SecurityManager`.

### Why are the changes needed?

To make `GangliaSink` work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It was found in internal integration tests at the company I work for.

Closes #31688 from HyukjinKwon/SPARK-34520-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-01 11:18:57 +09:00
Chao Sun f494c5cff9 [SPARK-33212][FOLLOWUP] Add hadoop-yarn-server-web-proxy for Hadoop 3.x profile
### What changes were proposed in this pull request?

This adds `hadoop-yarn-server-web-proxy` as dependency for Yarn and Hadoop 3.x profile (it is already a dependency for 2.x). Also excludes some dependencies from the module which are already covered by other Hadoop jars used by Spark.

### Why are the changes needed?

The class `org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter` is used by `ApplicationMaster`:
```scala
  private def addAmIpFilter(driver: Option[RpcEndpointRef], proxyBase: String) = {
    val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
    val params = client.getAmIpFilterParams(yarnConf, proxyBase)
    driver match {
      case Some(d) =>
        d.send(AddWebUIFilter(amFilter, params, proxyBase))
   ...
```
and will be loaded at runtime. Therefore, without the above jar, a Spark YARN app will fail with `ClassNotFoundError`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests. Also tested manually: it works with the fix, while it was failing previously.

Closes #31642 from sunchao/SPARK-33212-followup-2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-28 16:37:49 -08:00
Kent Yao 1afe284ed8 [SPARK-34570][SQL] Remove dead code from constructors of [Hive]SessionStateBuilder
### What changes were proposed in this pull request?

The parameter `options` is never used. The change here was part of https://github.com/apache/spark/pull/30642; it got reverted by dad24543aa as a hotfix to make backporting #30642 easier. This PR brings it back to master.

### Why are the changes needed?

Remove useless dead code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing CI is enough.

Closes #31683 from yaooqinn/SPARK-34570.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:30:18 +09:00
Angerszhuuuu d574308864 [SPARK-34579][SQL][TEST] Fix wrong UT in SQLQuerySuite
### What changes were proposed in this pull request?
Some UTs in SQLQuerySuite are not correct: they use the wrong table name in `withTable`. This PR corrects them.

### Why are the changes needed?
Fix UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #31681 from AngersZhuuuu/SPARK-34569.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-28 16:21:42 -08:00
Shardul Mahadik 0216051aca [SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior
### What changes were proposed in this pull request?
SPARK-33084 added the ability to use Ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior, although I found a few cases where it doesn't:

1) The default value of the `transitive` parameter is false, both when the parameter is not specified in the coordinate and when its value is invalid. The Hive behavior is that transitive is [true if not specified](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L169)) in the coordinate and [false for invalid values](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L124)). Also, regardless of Hive, I think a default of true for the `transitive` parameter also matches [Ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The value of the `transitive` parameter is treated as case-sensitive, [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L122)).

I propose that we be compatible with Hive for these behaviors (see the sketch below).
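
For illustration, a hedged sketch of the resulting behavior using the `ivy://` scheme from SPARK-33084 (the coordinate below is an arbitrary example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("add-jar-ivy").getOrCreate()

// No transitive parameter: with this change, dependencies are resolved
// transitively by default, matching both Hive and Ivy.
spark.sql("ADD JAR ivy://org.apache.hive:hive-storage-api:2.7.0")

// The parameter value is now treated case-insensitively, so TRUE behaves like true.
spark.sql("ADD JAR ivy://org.apache.hive:hive-storage-api:2.7.0?transitive=TRUE")
```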

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are only within master, as the feature introduced in SPARK-33084 has not been released yet:
1. Previously, an Ivy coordinate without the `transitive` parameter specified did not resolve transitive dependencies; now it does.
2. Previously, the `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` was treated as false because it did not match `true` exactly. Now it is treated case-insensitively.

### How was this patch tested?

Modified existing unit tests to test the new behavior.
Added a new unit test to cover usage of `exclude` with unspecified `transitive`.

Closes #31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:10:20 +09:00
Yuming Wang d07fc3076b [SPARK-33687][SQL] Support analyze all tables in a specific database
### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:
```g4
 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```
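
A usage sketch of the new statement (the database name `mydb` is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("analyze-tables").getOrCreate()

// Compute statistics for every table in the given database.
spark.sql("ANALYZE TABLES IN mydb COMPUTE STATISTICS")

// The optional trailing identifier (e.g. NOSCAN) skips the row scan and
// collects only size-based statistics.
spark.sql("ANALYZE TABLES IN mydb COMPUTE STATISTICS NOSCAN")
```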

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature is tested by unit tests.
The documentation is tested by regenerating the documentation:

menu-sql.yaml |  sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes #30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:06:47 +09:00
Phillip Henry 5a48eb8d00 [SPARK-34415][ML] Python example
Adds the Python example file missing from [SPARK-34415][ML] Randomization in hyperparameter optimization (https://github.com/apache/spark/pull/31535).

### What changes were proposed in this pull request?
For some reason (probably me being silly), examples/src/main/python/ml/model_selection_random_hyperparameters_example.py was not pushed in a previous PR.
This PR restores that file.

### Why are the changes needed?
A single file (examples/src/main/python/ml/model_selection_random_hyperparameters_example.py) should have been pushed as part of SPARK-34415 but was not. This was causing lint errors, as highlighted by dongjoon-hyun. Consequently, srowen asked for a new PR.

### Does this PR introduce _any_ user-facing change?
No, it merely restores a file that was overlooked in SPARK-34415.

### How was this patch tested?
By running:
`bin/spark-submit examples/src/main/python/ml/model_selection_random_hyperparameters_example.py`

Closes #31687 from PhillHenry/SPARK-34415_model_selection_random_hyperparameters_example.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-28 17:01:13 -06:00
Yuming Wang 54c053afb0 [SPARK-34479][SQL] Add zstandard codec to Avro compression codec list
### What changes were proposed in this pull request?

Avro has supported the zstandard codec since AVRO-2195. This PR adds the zstandard codec to the Avro compression codec list.

### Why are the changes needed?

To make the Avro data source support the zstandard codec.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31673 from wangyum/SPARK-34479.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-27 10:31:42 -08:00
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective technique than a grid search, since minima/maxima can fall between grid points and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
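
To make the idea concrete, a toy Scala sketch of random search over a continuous range (illustrative only; it does not use the new `ParamRandomBuilder` API):

```scala
import scala.util.Random

// Sample hyperparameter candidates uniformly at random instead of enumerating
// a fixed grid: a grid can step over the optimum, while random draws can land
// anywhere in the range, with better odds the more attempts are made.
def randomSearch(lower: Double, upper: Double, attempts: Int)(loss: Double => Double): Double = {
  val rng = new Random(42)
  val candidates = Seq.fill(attempts)(lower + rng.nextDouble() * (upper - lower))
  candidates.minBy(loss)
}

// Example: minimize a bumpy 1-D "validation loss" over a regularization parameter.
val best = randomSearch(0.0, 1.0, attempts = 50) { x =>
  math.pow(x - 0.37, 2) + 0.05 * math.sin(40 * x)
}
println(f"best candidate: $best%.3f")
```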

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
Dongjoon Hyun 1aeafb4852 [SPARK-34559][BUILD] Upgrade to ZSTD JNI 1.4.8-6
### What changes were proposed in this pull request?

This PR aims to upgrade ZSTD JNI to 1.4.8-6.

### Why are the changes needed?

This fixes the following issue and will unblock SPARK-34479 (Support ZSTD at Avro data source).
- https://github.com/luben/zstd-jni/issues/161

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31674 from dongjoon-hyun/SPARK-34559.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-27 03:04:12 -08:00
Dongjoon Hyun d75821038f [SPARK-34557][BUILD] Exclude Avro's transitive zstd-jni dependency
### What changes were proposed in this pull request?

This PR aims to exclude `Apache Avro`'s transitive zstd-jni dependency.

### Why are the changes needed?

While SPARK-27733 upgrades Apache Avro from 1.8 to 1.10,
a transitive `zstd-jni` dependency is introduced.

This PR explicitly prevents dependency conflicts.

**BEFORE**
```
$ build/sbt "core/evicted" | grep zstd
[info] 	* com.github.luben:zstd-jni:1.4.8-5 is selected over 1.4.5-12
```

**AFTER**
```
$ build/sbt "core/evicted" | grep zstd
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31670 from dongjoon-hyun/SPARK-34557.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-02-26 23:58:45 -08:00
Ruifeng Zheng 05069ff4ce [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
### What changes were proposed in this pull request?
If the child RDD has only one partition or zero partitions, skip the shuffle.

### Why are the changes needed?
Skip the shuffle if possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #31468 from zhengruifeng/collect_limit_single_partition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-27 16:48:20 +09:00
ShiKai Wang 56e664c717 [SPARK-34392][SQL] Support ZoneOffset +h:mm in DateTimeUtils.getZoneId
### What changes were proposed in this pull request?
To support `+8:00` in Spark 3 when executing SQL such as
`select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")`

### Why are the changes needed?
The `+8:00` format is supported in PostgreSQL, Hive, and Presto, but not in Spark 3.
https://issues.apache.org/jira/browse/SPARK-34392

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
unit test

Closes #31624 from Karl-WangSK/zone.

Lead-authored-by: ShiKai Wang <wskqing@gmail.com>
Co-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-26 11:03:20 -06:00
HyukjinKwon 8d68f3f746 [MINOR] Add more known translations of contributors
### What changes were proposed in this pull request?

This PR adds some more known translations of contributors who contributed multiple times in Spark 3.1.1.

### Why are the changes needed?

To make release process easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

N/A (auto-generated)

Closes #31665 from HyukjinKwon/minor-add-known-translations.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-27 01:02:11 +09:00
tanel.kiis@gmail.com 67ec4f7f67 [SPARK-33971][SQL] Eliminate distinct from more aggregates
### What changes were proposed in this pull request?

Add more aggregate expressions to `EliminateDistinct` rule.

### Why are the changes needed?

Distinct aggregation can add a significant overhead. It's better to remove distinct whenever possible.
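
For example, a small sketch of aggregates where DISTINCT is redundant (the exact set of aggregates covered is defined by the `EliminateDistinct` rule; table and column names here are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eliminate-distinct").getOrCreate()
import spark.implicits._

Seq(1, 2, 2, 3).toDF("v").createOrReplaceTempView("t")

// MAX and MIN are insensitive to duplicates, so DISTINCT adds nothing and only
// forces extra de-duplication work unless the optimizer removes it.
spark.sql("SELECT max(DISTINCT v), min(DISTINCT v) FROM t").show()
spark.sql("SELECT max(v), min(v) FROM t").show()  // same result, cheaper plan
```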

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #30999 from tanelk/SPARK-33971_eliminate_distinct.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-26 21:59:02 +09:00
Max Gekk c1beb16cc8 [SPARK-34554][SQL] Implement the copy() method in ColumnarMap
### What changes were proposed in this pull request?
Implement `ColumnarMap.copy()` by using the `copy()` method of `ColumnarArray`.

### Why are the changes needed?
To eliminate `java.lang.UnsupportedOperationException` while using `ColumnarMap`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new tests in `ColumnarBatchSuite`.

Closes #31663 from MaxGekk/columnar-map-copy.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 21:33:14 +09:00
Liang-Chi Hsieh a9e8e0528a [SPARK-34549][BUILD] Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
### What changes were proposed in this pull request?

This patch upgrades the AWS Kinesis and Java SDK versions.

### Why are the changes needed?

Upgrade AWS Kinesis and the Java SDK to meet the minimum requirement for new features like IAM roles for service accounts: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #31658 from viirya/upgrade-aws-sdk.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 21:30:24 +09:00
ulysses-you 82267acfe8 [SPARK-34550][SQL] Skip InSet null value during push filter to Hive metastore
### What changes were proposed in this pull request?

Skip `InSet` null value during push filter to Hive metastore.

### Why are the changes needed?

If `InSet` contains a null value, we should skip it and push the other values to the metastore, to keep the same behavior as `In`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31659 from ulysses-you/SPARK-34550.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 21:29:14 +09:00
Cheng Su 7d5021f5ee [SPARK-34533][SQL] Eliminate LEFT ANTI join to empty relation in AQE
### What changes were proposed in this pull request?

I discovered from the review discussion https://github.com/apache/spark/pull/31630#discussion_r581774000 that we can eliminate a LEFT ANTI join (with no join condition) to an empty relation if the right side is known to be non-empty. With AQE, this is doable similarly to https://github.com/apache/spark/pull/29484.
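
For illustration, a minimal sketch of the query shape this targets (table names are placeholders): a condition-less LEFT ANTI join whose right side is non-empty always produces zero rows, so the join can be replaced by an empty relation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-left-anti")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("a").createOrReplaceTempView("t1")
Seq(10).toDF("b").createOrReplaceTempView("t2")

// No join condition: since t2 is non-empty, every t1 row is "matched",
// so the LEFT ANTI join returns no rows at all.
spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2").show()  // empty result
```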

### Why are the changes needed?

This can help eliminate the join operator during logical plan optimization.
Before this PR, the [left side physical plan's `execute()` would be called](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L192), so if the left side is complicated (e.g. contains a broadcast exchange operator), some computation would happen. After this PR, the join operator is removed during logical planning, and nothing is computed from the left side. Potentially this can save resources for these kinds of queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests for positive and negative queries in `AdaptiveQueryExecSuite.scala`.

Closes #31641 from c21/left-anti-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 11:46:27 +00:00
Wenchen Fan 73857cdd87 [SPARK-34524][SQL] Simplify v2 partition commands resolution
### What changes were proposed in this pull request?

This PR simplifies the resolution of v2 partition commands:
1. Add a common trait for v2 partition commands, so that we don't need to match them one by one in the rules.
2. Make partition spec an expression, so that it's easier to resolve them via tree node transformation.
3. Add `TruncatePartition` so that `TruncateTable` doesn't need to be a v2 partition command.
4. Simplify `CheckAnalysis` to only check whether the table is partitioned. For partitioned tables, the partition spec is always resolved, so we don't need to check it. The `SupportsAtomicPartitionManagement` check is also done at runtime. Since Spark eagerly executes commands, a runtime exception will still be thrown at analysis time.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31637 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 11:44:42 +00:00
HyukjinKwon ac774ec0c2 [SPARK-34553][INFRA] Rename GITHUB_API_TOKEN to GITHUB_OAUTH_KEY in translate-contributors.py
### What changes were proposed in this pull request?

This PR proposes to add an alias environment variable `GITHUB_OAUTH_KEY` for `GITHUB_API_TOKEN` in `translate-contributors.py` script.

### Why are the changes needed?

```
dev/github_jira_sync.py:GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY")
dev/github_jira_sync.py:        request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/github_jira_sync.py:        request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/merge_spark_pr.py:GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY")
dev/merge_spark_pr.py:        if GITHUB_OAUTH_KEY:
dev/merge_spark_pr.py:            request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/run-tests-jenkins.py:    github_oauth_key = os.environ["GITHUB_OAUTH_KEY"]
```

Spark uses `GITHUB_OAUTH_KEY` for the GitHub token, but the `translate-contributors.py` script alone uses `GITHUB_API_TOKEN`. We should match them to make it easier to run the script.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

I manually tested by running this script.

Closes #31662 from HyukjinKwon/minor-gh-token-name.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 20:20:31 +09:00
HyukjinKwon 5b92531937 [SPARK-34551][INFRA] Fix credit related scripts to recover, drop Python 2 and work with Python 3
### What changes were proposed in this pull request?

This PR proposes to make the scripts work by:
- Recovering credit-related scripts that were broken by https://github.com/apache/spark/pull/29563
    (`raw_input` no longer exists in `releaseutils` and only exists in Python 2)
- Dropping Python 2 in these scripts, because we dropped Python 2 in https://github.com/apache/spark/pull/28957
- Making these scripts work with Python 3

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only change.

### How was this patch tested?

I manually tested against Spark 3.1.1 RC3.

Closes #31660 from HyukjinKwon/SPARK-34551.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 20:19:33 +09:00
Max Gekk 5c7d019b60 [SPARK-34543][SQL] Respect the spark.sql.caseSensitive config while resolving partition spec in v1 SET LOCATION
### What changes were proposed in this pull request?
Preprocess the partition spec passed to the V1 `ALTER TABLE .. SET LOCATION` implementation `AlterTableSetLocationCommand`, and normalize the passed spec against the partition columns with respect to the case-sensitivity flag **spark.sql.caseSensitive**.

### Why are the changes needed?
V1 `ALTER TABLE .. SET LOCATION` is in fact case-sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl (id INT, part INT) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl' PARTITION (part=0);
Location: file:/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0	0
```
Create new partition folder in the file system:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa
```
Set new location for the partition part=1:
```sql
spark-sql> ALTER TABLE tbl PARTITION (part=1) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa';
spark-sql> SELECT * FROM tbl;
0	0
0	1
spark-sql> ALTER TABLE tbl ADD PARTITION (PART=2);
spark-sql> SELECT * FROM tbl;
0	0
0	1
```
Set location for a partition in the upper case:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb
```
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
Error in query: Partition spec is invalid. The spec (PART) must match the partition spec (part) defined in table '`default`.`tbl`'
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
spark-sql> SELECT * FROM tbl;
0	0
0	1
0	2
```

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31651 from MaxGekk/set-location-case-sense.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-26 15:28:57 +08:00
Dongjoon Hyun 4d428a821b Revert "[SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests"
This reverts commit b17754a8cb.
2021-02-25 17:10:58 -08:00
yangjie01 0d3a9cd3c9 [SPARK-34535][SQL] Cleanup unused symbol in Orc related code
### What changes were proposed in this pull request?
Clean up unused symbols in ORC-related code as follows:

- `OrcDeserializer`: the `dataSchema` parameter in the constructor
- `OrcFilters`: the `schema` parameter in the `convertibleFilters` method
- `OrcPartitionReaderFactory`: the ignored return value of `OrcUtils.orcResultSchemaString` in the `buildReader(file: PartitionedFile)` method

### Why are the changes needed?
Cleanup code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31644 from LuciferYang/cleanup-orc-unused-symbol.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-26 09:20:40 +09:00
Dongjoon Hyun 1967760277 [SPARK-34505][BUILD] Upgrade Scala to 2.13.5
### What changes were proposed in this pull request?

This PR aims to update from Scala 2.13.4 to Scala 2.13.5 for Apache Spark 3.2.

### Why are the changes needed?

Scala 2.13.5 is a maintenance release for the 2.13 line and improves Java 13, 14, 15, 16, and 17 support.
- https://github.com/scala/scala/releases/tag/v2.13.5

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the GitHub Action `Scala 2.13` job and manual test.

I verified the following locally and all passed.
```
$ dev/change-scala-version.sh 2.13
$ build/sbt test -Pscala-2.13
```

Closes #31620 from dongjoon-hyun/SPARK-34505.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-25 15:27:46 -08:00
Wenchen Fan dffb01f28a [SPARK-34152][SQL][FOLLOWUP] Do not uncache the temp view if it doesn't exist
### What changes were proposed in this pull request?

This PR fixes a mistake in https://github.com/apache/spark/pull/31273. When running CREATE OR REPLACE on a temp view, we need to uncache the existing temp view being replaced. However, we shouldn't uncache anything if there is no existing temp view.

This doesn't cause real issues because the uncache action is failure-safe. But it produces a lot of warning messages.

### Why are the changes needed?

Avoid unnecessary warning logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Manually ran tests and checked the warning messages.

Closes #31650 from cloud-fan/warnning.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-25 15:25:41 -08:00
Liang-Chi Hsieh f7ac2d655c [SPARK-34474][SQL] Remove unnecessary Union under Distinct/Deduplicate
### What changes were proposed in this pull request?

This patch proposes to let the optimizer remove an unnecessary `Union` under `Distinct`/`Deduplicate`.

### Why are the changes needed?

For a `Union` under `Distinct`/`Deduplicate`, if its children are all the same, we can keep just one of them and remove the `Union`.
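
A small sketch of the pattern the rule removes (toy data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-under-distinct").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 2, 3).toDF("v")

// Distinct over a union of identical children is equivalent to distinct over a
// single child, so the optimizer can drop the Union (and the duplicated scan).
val unioned = df.union(df).distinct()
val single  = df.distinct()

assert(unioned.as[Int].collect().sorted.sameElements(single.as[Int].collect().sorted))
```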

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31595 from viirya/remove-union.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-02-25 12:41:07 -08:00
Yuming Wang 4a3200b08a [SPARK-34436][SQL] DPP support LIKE ANY/ALL expression
### What changes were proposed in this pull request?

This PR makes DPP support LIKE ANY/ALL expressions:
```sql
SELECT date_id, product_id FROM fact_sk f
JOIN dim_store s
ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31563 from wangyum/SPARK-34436.

Lead-authored-by: Yuming Wang <yumwang@apache.org>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-25 18:07:39 +08:00
Max Gekk c56af69cdf [SPARK-34518][SQL] Rename AlterTableRecoverPartitionsCommand to RepairTableCommand
### What changes were proposed in this pull request?
Rename the execution node `AlterTableRecoverPartitionsCommand` for the commands:
- `MSCK REPAIR TABLE table [{ADD|DROP|SYNC} PARTITIONS]`
- `ALTER TABLE table RECOVER PARTITIONS`

to `RepairTableCommand`.

### Why are the changes needed?
1. After the PR https://github.com/apache/spark/pull/31499, `ALTER TABLE table RECOVER PARTITIONS` is equivalent to `MSCK REPAIR TABLE table ADD PARTITIONS`, and mapping the generic command `MSCK REPAIR TABLE` to the more specific execution node `AlterTableRecoverPartitionsCommand` can confuse devs in the future.
2. `ALTER TABLE table RECOVER PARTITIONS` does not support any options/extensions, so the additional parameters `enableAddPartitions` and `enableDropPartitions` in `AlterTableRecoverPartitionsCommand` are confusing as well.

### Does this PR introduce _any_ user-facing change?
No, because this is an internal API.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt "test:testOnly *AlterTableRecoverPartitionsParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
$ build/sbt "test:testOnly *MsckRepairTableParserSuite"
```

Closes #31635 from MaxGekk/rename-recover-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-25 09:32:41 +00:00
HyukjinKwon 8a1e172b51 [SPARK-34520][CORE] Remove unused SecurityManager references
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945.
Many of the references in `SecurityManager` were introduced in SPARK-1189, and the related usages were removed later in https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945. This PR proposes to remove them.

### Why are the changes needed?

For better readability of the code.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually compiled. GitHub Actions and the Jenkins build should test it out as well.

Closes #31636 from HyukjinKwon/SPARK-34520.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 20:38:03 -08:00
HyukjinKwon 22383e312d [SPARK-34531][CORE] Remove Experimental API tag in PrometheusServlet
### What changes were proposed in this pull request?

The endpoints of Prometheus metrics are properly marked and documented as experimental (SPARK-31674). The class `PrometheusServlet` itself is not part of an API, so this PR proposes to remove the Experimental tag from it.

### Why are the changes needed?

To avoid marking a non-API as an API.

### Does this PR introduce _any_ user-facing change?

No, the class is already `private[spark]`.

### How was this patch tested?

Existing tests should cover.

Closes #31640 from HyukjinKwon/SPARK-34531.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 18:11:25 -08:00
Gabor Somogyi 44eadb943b [SPARK-34497][SQL] Fix built-in JDBC connection providers to restore JVM security context changes
### What changes were proposed in this pull request?
Some of the built-in JDBC connection providers change the JVM security context to do the authentication, which is fine. The problematic part is that executors can be reused by another query. The following situation leads to incorrect behaviour:
* Query1 opens JDBC connection and changes JVM security context in Executor1
* Query2 tries to open JDBC connection but it realizes there is already an entry for that DB type in Executor1
* Query2 is not changing JVM security context and uses Query1 keytab and principal
* Query2 fails with authentication error

In this PR I've changed the code in such a way that the JVM security context is still changed every time, but only temporarily until the connection is built up, and then it is rolled back. Since `getConnection` is synchronised with `SecurityConfigurationLock`, it ends up with correct behaviour without any race.
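
The general save-and-restore idea looks roughly like the sketch below; this is an illustration only (not the actual provider code), and `buildConnection` is a hypothetical placeholder:

```scala
import javax.security.auth.login.Configuration

// Temporarily install a JAAS configuration while the connection is built up,
// then roll back to whatever was installed before, so a reused executor does
// not leak one query's keytab/principal into the next query.
def withTemporarySecurityConfig[T](temp: Configuration)(buildConnection: => T): T = {
  val previous = Configuration.getConfiguration
  Configuration.setConfiguration(temp)
  try {
    buildConnection
  } finally {
    Configuration.setConfiguration(previous)
  }
}
```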

### Why are the changes needed?
Incorrect JVM security context handling.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit + integration tests.

Closes #31622 from gaborgsomogyi/SPARK-34497.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-25 09:25:17 +09:00
“attilapiros” b17754a8cb [SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests
### What changes were proposed in this pull request?

From [minikube version v1.1.0](https://github.com/kubernetes/minikube/blob/v1.1.0/CHANGELOG.md), kubectl is available as a command, so the kubeconfig settings can be accessed like:

```
$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://127.0.0.1:32788
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Here the vm-driver was docker and the server port (https://127.0.0.1:32788) is different from the hardcoded 8443.

So the main part of this PR introduces Kubernetes client configuration based on the kubeconfig (the output of `minikube kubectl config view`) for minikube versions after v1.1.0; the old legacy way of configuration is also kept, as minikube versions back to v0.34.1 should still be supported.

Moreover, the old style of config parsing wasn't sufficient in my case: when `minikube kubectl config view` is called, a kubectl download message might be included before the first key. So I changed the parsing, even for the existing keys, to use a consistent pattern in this file.

The old parsing in an example:
```
private val HOST_PREFIX = "host:"

val hostString = statusString.find(_.contains(s"$HOST_PREFIX "))

val status1 = hostString.get.split(HOST_PREFIX)(1)
```

The new parsing:
```
private val HOST_PREFIX = "host: "

val hostString = statusString.find(_.contains(HOST_PREFIX))

hostString.get.split(HOST_PREFIX)(1)
```

So the prefix is extended with the extra space at its declaration (this way the two separate string operations are safer and consistent with each other), and the replace is changed to a split, taking the second string from the result (which is guaranteed to contain only the text after the prefix when the prefix is a contained substring).

Finally, there is a tiny change in `dev-run-integration-tests.sh` to introduce `--skip-building-dependencies`, which switches off building the Maven dependencies of `kubernetes-integration-tests` from the Spark project.
This can be used when only `kubernetes-integration-tests` needs to be rebuilt because only the tests were modified.

### Why are the changes needed?

Kubernetes client configuration based on kubeconfig settings is more reliable and provides a solution which is minikube version independent.

### Does this PR introduce _any_ user-facing change?

No. This is only test code.

### How was this patch tested?

Tested manually on two minikube versions.

Minikube  v0.34.1:

```
$ minikube version
minikube version: v0.34.1

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log
20/12/12 12:52:25.135 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v0.34.1
20/12/12 12:52:25.761 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.99.103:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/apiserver.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/apiserver.key
```

Minikube v1.15.1
```
$ minikube version

minikube version: v1.15.1
commit: 23f40a012abb52eff365ff99a709501a61ac5876

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log

20/12/13 06:25:55.086 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v1.15.1
20/12/13 06:25:55.597 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.64.4:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key

$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://192.168.64.4:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Closes #30751 from attilapiros/SPARK-32617.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-24 11:46:27 -08:00
ulysses-you 999d3b89b6 [SPARK-34515][SQL] Fix NPE if InSet contains null value during getPartitionsByFilter
### What changes were proposed in this pull request?

Skip null values when rewriting `InSet` to `>= and <=` in getPartitionsByFilter.

### Why are the changes needed?

Spark will convert `InSet` to `>= and <=` if its number of values is over `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition pruning. In this case, if the values contain a null, we will get an exception like:
 
```
java.lang.NullPointerException
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
 at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
 at java.util.TimSort.sort(TimSort.java:220)
 at java.util.Arrays.sort(Arrays.java:1438)
 at scala.collection.SeqLike.sorted(SeqLike.scala:659)
 at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
 at scala.collection.AbstractSeq.sorted(Seq.scala:45)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
 at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
 at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
 at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add test.

Closes #31632 from ulysses-you/SPARK-34515.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 21:32:19 +08:00
Wenchen Fan 87409c42bc [SPARK-31891][SQL][DOCS][FOLLOWUP] Fix typo in the description of MSCK REPAIR TABLE
### What changes were proposed in this pull request?
Fix typo and highlight that `ADD PARTITIONS` is the default.

### Why are the changes needed?
Fix a typo which can mislead users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
n/a

Closes #31633 from MaxGekk/repair-table-drop-partitions-followup.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-24 21:13:58 +09:00
Cheng Su 6ef57d31cd [SPARK-34514][SQL] Push down limit for LEFT SEMI and LEFT ANTI join
### What changes were proposed in this pull request?

I found out during code review of https://github.com/apache/spark/pull/31567#discussion_r577379572 that we can push down a limit to the left side of LEFT SEMI and LEFT ANTI joins, if the join condition is empty.

Why it's safe to push down limit:

The semantics of LEFT SEMI join without condition:
(1). if right side is non-empty, output all rows from left side.
(2). if right side is empty, output nothing.

The semantics of LEFT ANTI join without condition:
(1). if right side is non-empty, output nothing.
(2). if right side is empty, output all rows from left side.

With the semantics of outputting either all rows from the left side or nothing (all or nothing), it's safe to push the limit down to the left side.
NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit push down, because output can be a portion of left side rows.

Reference: physical operator implementation for LEFT SEMI / LEFT ANTI join without condition - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204 .
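
For illustration, a sketch of a query shape where the pushdown applies (table names are placeholders); the limit can be applied to `t1` before the join without changing the result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("limit-pushdown").getOrCreate()
import spark.implicits._

(1 to 100).toDF("a").createOrReplaceTempView("t1")
Seq(1).toDF("b").createOrReplaceTempView("t2")

// LEFT SEMI with no condition outputs either all of t1 or nothing, so pushing
// LIMIT 10 below the join onto t1 is safe.
spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 LIMIT 10").explain()
```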

### Why are the changes needed?

Better performance. This saves CPU and IO for these joins, as the limit is pushed down before the join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `LimitPushdownSuite.scala` and `SQLQuerySuite.scala`.

Closes #31630 from c21/limit-pushdown.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 10:23:01 +00:00
beliefer 14934f42d0 [SPARK-33599][SQL][FOLLOWUP] Group exception messages in catalyst/analysis
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/30717
Maybe some contributors don't know about this effort and added some exceptions in the old way.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31316 from beliefer/SPARK-33599-followup.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 07:28:44 +00:00
Terry Kim 714ff73d4a [SPARK-34152][SQL] Make CreateViewStatement.child to be LogicalPlan's children so that it's resolved in analyze phase
### What changes were proposed in this pull request?

This PR proposes to make `CreateViewStatement.child` to be `LogicalPlan`'s `children` so that it's resolved in the analyze phase.

### Why are the changes needed?

Currently, the `CreateViewStatement.child` is resolved when the create view command runs, which is inconsistent with other plan resolutions. For example, you may see the following in the physical plan:
```
== Physical Plan ==
Execute CreateViewCommand (1)
   +- CreateViewCommand (2)
         +- Project (4)
            +- UnresolvedRelation (3)
```

### Does this PR introduce _any_ user-facing change?

Yes. For the example, you will now see the resolved plan:
```
== Physical Plan ==
Execute CreateViewCommand (1)
   +- CreateViewCommand (2)
         +- Project (5)
            +- SubqueryAlias (4)
               +- LogicalRelation (3)
```

### How was this patch tested?

Updated existing tests.

Closes #31273 from imback82/spark-34152.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 06:50:11 +00:00
Gengliang Wang 5d9cfd727c [SPARK-34246][SQL] New type coercion syntax rules in ANSI mode
### What changes were proposed in this pull request?

In Spark ANSI mode, the type coercion rules are based on the type precedence lists of the input data types.
As per the section "Type precedence list determination" of "ISO/IEC 9075-2:2011
Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation)", the type precedence lists of primitive data types are as following:

- Byte: Byte, Short, Int, Long, Decimal, Float, Double
- Short: Short, Int, Long, Decimal, Float, Double
- Int: Int, Long, Decimal, Float, Double
- Long: Long, Decimal, Float, Double
- Decimal: Any wider Numeric type
- Float: Float, Double
- Double: Double
- String: String
- Date: Date, Timestamp
- Timestamp: Timestamp
- Binary: Binary
- Boolean: Boolean
- Interval: Interval

As for complex data types, Spark determines the precedence list recursively based on their sub-types.

With the definition of type precedent list, the general type coercion rules are as following:
- Data type S is allowed to be implicitly cast as type T iff T is in the precedence list of S
- Comparison is allowed iff the data type precedence lists of both sides have at least one common element. When evaluating the comparison, Spark casts both sides to the tightest common data type of their precedence lists.
- There should be at least one common data type among all the children's precedence lists for the following operators. The data type of the operator is the tightest common precedence data type.
```
 In, Except(odd), Intersect, Greatest, Least, Union, If, CaseWhen, CreateArray, Array Concat,Sequence, MapConcat, CreateMap
```

- For complex types (struct, array, map), Spark recursively looks into the element type and applies the rules above. If the element nullability is converted from true to false, add runtime null check to the elements.

Note: this new type coercion system still allows implicitly converting String type literals to other primitive types, so as not to break too many existing Spark SQL queries. This is a special rule and is not from the ANSI SQL standard.
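
A small illustrative sketch under the rules above, with ANSI mode enabled via `spark.sql.ansi.enabled` (only combinations explicitly covered by the description are shown):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ansi-coercion").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

// Int and Long share Long in their precedence lists, so the comparison is
// allowed and evaluated after widening the Int side to Long.
spark.sql("SELECT CAST(1 AS INT) = CAST(1 AS BIGINT)").show()

// Date's precedence list contains Timestamp, so a Date can be compared with a
// Timestamp by casting the Date side up to Timestamp.
spark.sql("SELECT DATE'2021-02-24' < TIMESTAMP'2021-02-24 10:00:00'").show()
```
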
### Why are the changes needed?

The current type coercion rules are complex. Also, they are very hard to describe and understand. For details, please refer to the attached documentation "Default Type coercion rules of Spark".
[Default Type coercion rules of Spark.pdf](https://github.com/apache/spark/files/5874362/Default.Type.coercion.rules.of.Spark.pdf)

This PR is to create a new and strict type coercion system under ANSI mode. The rules are simple and clean, so that users can follow them easily.

### Does this PR introduce _any_ user-facing change?

Yes,  new implicit cast syntax rules in ANSI mode. All the details are in the first section of this description.

### How was this patch tested?

Unit tests

Closes #31349 from gengliangwang/ansiImplicitConversion.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-02-24 13:40:58 +08:00
Max Gekk f64fc22466 [SPARK-34290][SQL] Support v2 TRUNCATE TABLE
### What changes were proposed in this pull request?
Implement the v2 execution node for the `TRUNCATE TABLE` command.

### Why are the changes needed?
To have feature parity with DS v1, and support truncation of v2 tables.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running the unified tests for v1 and v2 tables:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *TruncateTableSuite"
```

Closes #31605 from MaxGekk/truncate-table-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-24 05:21:11 +00:00