Commit graph

28989 commits

Author SHA1 Message Date
Kousuke Saruta acf0a4fac2 [SPARK-33999][BUILD] Make sbt unidoc success with JDK11
### What changes were proposed in this pull request?

This PR fixes an issue that `sbt unidoc` fails with JDK11.
With the current master, `sbt unidoc` fails because the generated Java sources cause syntax error.
As of JDK11, the default doclet seems to refuse such syntax error.

Usually, it's enough to specify  `--ignore-source-errors` option when `javadoc` runs to suppress the syntax error but unfortunately, we will then get an internal error.

```
[error] javadoc: error - An internal exception has occurred.
[error] 	(java.lang.NullPointerException)
[error] Please file a bug against the javadoc tool via the Java bug reporting page
[error] (http://bugreport.java.com) after checking the Bug Database (http://bugs.java.com)
[error] for duplicates. Include error messages and the following diagnostic in your report. Thank you.
[error] java.lang.NullPointerException
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2340)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2398)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2348)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1659)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$DefaultTypeVisitor.visit(Types.java:4857)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2343)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2329)
[error] 	at jdk.compiler/com.sun.tools.javac.model.JavacTypes.erasure(JavacTypes.java:134)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1069)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1048)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1695)
[error] 	at java.compiler11.0.9.1/javax.lang.model.util.AbstractTypeVisitor6.visit(AbstractTypeVisitor6.java:104)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils.asTypeElement(Utils.java:1086)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.setContext(LinkInfoImpl.java:410)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.<init>(LinkInfoImpl.java:285)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLink(LinkFactoryImpl.java:184)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLinks(LinkFactoryImpl.java:167)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.links.LinkFactory.getLink(LinkFactory.java:196)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.getLink(HtmlDocletWriter.java:679)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedClassLink(HtmlDocletWriter.java:814)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedStrongClassLink(HtmlDocletWriter.java:839)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addPartialInfo(AbstractTreeWriter.java:185)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:92)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:94)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:129)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:112)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generatePackageTreeFile(PackageTreeWriter.java:115)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generate(PackageTreeWriter.java:92)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDoclet.generatePackageFiles(HtmlDoclet.java:312)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.startGeneration(AbstractDoclet.java:210)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.run(AbstractDoclet.java:114)
[error] 	at jdk.javadoc/jdk.javadoc.doclet.StandardDoclet.run(StandardDoclet.java:72)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.parseAndExecute(Start.java:588)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:432)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:345)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Main.execute(Main.java:63)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Main.main(Main.java:52)
```

I found the internal error happens when a generated Java class is from a Scala class which is package private and generic.
I also found that if we don't generate class hierarchy tree in the JavaDoc, we can suppress the internal error for JDK11 and later.

### Why are the changes needed?

Make the build success with sbt and JDK11.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the following command successfully finish with JDK8 and JDK11.
```
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Pmesos -Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud clean unidoc
```

I also confirmed html files are successfully generated under `target/javaunidoc`.

Closes #31023 from sarutak/fix-genjavadoc-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 19:03:28 +09:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name. I see here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid the environment variables is unexpectedly conflicted.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable but it's not released yet.

### How was this patch tested?

Existing unittests will test.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
HyukjinKwon 356fdc9a7f [SPARK-34007][BUILD] Downgrade scala-maven-plugin to 4.3.0
### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/30456 by downgrading scala-maven-plugin from 4.4.0 to 4.3.0.

Currently, when you run the docker release script (`./dev/create-release/do-release-docker.sh`), it fails to compile as below during incremental compilation with zinc for an unknown reason:

```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
java.lang.ClassLoader.defineClass(ClassLoader.java:754)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
java.net.URLClassLoader.access$100(URLClassLoader.java:74)
java.net.URLClassLoader$1.run(URLClassLoader.java:369)
java.net.URLClassLoader$1.run(URLClassLoader.java:363)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:362)
java.lang.ClassLoader.loadClass(ClassLoader.java:418)
java.lang.ClassLoader.loadClass(ClassLoader.java:351)
java.lang.Class.getDeclaredMethods0(Native Method)
java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
java.lang.Class.privateGetPublicMethods(Class.java:2902)
java.lang.Class.getMethods(Class.java:1615)
sbt.internal.inc.ClassToAPI$.toDefinitions0(ClassToAPI.scala:170)
sbt.internal.inc.ClassToAPI$.$anonfun$toDefinitions$1(ClassToAPI.scala:123)
scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
sbt.internal.inc.ClassToAPI$.toDefinitions(ClassToAPI.scala:123)
sbt.internal.inc.ClassToAPI$.$anonfun$process$1(ClassToAPI.scala:3
```

This happens when it builds Spark with Hadoop 2. It doesn't reproduce when you build this alone. It should follow the sequence of build in the release script.

This is fixed by downgrading. Looks like there is a regression in scala-maven-plugin somewhere between 4.4.0 and 4.3.0.

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It can be tested as below:

```bash
./dev/create-release/do-release-docker.sh -d $WORKING_DIR
```

Closes #31031 from HyukjinKwon/SPARK-34007.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:20:08 +09:00
Max Gekk 122f8f0fdb [SPARK-33919][SQL][TESTS] Unify v1 and v2 SHOW NAMESPACES tests
### What changes were proposed in this pull request?
1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs.
2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite"
```

Closes #30937 from MaxGekk/unify-show-namespaces-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 07:30:59 +00:00
tanel.kiis@gmail.com f252a9334e [SPARK-33935][SQL] Fix CBO cost function
### What changes were proposed in this pull request?

Changed the cost function in CBO to match documentation.

### Why are the changes needed?

The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
```
The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
```
The implementation in `JoinReorderDP.betterThan` does not match this documentaiton:
```
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
      if (other.planCost.card == 0 || other.planCost.size == 0) {
        false
      } else {
        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
        relativeRows * conf.joinReorderCardWeight +
          relativeSize * (1 - conf.joinReorderCardWeight) < 1
      }
    }
```

This different implementation has an unfortunate consequence:
given two plans A and B, both A betterThan B and B betterThan A might give the same results. This happes when one has many rows with small sizes and other has few rows with large sizes.

A example values, that have this fenomen with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have score above 1 and would return false.

This happens with several of the TPCDS queries.

The new implementation does not have this behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and existing UTs

Closes #30965 from tanelk/SPARK-33935_cbo_cost_function.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 16:00:24 +09:00
fwang12 a071826f72 [SPARK-33100][SQL] Ignore a semicolon inside a bracketed comment in spark-sql
### What changes were proposed in this pull request?
Now the spark-sql does not support parse the sql statements with bracketed comments.
For the sql statements:
```
/* SELECT 'test'; */
SELECT 'test';
```
Would be split to two statements:
The first one: `/* SELECT 'test'`
The second one: `*/ SELECT 'test'`

Then it would throw an exception because the first one is illegal.
In this PR, we ignore the content in bracketed comments while splitting the sql statements.
Besides, we ignore the comment without any content.

### Why are the changes needed?
Spark-sql might split the statements inside bracketed comments and it is not correct.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #29982 from turboFei/SPARK-33110.

Lead-authored-by: fwang12 <fwang12@ebay.com>
Co-authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 15:55:30 +09:00
LantaoJin a7d3fcd354 [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.NoSuchElementException
### What changes were proposed in this pull request?
From below log, Stage 600 could be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 threw `NoSuchElementException` when it entered into `onTaskEnd()`.
```
21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
previous stage needs to be re-run, or because a different copy of the task has already succeeded).
21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
        at scala.collection.MapLike.default(MapLike.scala:235)
        at scala.collection.MapLike.default$(MapLike.scala:234)
        at scala.collection.AbstractMap.default(Map.scala:63)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
        at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
```

### Why are the changes needed?
To avoid throwing the java.util.NoSuchElementException

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a protective patch and it's not easy to reproduce in UT due to the event order is not fixed in a async queue.

Closes #31025 from LantaoJin/SPARK-34000.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:37:26 -08:00
Kent Yao f0ffe0cd65 [SPARK-33992][SQL] override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/29643, we move  the plan rewriting methods to QueryPlan. we need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer
 because it and resolveOperatorsUpWithNewOutput are called in the analyzer.
For example,

PaddingAndLengthCheckForCharVarchar could fail query when resolveOperatorsUpWithNewOutput
with
```logtalk
[info] - char/varchar resolution in sub query  *** FAILED *** (367 milliseconds)
[info]   java.lang.RuntimeException: This method should not be called in the analyzer
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267)
```
### Why are the changes needed?

trivial bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31013 from yaooqinn/SPARK-33992.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:34:11 +00:00
Terry Kim 15a863fd54 [SPARK-34001][SQL][TESTS] Remove unused runShowTablesSql() in DataSourceV2SQLSuite.scala
### What changes were proposed in this pull request?

After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method.

### Why are the changes needed?

To remove unused method.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #31022 from imback82/33382-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:32:49 -08:00
Terry Kim 6b00fdc756 [SPARK-33998][SQL] Provide an API to create an InternalRow in V2CommandExec
### What changes were proposed in this pull request?

There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s are duplicated across many commands and it can be moved into `V2CommandExec` to remove duplicate code.

### Why are the changes needed?

To clean up duplicate code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test since this is just refactoring.

Closes #31020 from imback82/refactor_v2_command.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:32:36 +00:00
Chongguang LIU 976e97a80d [SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode
### What changes were proposed in this pull request?

Instead of returning NULL, the next_day function throws runtime IllegalArgumentException when ansiMode is enable and receiving invalid input of the dayOfWeek parameter.

### Why are the changes needed?

For ansiMode.

### Does this PR introduce _any_ user-facing change?

Yes.
When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter.
When spark.sql.ansi.enabled = false, same behaviour as before.

### How was this patch tested?

Ansi mode is tested with existing tests.
End-to-end tests have been added.

Closes #30807 from chongguang/SPARK-33794.

Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:20:16 +00:00
tanel.kiis@gmail.com bb6d6b5602 [SPARK-33964][SQL] Combine distinct unions in more cases
### What changes were proposed in this pull request?

Added the `RemoveNoopOperators` rule to optimization batch `Union`.  Also made sure that the `RemoveNoopOperators` would be idempotent.

### Why are the changes needed?

In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them.
The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UTs and the output of `PlanStabilitySuite`

Closes #30996 from tanelk/SPARK-33964_combine_unions.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 11:01:31 +09:00
angerszhu 559f411da8 [SPARK-33908][CORE][FOLLOWUP] Correct Scaladoc of resolveDependencyPaths/resolveMavenDependencies
### What changes were proposed in this pull request?
Fix un-correct doc of last change https://github.com/apache/spark/pull/30922#discussion_r551453193

### Why are the changes needed?
FIx doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Builds finished correctly.

Closes #31016 from AngersZhuuuu/SPARK-33908-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:44:42 -08:00
Koert Kuipers 9b4173fa95 [SPARK-33894][SQL] Change visibility of private case classes in mllib to avoid runtime compilation errors with Scala 2.13
### What changes were proposed in this pull request?
Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass]

### Why are the changes needed?
Without this change when running tests for Scala 2.13 you get runtime code generation errors. These errors look like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests now pass for Scala 2.13

Closes #31018 from koertkuipers/feat-visibility-scala213.

Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:40:32 -08:00
Max Gekk 84c1f43669 [SPARK-33987][SQL] Refresh cache in v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command.
2. Port the test for v1 catalogs to the base suite to run it for v2 table catalog.

### Why are the changes needed?
The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987.

### Does this PR introduce _any_ user-facing change?
Yes, it could if users have v2 table catalogs.

### How was this patch tested?
By running unified tests for `ALTER TABLE .. DROP PARTITION`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:00:48 -08:00
William Hyun 90f4ecf8cc [SPARK-33996][BUILD] Upgrade checkstyle plugins
### What changes were proposed in this pull request?

This PR aims to upgrade `checkstyle` Maven plugins and its dependency, `com.puppycrawl.tools:checkstyle`.

### Why are the changes needed?

The changes are needed to support Java 14+ better.
- https://checkstyle.org/releasenotes.html#Release_8.39
- https://checkstyle.org/releasenotes.html#Release_8.38
- https://checkstyle.org/releasenotes.html#Release_8.37
- https://checkstyle.org/releasenotes.html#Release_8.36
- https://checkstyle.org/releasenotes.html#Release_8.35
- https://checkstyle.org/releasenotes.html#Release_8.34
- https://checkstyle.org/releasenotes.html#Release_8.33
- https://checkstyle.org/releasenotes.html#Release_8.32
- https://checkstyle.org/releasenotes.html#Release_8.31
- https://checkstyle.org/releasenotes.html#Release_8.30

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CI.

Closes #31019 from williamhyun/checkstyle.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 14:54:16 -08:00
Kent Yao ac4651a7d1 [SPARK-33980][SS] Invalidate char/varchar in spark.readStream.schema
### What changes were proposed in this pull request?

invalidate char/varchar in `spark.readStream.schema` just like what we've done for `spark.read.schema` in da72b87374

### Why are the changes needed?

bugfix, char/varchar is only for table schema while `spark.sql.legacy.charVarcharAsString=false`

### Does this PR introduce _any_ user-facing change?

yes, char/varchar will fail to define ss readers when `spark.sql.legacy.charVarcharAsString=false`

### How was this patch tested?

new tests

Closes #31003 from yaooqinn/SPARK-33980.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 12:59:45 -08:00
HyukjinKwon d6322bf70c [SPARK-33983][PYTHON] Update cloudpickle to v1.6.0
### What changes were proposed in this pull request?

This PR proposes to upgrade cloudpickle from 1.5.0 to 1.6.0.
It virtually contains one fix:

4510be850d

From a cursory look, this isn't a regression, and not even properly supported in Python:

```python
>>> import pickle
>>> pickle.dumps({}.keys())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'dict_keys' object
```

So it seems fine not to backport.

### Why are the changes needed?

To leverage bug fixes from the cloudpickle upstream.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub actions build will test it out.

Closes #31007 from HyukjinKwon/cloudpickle-upgrade.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:36:31 -08:00
Takeshi Yamamuro 414d323d6c [SPARK-33988][SQL][TEST] Add an option to enable CBO in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful so as to monitor performance changes with CBO enabled.

### Why are the changes needed?

To monitor performance chaneges with CBO enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:31:20 -08:00
Max Gekk fc3f22645e [SPARK-33990][SQL][TESTS] Remove partition data by v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests.

### Why are the changes needed?
This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belongs to the dropped partition. As a consequence of that, the `select` query returns removed data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running tests suites for v1 and v2 catalogs:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31014 from MaxGekk/fix-drop-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:26:39 -08:00
HyukjinKwon 6b86aa0b52 [SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1
### What changes were proposed in this pull request?

This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes #31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:23:38 -08:00
Terry Kim ddc0d5148a [SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables
### What changes were proposed in this pull request?

This PR proposes to implement `DESCRIBE COLUMN` for v2 tables.

Note that `isExnteded` option is not implemented in this PR.

### Why are the changes needed?

Parity with v1 tables.

### Does this PR introduce _any_ user-facing change?

Yes, now, `DESCRIBE COLUMN` works for v2 tables.
```scala
sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
```
```
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|      data|
|data_type|    string|
|  comment|     hello|
+---------+----------+
```

Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.`

### How was this patch tested?

Added new test.

Closes #30881 from imback82/describe_col_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 16:14:33 +00:00
angerszhu 8583a4605f [SPARK-33844][SQL] InsertIntoHiveDir command should check col name too
### What changes were proposed in this pull request?

In hive-1.2.1, hive serde just split `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` use comma.

When we use spark 2.4 with UT
```
  test("insert overwrite directory with comma col name") {
    withTempDir { dir =>
      val path = dir.toURI.getPath

      val v1 =
        s"""
           | INSERT OVERWRITE DIRECTORY '${path}'
           | STORED AS TEXTFILE
           | SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
         """.stripMargin

      sql(v1).explain(true)

      sql(v1).show()
    }
  }
```
failed with as below since column name contains `,` then column names and column types size not equal.
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter:  [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

```

After hive-2.3 we will set COLUMN_NAME_DELIMITER to special char when col name cntains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)

And in script transform, we parse column name  to avoid this problem
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)

So I think in `InsertIntoHiveDirComman`, we should do same thing too. And I have verified this method can make spark-2.4 work well.

### Why are the changes needed?
More save use serde

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #30850 from AngersZhuuuu/SPARK-33844.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 09:43:15 +00:00
Dongjoon Hyun 271c4f6e00 [SPARK-33978][SQL] Support ZSTD compression in ORC data source
### What changes were proposed in this pull request?

This PR aims to support ZSTD compression in ORC data source.

### Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost.
- https://issues.apache.org/jira/browse/ORC-363

**BEFORE**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
```

**AFTER**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
```

```bash
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
```

### Does this PR introduce _any_ user-facing change?

Yes, this is a new feature.

### How was this patch tested?

Pass the newly added test case.

Closes #31002 from dongjoon-hyun/SPARK-33978.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 00:54:47 -08:00
Max Gekk 8b3fb43f40 [SPARK-33965][SQL][TESTS] Recognize spark_catalog by CACHE TABLE in Hive table names
### What changes were proposed in this pull request?
Remove special handling of `CacheTable` in `TestHiveQueryExecution. analyzed` because it does not allow to support of `spark_catalog` in Hive table names. `spark_catalog` could be handled by a few lines below:
```scala
      case UnresolvedRelation(ident, _, _) =>
        if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) {
```
added by https://github.com/apache/spark/pull/30883.

### Why are the changes needed?
1. To have feature parity with v1 In-Memory catalog.
2. To be able to write unified tests for In-Memory and Hive external catalogs.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30997 from MaxGekk/cache-table-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 08:28:26 +00:00
Hoa 0b647fe69c [SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
### What changes were proposed in this pull request?
JDBC SQL TIME type represents incorrectly as TimestampType, we change it to be physical Int in millis for now.

### Why are the changes needed?
Currently, for JDBC, SQL TIME type represents incorrectly as Spark TimestampType. This should be represent as physical int in millis Represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond. It stores the number of milliseconds after midnight, 00:00:00.000.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Close #30902

Closes #30902 from saikocat/SPARK-33888.

Lead-authored-by: Hoa <hoameomu@gmail.com>
Co-authored-by: Hoa <saikocatz@gmail.com>
Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn>
Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 06:53:12 +00:00
angerszhu adac633f93 [SPARK-33934][SQL] Add SparkFile's root dir to env property PATH
### What changes were proposed in this pull request?
In hive we always use
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
Since in spark we wrapper script command with `/bash/bin -c`, in this case we will throw `script.py command not found`.

This pr add a SparkFile's root dir path to execution env property `PATH`, then  sub-processor will find `scrip.py` as program under `PATH`.

### Why are the changes needed?
Support SQL migration form Hive to Spark.

### Does this PR introduce _any_ user-facing change?
User can direct use script file name as program in script transform SQL.

```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
### How was this patch tested?
UT

Closes #30973 from AngersZhuuuu/SPARK-33934.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-04 15:46:49 +09:00
Yuming Wang 2a68ed71e4 [SPARK-33954][SQL] Some operator missing rowCount when enable CBO
### What changes were proposed in this pull request?

This pr fix some operator missing rowCount when enable CBO, e.g.:
```scala
spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("set spark.sql.cbo.planStats.enabled=true")
spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```

After this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)

```

### Why are the changes needed?

 [`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) need the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30987 from wangyum/SPARK-33954.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:53:14 +00:00
gengjiaan b037930952 [SPARK-33951][SQL] Distinguish the error between filter and distinct
### What changes were proposed in this pull request?
The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use.

### Why are the changes needed?
increase readability and ease of use.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test

Closes #30982 from beliefer/SPARK-33951.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:44:00 +00:00
Max Gekk 67195d0d97 [SPARK-33950][SQL] Refresh cache in v1 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0	0
1	1
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
0	0
1	1
```
The last query must not return `0	0` since it was deleted by previous command.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
1	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30983 from MaxGekk/drop-partition-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 04:11:39 +00:00
Ruifeng Zheng 6b7527e381 [SPARK-33398] Fix loading tree models prior to Spark 3.0
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added into `NodeData`, which cause that a tree model trained in 2.4 can not be loaded in 3.0/3.1/master;
field `rawCount` is only used in training, and not used in `transform`/`predict`/`featureImportance`. So I just set it to -1L.

### Why are the changes needed?
to support load old tree model in 3.0/3.1/master

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added testsuites

Closes #30889 from zhengruifeng/fix_tree_load.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-03 11:52:46 -06:00
Liang-Chi Hsieh 963c60fe49 [SPARK-33955][SS] Add latest offsets to source progress
### What changes were proposed in this pull request?

This patch proposes to add latest offset to source progress for streaming queries.

### Why are the changes needed?

Currently we record start and end offsets per source in streaming process. Latest offset is an important information for streaming process but the progress lacks of this info. We can use it to track the process lag and adjust streaming queries. We should add latest offset to source progress.

### Does this PR introduce _any_ user-facing change?

Yes, for new metric about latest source offset in source progress.

### How was this patch tested?

Unit test. Manually test in Spark cluster:

```
    "description" : "KafkaV2[Subscribe[page_view_events]]",
    "startOffset" : {
      "page_view_events" : {
        "2" : 582370921,
        "4" : 391910836,
        "1" : 631009201,
        "3" : 406601346,
        "0" : 195799112
      }
    },
    "endOffset" : {
      "page_view_events" : {
        "2" : 583764414,
        "4" : 392338002,
        "1" : 632183480,
        "3" : 407101489,
        "0" : 197304028
      }
    },
    "latestOffset" : {
      "page_view_events" : {
        "2" : 589852545,
        "4" : 394204277,
        "1" : 637313869,
        "3" : 409286602,
        "0" : 203878962
      }
    },
    "numInputRows" : 4999997,
    "inputRowsPerSecond" : 29287.70501405811,
```

Closes #30988 from viirya/latest-offset.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-03 01:31:38 -08:00
Liang-Chi Hsieh cfd4a08398 [SPARK-33962][SS] Fix incorrect min partition condition
### What changes were proposed in this pull request?

This patch fixes an incorrect condition when comparing offset range size and min partition config.

### Why are the changes needed?

When calculating offset ranges, we consider `minPartitions` configuration. If `minPartitions` is not set or is less than or equal the size of given ranges, it means there are enough partitions at Kafka so we don't need to split offsets to satisfy min partition requirement. But the current condition is `offsetRanges.size > minPartitions.get` and is not correct. Currently `getRanges` will split offsets in unnecessary case.

Besides, in non-split case, we can assign preferred executor location and reuse `KafkaConsumer`. So unnecessary splitting offset range will miss the chance to reuse `KafkaConsumer`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Manual test in Spark cluster with Kafka.

Closes #30994 from viirya/ss-minor4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-03 01:29:12 -08:00
Max Gekk fc7d0165d2 [SPARK-33963][SQL] Canonicalize HiveTableRelation w/o table stats
### What changes were proposed in this pull request?
Skip table stats in canonicalizing of `HiveTableRelation`.

### Why are the changes needed?
The changes fix a regression comparing to Spark 3.0, see SPARK-33963.

### Does this PR introduce _any_ user-facing change?
Yes. After changes Spark behaves as in the version 3.0.1.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30995 from MaxGekk/fix-caching-hive-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 11:23:46 +09:00
Yuming Wang 6c5ba8169a [SPARK-33959][SQL] Improve the statistics estimation of the Tail
### What changes were proposed in this pull request?

This pr improve the statistics estimation of the `Tail`:

```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1")
println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats)
```

Before this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=3.8 KiB)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

After this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

### Why are the changes needed?

Import statistics estimation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30991 from wangyum/SPARK-33959.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 10:59:12 +09:00
Dongjoon Hyun 1c25bea0bb [SPARK-33961][BUILD] Upgrade SBT to 1.4.6
### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.4.6 to fix the SBT regression.

### Why are the changes needed?

[SBT 1.4.6](https://github.com/sbt/sbt/releases/tag/v1.4.6) has the following fixes
- Updates to Coursier 2.0.8, which fixes the cache directory setting on Windows
- Fixes performance regression in shell tab completion
- Fixes match error when using withDottyCompat
- Fixes thread-safety in AnalysisCallback handler

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30993 from dongjoon-hyun/SPARK-SBT-1.4.6.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-02 14:49:03 -08:00
Yuming Wang 4cd680581a [SPARK-33956][SQL] Add rowCount for Range operator
### What changes were proposed in this pull request?

This pr add rowCount for `Range` operator:
```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("select id from range(100)").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
```

After this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100)
```

### Why are the changes needed?

 [`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) need the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30989 from wangyum/SPARK-33956.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-02 08:58:48 -08:00
William Hyun bd346f4a2d [SPARK-33957][BUILD] Update commons-lang3 to 3.11
### What changes were proposed in this pull request?

This PR aims to update commons-lang3 to 3.11 to support Java 16+ better.

### Why are the changes needed?

commons-lang3 has the following bug fixes and Java 16 support.
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?
Pass the CIs.

Closes #30990 from williamhyun/Commons-lang3.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-01 19:59:17 -08:00
Baohe Zhang 45df6db906 [SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics
### What changes were proposed in this pull request?
Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page.

### Why are the changes needed?
App live UI may stuck at Executors page without this fix.
Steps to reproduce (with master branch):
In mac OS standalone mode, open a spark-shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

val x = sc.makeRDD(1 to 100000, 5)
x.count()

Then open the app UI in the browser, and click the Executors page, will get stuck at this page:
![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png)

Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906.

I debugged it and observed that ExecutorMetricsPoller
.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test, rerun the steps of bug reproduce and see the bug is gone.

Closes #30920 from baohe-zhang/SPARK-33906.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:34:55 -08:00
Kent Yao ed9f728801 [SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options
### What changes were proposed in this pull request?

While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in ShardState, we shall put the log in the codeblock that the warehouse keys being handled.

### Why are the changes needed?

bugfix, rm ambiguous log when setting spark.sql.warehouse.dir in SparkSession.builder.config, but only warn setting hive.metastore.warehouse.dir
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

 new tests

Closes #30978 from yaooqinn/SPARK-33944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:20:31 -08:00
angerszhu 771c538620 [SPARK-33084][SQL][TESTS][FOLLOW-UP] Fix Scala 2.13 UT failure
### What changes were proposed in this pull request?
Fix UT according to  https://github.com/apache/spark/pull/29966#issuecomment-752830046

Change StructType construct from
```
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
```
to
```
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)
```
The whole udf class is :

```
package org.apache.spark.examples.sql

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

class Spark33084 extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)

  // Data types of values in the aggregation buffer
  def bufferSchema: StructType =
    new StructType().add("sum", LongType).add("count", LongType)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
```

### Why are the changes needed?
Fix UT for scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #30980 from AngersZhuuuu/spark-33084-followup.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:18:31 -08:00
yi.wu 3fe5614a7c [SPARK-31946][CORE] Make worker/executor decommission signal configurable
### What changes were proposed in this pull request?

This PR proposed to make worker/executor decommission signal configurable.

* Added confs: `spark.worker.decommission.signal` / `spark.executor.decommission.signal`
* Rename `WorkerSigPWRReceived`/ `ExecutorSigPWRReceived` to `WorkerDecomSigReceived`/ `ExecutorDecomSigReceived`

### Why are the changes needed?

The current signal `PWR` can't work on macOS since it's not compliant with POSIX while macOS does.  So the developers currently can't do end-to-end decommission test on their macOS environment.

Besides, the configuration becomes more flexible for users in case the default signal (`PWR`) gets conflicted with their own applications/environment.

### Does this PR introduce _any_ user-facing change?

No (it's a new API for 3.2)

### How was this patch tested?

Manually tested.

Closes #30968 from Ngone51/configurable-decom-signal.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:13:02 -08:00
Pradyumn Agrawal (pradyumn.ag) 13e8c28409 [SPARK-33942][DOCS] Remove hiveClientCalls.count in CodeGenerator metrics docs
### What changes were proposed in this pull request?
Removed the **hiveClientCalls.count** in CodeGenerator metrics in Component instance = Executor

### Why are the changes needed?
Wrong information regarding metrics was being displayed on Monitoring Documentation. I had added referred documentation for adding metrics logging in Graphite. This metric was not being reported. I had to check if the issue was at my application end or spark code or documentation. Documentation had the wrong info.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)

Closes #30976 from coderbond007/SPARK-33942.

Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-30 17:25:46 -08:00
yangjie01 85de644733 [SPARK-33804][CORE] Fix compilation warnings about 'view bounds are deprecated'
### What changes were proposed in this pull request?

There are only 3 compilation warnings related to `view bounds are deprecated` in Spark Code:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:35: view bounds are deprecated; use an implicit parameter instead.
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/rdd/SequenceFileRDDFunctions.scala:55: view bounds are deprecated; use an implicit parameter instead.
```

This pr try to fix these compilation warnings.

### Why are the changes needed?
Fix compilation warnings about ` view bounds are deprecated`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30924 from LuciferYang/SPARK-33804.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-30 13:57:44 -06:00
Liang-Chi Hsieh f38265ddda [SPARK-33907][SQL] Only prune columns of from_json if parsing options is empty
### What changes were proposed in this pull request?

As a follow-up task to SPARK-32958, this patch takes safer approach to only prune columns from JsonToStructs if the parsing option is empty. It is to avoid unexpected behavior change regarding parsing.

This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed.

### Why are the changes needed?

It is to avoid unexpected behavior change regarding parsing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30970 from viirya/SPARK-33907-3.2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-12-30 09:57:15 -08:00
gengjiaan ba974ea8e4 [SPARK-30789][SQL] Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
### What changes were proposed in this pull request?
All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS | RESPECT NULLS. For example:
```
LEAD (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
NTH_VALUE (expr, offset)
[ IGNORE NULLS | RESPECT NULLS ]
OVER
( [ PARTITION BY window_partition ]
[ ORDER BY window_ordering
 frame_clause ] )
```

The mainstream database or engine supports this syntax contains:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html
https://docs.snowflake.com/en/sql-reference/functions/nth_value.html
https://docs.snowflake.com/en/sql-reference/functions/first_value.html
https://docs.snowflake.com/en/sql-reference/functions/last_value.html

**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

### Why are the changes needed?
Support `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE `is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test

Closes #30943 from beliefer/SPARK-30789.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 13:14:31 +00:00
Max Gekk 2afd1fb492 [SPARK-33904][SQL] Recognize spark_catalog in saveAsTable() and insertInto()
### What changes were proposed in this pull request?
In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names.

### Why are the changes needed?
1. To simplify writing of unified v1 and v2 tests
2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` sql command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`:
```scala
scala> sql("CREATE NAMESPACE spark_catalog.ns")
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
  ... 47 elided
scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
  ... 47 elided
```
but `INSERT INTO` succeed:
```sql
spark-sql> create table spark_catalog.ns.tbl (c int);
spark-sql> insert into spark_catalog.ns.tbl select 0;
spark-sql> select * from spark_catalog.ns.tbl;
0
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```scala
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl")
scala> spark.table("spark_catalog.ns.tbl").show(false)
+-----+
|value|
+-----+
|0    |
|1    |
+-----+
```

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.FileFormatWriterSuite"
```

Closes #30919 from MaxGekk/insert-into-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:56:34 +00:00
Max Gekk 0eb4961ca8 [SPARK-33926][SQL] Improve the error message from resolving of v1 database name
### What changes were proposed in this pull request?
1. Replace `SessionCatalogAndNamespace` by `DatabaseInSessionCatalog` in resolving database name from v1 session catalog.
2. Throw more precise errors from `DatabaseInSessionCatalog`
3. Fix expected error messages in `v1.ShowTablesSuiteBase`

Closes #30947

### Why are the changes needed?
Current error message "multi-part identifier cannot be empty" may confuse users. And this error message is just a consequence of "incorrectly" applied an implicit class. For example, `SHOW TABLES IN spark_catalog`:

1. Spark cuts off `spark_catalog` from namespaces in `SessionCatalogAndNamespace`, so, `ns == Seq.empty` here: 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L365)
2. Then `ns.length != 1` is `true` and Spark tries to raise the exception at 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L367)
3.  ... but `ns.quoted` triggers implicit wrapping `Seq.empty` by `MultipartIdentifierHelper`, and hit to the second check `if (parts.isEmpty)` at 156704ba0d/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala (L120-L122)

So, Spark throws the exception at third step instead of `new AnalysisException(s"The database name is not valid: $quoted")` on the second step. And even on the second step, the exception doesn't show actual reason as it is pretty generic.

### Does this PR introduce _any_ user-facing change?
Yes in the case of v1 DDL commands when a database is not specified or nested databases is set.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30963 from MaxGekk/database-in-session-catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:52:34 +00:00
Hyukjin Kwon 403bf55cbe [SPARK-33927][BUILD] Fix Dockerfile for Spark release to work
### What changes were proposed in this pull request?

This PR proposes to fix the `Dockerfile` for Spark release.

- Port b135db3b1a to `Dockerfile`
- Upgrade Ubuntu 18.04 -> 20.04 (because of porting b135db3)
- Remove Python 2 (because of Ubuntu upgrade)
- Use built-in Python 3.8.5 (because of Ubuntu upgrade)
- Node.js 11 -> 12 (because of Ubuntu upgrade)
- Ruby 2.5 -> 2.7 (because of Ubuntu upgrade)
- Python dependencies and Jekyll + plugins upgrade to the latest as it's used in GitHub Actions build (unrelated to the issue itself)

### Why are the changes needed?

To make a Spark release :-).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested via:

```bash
cd dev/create-release/spark-rm
docker build -t spark-rm --build-arg UID=$UID .
```

```
...
Successfully built 516d7943634f
Successfully tagged spark-rm:latest
```

Closes #30971 from HyukjinKwon/SPARK-33927.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 16:37:23 +09:00
Liang-Chi Hsieh 4a669f5830 [MINOR][SS] Call fetchEarliestOffsets when it is necessary
### What changes were proposed in this pull request?

This minor patch changes two variables where calling `fetchEarliestOffsets` to `lazy` because these values are not always necessary.

### Why are the changes needed?

To avoid unnecessary Kafka RPC calls.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30969 from viirya/ss-minor3.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 16:15:41 +09:00