### What changes were proposed in this pull request?
Update migration guide according to https://github.com/apache/spark/pull/30942#issuecomment-755054562
### Why are the changes needed?
update migration guide.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed.
Closes #31051 from AngersZhuuuu/SPARK-32685-FOLLOW-UP.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds the support of the latest mkdocs, and makes the sidebar properly show. It works in lower versions too.
Before:
![Screen Shot 2021-01-06 at 5 11 56 PM](https://user-images.githubusercontent.com/6477701/103745131-4e7fe400-5042-11eb-9c09-84f9f95e9fb9.png)
After:
![Screen Shot 2021-01-06 at 5 10 53 PM](https://user-images.githubusercontent.com/6477701/103745139-5049a780-5042-11eb-8ded-30b6f7ef48aa.png)
### Why are the changes needed?
This is a regression in the documentation.
### Does this PR introduce _any_ user-facing change?
Technically no; it is not directly related. It fixes the sidebar so that the list appears properly.
### How was this patch tested?
Manually built the docs via `./sql/create-docs.sh` and `open ./sql/site/index.html`
Closes #31061 from HyukjinKwon/SPARK-34022.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side.
Note that there is already a bit of inconsistency, e.g. `col`, `e`, `expr` and `column`. I did not change that part, but other mismatches such as `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` look unnecessary.
### Why are the changes needed?
To make the usage similar with Scala side, and for consistency.
### Does this PR introduce _any_ user-facing change?
No, this is not released yet.
### How was this patch tested?
GitHub Actions and Jenkins build will test it out.
Closes #31062 from HyukjinKwon/SPARK-30681.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Skip files if they are binary or very large to fit the configMap's max size.
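A minimal sketch of the skipping idea (the size threshold, binary probe, and names are illustrative assumptions, not the exact Kubernetes-module code):

```scala
import java.nio.file.{Files, Path}

// Keep only conf files that a ConfigMap can safely hold: small enough and plain text.
def loadablePropertyFiles(confFiles: Seq[Path], maxSizeBytes: Long): Seq[Path] =
  confFiles.filter { f =>
    val small = Files.size(f) <= maxSizeBytes
    // Crude binary probe: a text file should not contain NUL bytes.
    val text = !Files.readAllBytes(f).contains(0.toByte)
    if (!small || !text) println(s"WARN: skipping $f (too large or binary) for the ConfigMap")
    small && text
  }
```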
### Why are the changes needed?
A ConfigMap cannot hold binary files, and there is also a limit on how much data it can hold.
This limit can be configured by the k8s cluster admin. This PR skips such files (with a warning) instead of failing with confusing runtime errors.
If such files are not skipped, they result in mount errors or encoding errors (when binary files are submitted).
### Does this PR introduce _any_ user-facing change?
Yes. In simple words, it avoids possible errors caused by negligence (for example, placing a large file or a binary file in SPARK_CONF_DIR) and thus improves the user experience.
### How was this patch tested?
Added relevant tests and improved existing tests.
Closes #30472 from ScrapCodes/SPARK-32221/avoid-conf-propagate-errors.
Lead-authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Co-authored-by: Prashant Sharma <prashant@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
### What changes were proposed in this pull request?
We should optimize Like Any/All by LikeSimplification to improve performance.
### Why are the changes needed?
Optimize Like Any/All
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes #30975 from beliefer/SPARK-33938.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Follow up on the review comment and fix the flaky test: https://github.com/apache/spark/pull/30973#issuecomment-754852130.
This flaky test is similar to https://github.com/apache/spark/pull/30896.
Some tasks fail with a root cause, but the driver may return an error without the root cause, so change the UT to check the exit code in the status, since the exit code differs depending on the root cause.
### Why are the changes needed?
Fix flaky test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #31046 from AngersZhuuuu/SPARK-33934-FOLLOW-UP.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to adjust the order of check in KafkaTokenUtil.needTokenUpdate, so that short-circuit applies on the non-delegation token cases (insecure + secured without delegation token) and remedies the performance regression heavily.
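A toy illustration of the reordering idea (names and signature are assumptions, not `KafkaTokenUtil`'s actual API): evaluate the cheap "does this source use a delegation token at all?" condition first, so insecure and non-token setups short-circuit before any expensive token lookup.

```scala
// Before: the expensive token lookup ran even when no delegation token was configured.
// After: `&&` short-circuits on the common non-delegation-token path.
def needTokenUpdate(usesDelegationToken: Boolean, tokenHasChanged: () => Boolean): Boolean =
  usesDelegationToken && tokenHasChanged()
```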
### Why are the changes needed?
There's a serious performance regression between Spark 2.4 vs Spark 3.0 on read path against Kafka data source.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually ran a reproducer (https://github.com/codegorillauk/spark-kafka-read with modification to just count instead of writing to Kafka topic) with measuring the time.
> the branch applying the change with adding measurement
https://github.com/HeartSaVioR/spark/commits/debug-SPARK-33635-v3.0.1
> the branch only adding measurement
https://github.com/HeartSaVioR/spark/commits/debug-original-ver-SPARK-33635-v3.0.1
> the result (before the fix)
count: 10280000
Took 41.634007047 secs
21/01/06 13:16:07 INFO KafkaDataConsumer: debug ver. 17-original
21/01/06 13:16:07 INFO KafkaDataConsumer: Total time taken to retrieve: 82118 ms
> the result (after the fix)
count: 10280000
Took 7.964058475 secs
21/01/06 13:08:22 INFO KafkaDataConsumer: debug ver. 17
21/01/06 13:08:22 INFO KafkaDataConsumer: Total time taken to retrieve: 987 ms
Closes #31056 from HeartSaVioR/SPARK-33635.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Change `FrameLessOffsetWindowFunction` to a sealed abstract class so that pattern matching can be simplified.
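A self-contained sketch of why the sealed abstract parent helps (toy classes, not the Catalyst definitions): one case on the parent type replaces separate cases for each concrete offset function, and the compiler can check exhaustiveness.

```scala
sealed abstract class FrameLessOffsetWindowFunction
case object Lead extends FrameLessOffsetWindowFunction
case object Lag extends FrameLessOffsetWindowFunction

// A single match on the sealed parent replaces listing Lead and Lag individually.
def isFrameLessOffset(f: Any): Boolean = f match {
  case _: FrameLessOffsetWindowFunction => true
  case _ => false
}
```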
### Why are the changes needed?
Simplify pattern match
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Jenkins test
Closes #31026 from beliefer/SPARK-30789-followup.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Filter out the driver entity when updating the exclusion status of live executors (including the driver), so the driver won't be marked as excluded in the UI even if the node that hosts the driver has been marked as excluded.
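A small sketch of the filtering idea (identifiers are assumptions; Spark marks the driver entry with a dedicated executor id):

```scala
val DRIVER_IDENTIFIER = "driver"

// When a host is excluded, update the exclusion status of executor entries only;
// the driver entry stays untouched even if it runs on the excluded host.
def entitiesToMarkExcluded(liveEntitiesOnHost: Seq[String]): Seq[String] =
  liveEntitiesOnHost.filterNot(_ == DRIVER_IDENTIFIER)
```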
### Why are the changes needed?
Before this change, if we run Spark in standalone mode with spark.blacklist.enabled=true, the driver gets marked as excluded when the host running the driver has been marked as excluded. This is incorrect because the exclude list feature excludes executors only, and the driver is still active.
![image](https://user-images.githubusercontent.com/26694233/103238740-35c05180-4911-11eb-99a2-c87c059ba0cf.png)
After the fix, the driver won't be marked as excluded.
![image](https://user-images.githubusercontent.com/26694233/103238806-6f915800-4911-11eb-80d5-3c99266cfd0a.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test. Reopen the UI and see the driver is no longer marked as excluded.
Closes #30954 from baohe-zhang/SPARK-33029.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When sparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it is over-reporting the time inputting rows.
In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.
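A schematic Scala sketch of the corrected accounting (the actual change is in the R worker; `readGroups`/`compute` are stand-ins): the input marker is taken once per partition, and each group's compute time is measured with its own local marker instead of reusing the input one.

```scala
def readGroups(): Seq[Int] = Seq(1, 2, 3) // stand-in for reading the partition's groups
def compute(group: Int): Unit = ()        // stand-in for the per-group computation

val inputStart = System.nanoTime()
val groups = readGroups()
val inputElapsed = System.nanoTime() - inputStart     // read-input time, measured once

var computeElapsed = 0L
for (g <- groups) {
  val computeStart = System.nanoTime()                // fresh marker per group
  compute(g)
  computeElapsed += System.nanoTime() - computeStart  // no longer charged to read-input
}
```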
For our application, here's what a log entry looks like before these changes were applied:
`20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s`
this indicates that we're spending more time reading rows than operating on the rows.
After these changes, it looks like this:
`20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s`
### Why are the changes needed?
Metrics shouldn't mislead?
### Does this PR introduce _any_ user-facing change?
Aside from the metrics no longer being misleading, no.
### How was this patch tested?
unit tests passed. Field test results seem plausible
Closes #31021 from WamBamBoozle/input_timing.
Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after partitions renaming. In particular, this re-creates the cache associated with the modified table.
2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command.
### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0 0
1 1
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0 0
1 1
```
The last query must return `0 2` instead of `0 0`, since that row's partition was renamed by the previous command.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0 2
1 1
```
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```
Closes #31044 from MaxGekk/rename-partition-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/22696 we added support for HAVING without GROUP BY, which means a global aggregate.
But since HAVING was previously treated as a Filter, this caused a lot of analysis errors. After https://github.com/apache/spark/pull/28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but it broke the original behavior of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
This PR fixes this issue and adds a UT.
### Why are the changes needed?
Keep the behavior consistent with the migration guide.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added UT
Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Switch log level from warn to debug when the spark container is not present in the pod's container statuses.
### Why are the changes needed?
There are many non-critical situations where the Spark container may not be present, and the warning log level is too high.
### Does this PR introduce _any_ user-facing change?
Log message change.
### How was this patch tested?
N/A
Closes #31047 from holdenk/SPARK-33874-follow-up.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog`.
### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes #30870 from beliefer/SPARK-33542.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Instead of requiring the '-Paarch64' parameter for Maven builds,
activate the profile automatically based on OS settings,
so that the same command can be used to build on aarch64.
### What changes were proposed in this pull request?
Activate profile 'aarch64' based on OS
### Why are the changes needed?
After this change, we build spark using the same command for aarch64 as x86.
### Does this PR introduce _any_ user-facing change?
No.
After this change, there is no need to pass the '-Paarch64' parameter when building, but passing the parameter still works.
### How was this patch tested?
ARM daily CI.
Closes #31036 from huangtianhua/SPARK-34009.
Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to use python3 instead of python in SQL documentation build.
After SPARK-29672, we use `python3` everywhere in Spark dev. We should fix it in `sql/create-docs.sh` too.
This blocks release because the release container does not have `python` but only `python3`.
### Why are the changes needed?
To unblock the release.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
I manually ran the script
Closes #31041 from HyukjinKwon/SPARK-34010.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `sbt unidoc` fails with JDK11.
With the current master, `sbt unidoc` fails because the generated Java sources cause syntax error.
As of JDK11, the default doclet seems to refuse such syntax error.
Usually, it's enough to specify `--ignore-source-errors` option when `javadoc` runs to suppress the syntax error but unfortunately, we will then get an internal error.
```
[error] javadoc: error - An internal exception has occurred.
[error] (java.lang.NullPointerException)
[error] Please file a bug against the javadoc tool via the Java bug reporting page
[error] (http://bugreport.java.com) after checking the Bug Database (http://bugs.java.com)
[error] for duplicates. Include error messages and the following diagnostic in your report. Thank you.
[error] java.lang.NullPointerException
[error] at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2340)
[error] at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2398)
[error] at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2348)
[error] at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1659)
[error] at jdk.compiler/com.sun.tools.javac.code.Types$DefaultTypeVisitor.visit(Types.java:4857)
[error] at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2343)
[error] at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2329)
[error] at jdk.compiler/com.sun.tools.javac.model.JavacTypes.erasure(JavacTypes.java:134)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1069)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1048)
[error] at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1695)
[error] at java.compiler11.0.9.1/javax.lang.model.util.AbstractTypeVisitor6.visit(AbstractTypeVisitor6.java:104)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils.asTypeElement(Utils.java:1086)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.setContext(LinkInfoImpl.java:410)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.<init>(LinkInfoImpl.java:285)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLink(LinkFactoryImpl.java:184)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLinks(LinkFactoryImpl.java:167)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.links.LinkFactory.getLink(LinkFactory.java:196)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.getLink(HtmlDocletWriter.java:679)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedClassLink(HtmlDocletWriter.java:814)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedStrongClassLink(HtmlDocletWriter.java:839)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addPartialInfo(AbstractTreeWriter.java:185)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:92)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:94)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:129)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:112)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generatePackageTreeFile(PackageTreeWriter.java:115)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generate(PackageTreeWriter.java:92)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDoclet.generatePackageFiles(HtmlDoclet.java:312)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.startGeneration(AbstractDoclet.java:210)
[error] at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.run(AbstractDoclet.java:114)
[error] at jdk.javadoc/jdk.javadoc.doclet.StandardDoclet.run(StandardDoclet.java:72)
[error] at jdk.javadoc/jdk.javadoc.internal.tool.Start.parseAndExecute(Start.java:588)
[error] at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:432)
[error] at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:345)
[error] at jdk.javadoc/jdk.javadoc.internal.tool.Main.execute(Main.java:63)
[error] at jdk.javadoc/jdk.javadoc.internal.tool.Main.main(Main.java:52)
```
I found the internal error happens when a generated Java class is from a Scala class which is package private and generic.
I also found that if we don't generate class hierarchy tree in the JavaDoc, we can suppress the internal error for JDK11 and later.
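A rough sketch of the kind of build tweak this implies (the sbt keys and version check are assumptions, not Spark's exact `SparkBuild.scala` change): pass `--ignore-source-errors` to tolerate the generated-source syntax errors and `-notree` to skip the class hierarchy pages that trigger the internal error.

```scala
// Hypothetical sbt fragment; the setting key and JDK check are illustrative only.
JavaUnidoc / unidoc / javacOptions ++= {
  if (System.getProperty("java.specification.version").toDouble >= 11)
    Seq("--ignore-source-errors", "-notree")
  else Seq.empty
}
```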
### Why are the changes needed?
Make the build succeed with sbt and JDK 11.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that the following command finishes successfully with JDK 8 and JDK 11.
```
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Pmesos -Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud clean unidoc
```
I also confirmed html files are successfully generated under `target/javaunidoc`.
Closes #31023 from sarutak/fix-genjavadoc-java11.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a partial revert of https://github.com/apache/spark/pull/30456 by downgrading scala-maven-plugin from 4.4.0 to 4.3.0.
Currently, when you run the docker release script (`./dev/create-release/do-release-docker.sh`), it fails to compile as below during incremental compilation with zinc for an unknown reason:
```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
java.lang.ClassLoader.defineClass(ClassLoader.java:754)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
java.net.URLClassLoader.access$100(URLClassLoader.java:74)
java.net.URLClassLoader$1.run(URLClassLoader.java:369)
java.net.URLClassLoader$1.run(URLClassLoader.java:363)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:362)
java.lang.ClassLoader.loadClass(ClassLoader.java:418)
java.lang.ClassLoader.loadClass(ClassLoader.java:351)
java.lang.Class.getDeclaredMethods0(Native Method)
java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
java.lang.Class.privateGetPublicMethods(Class.java:2902)
java.lang.Class.getMethods(Class.java:1615)
sbt.internal.inc.ClassToAPI$.toDefinitions0(ClassToAPI.scala:170)
sbt.internal.inc.ClassToAPI$.$anonfun$toDefinitions$1(ClassToAPI.scala:123)
scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
sbt.internal.inc.ClassToAPI$.toDefinitions(ClassToAPI.scala:123)
sbt.internal.inc.ClassToAPI$.$anonfun$process$1(ClassToAPI.scala:3
```
This happens when building Spark with Hadoop 2. It doesn't reproduce when you build this module alone; it has to follow the build sequence in the release script.
This is fixed by downgrading. It looks like there is a regression in scala-maven-plugin somewhere between 4.3.0 and 4.4.0.
### Why are the changes needed?
To unblock the release.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
It can be tested as below:
```bash
./dev/create-release/do-release-docker.sh -d $WORKING_DIR
```
Closes #31031 from HyukjinKwon/SPARK-34007.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs.
2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite"
```
Closes #30937 from MaxGekk/unify-show-namespaces-tests.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Changed the cost function in CBO to match documentation.
### Why are the changes needed?
The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
```
The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
```
The implementation in `JoinReorderDP.betterThan` does not match this documentation:
```
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
if (other.planCost.card == 0 || other.planCost.size == 0) {
false
} else {
val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
relativeRows * conf.joinReorderCardWeight +
relativeSize * (1 - conf.joinReorderCardWeight) < 1
}
}
```
This different implementation has an unfortunate consequence:
given two plans A and B, A betterThan B and B betterThan A might both give the same result. This happens when one plan has many rows with small sizes and the other has few rows with large sizes.
Example values that exhibit this phenomenon with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have a score above 1 and would return false.
This happens with several of the TPCDS queries.
The new implementation does not have this behavior.
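A self-contained sketch of the documented cost function applied to the example above (plain Scala, not the Catalyst code): with absolute costs the comparison is a total order, so the A/B ambiguity disappears.

```scala
val weight = BigDecimal(0.7)

// rows * weight + size * (1 - weight), as documented for spark.sql.cbo.joinReorder.card.weight
def cost(card: BigDecimal, size: BigDecimal): BigDecimal =
  card * weight + size * (BigDecimal(1) - weight)

val costA = cost(BigDecimal(500), BigDecimal(30)) // 350 + 9  = 359
val costB = cost(BigDecimal(300), BigDecimal(80)) // 210 + 24 = 234
assert(costB < costA)                             // B is consistently the better plan
```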
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New and existing UTs
Closes #30965 from tanelk/SPARK-33935_cbo_cost_function.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Currently, spark-sql does not support parsing SQL statements that contain bracketed comments.
For the sql statements:
```
/* SELECT 'test'; */
SELECT 'test';
```
It would be split into two statements:
The first one: `/* SELECT 'test'`
The second one: `*/ SELECT 'test'`
Then it would throw an exception because the first one is illegal.
In this PR, we ignore the content inside bracketed comments while splitting the SQL statements.
Besides, we ignore comments without any content.
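A self-contained sketch of the splitting idea (not the actual `SparkSQLCLIDriver` change): track bracketed-comment depth and only treat `;` as a statement separator at depth 0.

```scala
def splitStatements(sql: String): Seq[String] = {
  val parts = scala.collection.mutable.Buffer(new StringBuilder)
  var i = 0
  var depth = 0                                     // nesting level of /* ... */
  while (i < sql.length) {
    if (sql.startsWith("/*", i)) { depth += 1; parts.last.append("/*"); i += 2 }
    else if (depth > 0 && sql.startsWith("*/", i)) { depth -= 1; parts.last.append("*/"); i += 2 }
    else if (depth == 0 && sql.charAt(i) == ';') { parts += new StringBuilder; i += 1 }
    else { parts.last.append(sql.charAt(i)); i += 1 }
  }
  parts.map(_.toString.trim).filter(_.nonEmpty).toSeq
}

// splitStatements("/* SELECT 'test'; */ SELECT 'test';") yields one statement, not two.
```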
### Why are the changes needed?
spark-sql might split statements inside bracketed comments, which is not correct.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added UT.
Closes #29982 from turboFei/SPARK-33110.
Lead-authored-by: fwang12 <fwang12@ebay.com>
Co-authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
From the log below, stage 600 could be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 threw `NoSuchElementException` when it entered `onTaskEnd()`.
```
21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
previous stage needs to be re-run, or because a different copy of the task has already succeeded).
21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
at scala.collection.MapLike.default(MapLike.scala:235)
at scala.collection.MapLike.default$(MapLike.scala:234)
at scala.collection.AbstractMap.default(Map.scala:63)
at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
```
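A small sketch of the defensive pattern this suggests (not necessarily the exact change in `ExecutorAllocationListener.onTaskEnd`): look the stage up instead of calling `apply()`, so a speculative task that finishes after its stage was cleaned up no longer throws.

```scala
import scala.collection.mutable

val stageAttemptToNumSpeculativeTasks = mutable.HashMap.empty[(Int, Int), Int]
val stageAttempt = (600, 0)

// apply() would throw NoSuchElementException here; get() handles the already-removed stage.
stageAttemptToNumSpeculativeTasks.get(stageAttempt) match {
  case Some(n) => stageAttemptToNumSpeculativeTasks(stageAttempt) = n - 1
  case None => // stage already removed by onStageCompleted(); nothing to update
}
```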
### Why are the changes needed?
To avoid throwing the java.util.NoSuchElementException
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is a protective patch, and it's not easy to reproduce in a UT because the event order is not fixed in an async queue.
Closes #31025 from LantaoJin/SPARK-34000.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/29643, we moved the plan rewriting methods to QueryPlan. We need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer,
because it and resolveOperatorsUpWithNewOutput are called in the analyzer.
For example,
PaddingAndLengthCheckForCharVarchar could fail a query when calling resolveOperatorsUpWithNewOutput, with:
```logtalk
[info] - char/varchar resolution in sub query *** FAILED *** (367 milliseconds)
[info] java.lang.RuntimeException: This method should not be called in the analyzer
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267)
```
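A sketch of the kind of override described above (shown out of its enclosing trait, with an abbreviated signature; the real method also has default arguments):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{AnalysisHelper, LogicalPlan}

// Wrap the parent implementation so the analyzer is allowed to run this transform.
override def transformUpWithNewOutput(
    rule: PartialFunction[LogicalPlan, (LogicalPlan, Seq[(Attribute, Attribute)])],
    skipCond: LogicalPlan => Boolean,
    canGetOutput: LogicalPlan => Boolean): LogicalPlan =
  AnalysisHelper.allowInvokingTransformsInAnalyzer {
    super.transformUpWithNewOutput(rule, skipCond, canGetOutput)
  }
```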
### Why are the changes needed?
trivial bugfix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes #31013 from yaooqinn/SPARK-33992.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method.
### Why are the changes needed?
To remove unused method.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test.
Closes #31022 from imback82/33382-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s are duplicated across many commands and it can be moved into `V2CommandExec` to remove duplicate code.
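A sketch of what such a shared helper could look like (the helper name, placement, and conversion are assumptions, not the exact refactoring):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

// Stands in for the shared spot on V2CommandExec, used by SHOW TABLES, DESCRIBE TABLE, etc.
trait RowBuildingHelper {
  protected def toCatalystRow(values: String*): InternalRow =
    new GenericInternalRow(values.map(UTF8String.fromString).toArray[Any])
}
```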
### Why are the changes needed?
To clean up duplicate code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test since this is just refactoring.
Closes #31020 from imback82/refactor_v2_command.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Instead of returning NULL, the next_day function now throws a runtime IllegalArgumentException when ANSI mode is enabled and it receives invalid input for the dayOfWeek parameter.
### Why are the changes needed?
For ansiMode.
### Does this PR introduce _any_ user-facing change?
Yes.
When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter.
When spark.sql.ansi.enabled = false, same behaviour as before.
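An illustrative spark-shell run of the behavior described above:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT next_day('2021-01-06', 'xx')").show()   // throws IllegalArgumentException

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT next_day('2021-01-06', 'xx')").show()   // returns NULL, as before
```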
### How was this patch tested?
Ansi mode is tested with existing tests.
End-to-end tests have been added.
Closes #30807 from chongguang/SPARK-33794.
Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the `RemoveNoopOperators` rule to optimization batch `Union`. Also made sure that the `RemoveNoopOperators` would be idempotent.
### Why are the changes needed?
In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them.
The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UTs and the output of `PlanStabilitySuite`
Closes #30996 from tanelk/SPARK-33964_combine_unions.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix the incorrect doc from the last change: https://github.com/apache/spark/pull/30922#discussion_r551453193
### Why are the changes needed?
Fix doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Builds finished correctly.
Closes #31016 from AngersZhuuuu/SPARK-33908-FOLLOW-UP.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Change the visibility modifier of two case classes defined inside objects in MLlib from private to private[OuterClass].
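A sketch of the kind of change (class and member names follow the error message quoted below; the actual nesting in MLlib may differ): scoping the case class to its outer object keeps it hidden from users while the generated encoder code can still call its accessors.

```scala
object Word2VecModel {
  // Before: `private case class Data(...)` — accessors invisible to generated code in Scala 2.13.
  // After: visible within the enclosing object, still not public API.
  private[Word2VecModel] case class Data(word: String, vector: Array[Float])
}
```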
### Why are the changes needed?
Without this change when running tests for Scala 2.13 you get runtime code generation errors. These errors look like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests now pass for Scala 2.13
Closes #31018 from koertkuipers/feat-visibility-scala213.
Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command.
2. Port the test for v1 catalogs to the base suite to run it for v2 table catalog.
### Why are the changes needed?
The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987.
### Does this PR introduce _any_ user-facing change?
Yes, it could if users have v2 table catalogs.
### How was this patch tested?
By running unified tests for `ALTER TABLE .. DROP PARTITION`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Invalidate char/varchar in `spark.readStream.schema`, just like what we did for `spark.read.schema` in da72b87374.
### Why are the changes needed?
Bug fix: char/varchar is only for table schemas while `spark.sql.legacy.charVarcharAsString=false`.
### Does this PR introduce _any_ user-facing change?
Yes, char/varchar will fail when defining Structured Streaming readers while `spark.sql.legacy.charVarcharAsString=false`.
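An illustrative usage of the new check, matching the behavior described above:

```scala
spark.conf.set("spark.sql.legacy.charVarcharAsString", "false")
spark.readStream.schema("col CHAR(5)")   // now rejected, consistent with spark.read.schema
```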
### How was this patch tested?
new tests
Closes #31003 from yaooqinn/SPARK-33980.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to upgrade cloudpickle from 1.5.0 to 1.6.0.
It virtually contains one fix:
4510be850d
From a cursory look, this isn't a regression, and not even properly supported in Python:
```python
>>> import pickle
>>> pickle.dumps({}.keys())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'dict_keys' object
```
So it seems fine not to backport.
### Why are the changes needed?
To leverage bug fixes from the cloudpickle upstream.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub actions build will test it out.
Closes #31007 from HyukjinKwon/cloudpickle-upgrade.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful so as to monitor performance changes with CBO enabled.
### Why are the changes needed?
To monitor performance changes with CBO enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests.
### Why are the changes needed?
This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` did not remove the data belonging to the dropped partition. As a consequence, the `select` query returned the removed data.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running tests suites for v1 and v2 catalogs:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes #31014 from MaxGekk/fix-drop-partition-v2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).
### Why are the changes needed?
To leverage fixes from the upstream in Py4J.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub Actions will test it out.
Closes #31009 from HyukjinKwon/SPARK-33984.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to implement `DESCRIBE COLUMN` for v2 tables.
Note that the `isExtended` option is not implemented in this PR.
### Why are the changes needed?
Parity with v1 tables.
### Does this PR introduce _any_ user-facing change?
Yes, now, `DESCRIBE COLUMN` works for v2 tables.
```scala
sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
```
```
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name| data|
|data_type| string|
| comment| hello|
+---------+----------+
```
Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.`
### How was this patch tested?
Added new test.
Closes #30881 from imback82/describe_col_v2.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In Hive 1.2.1, the Hive SerDe simply splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` on commas.
When we use Spark 2.4 with the UT
```
test("insert overwrite directory with comma col name") {
withTempDir { dir =>
val path = dir.toURI.getPath
val v1 =
s"""
| INSERT OVERWRITE DIRECTORY '${path}'
| STORED AS TEXTFILE
| SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
""".stripMargin
sql(v1).explain(true)
sql(v1).show()
}
}
```
it fails as below, since a column name contains `,`, so the numbers of column names and column types do not match.
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
After Hive 2.3, COLUMN_NAME_DELIMITER is set to a special char when a column name contains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)
And in script transform, we parse column names to avoid this problem:
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)
So I think we should do the same thing in `InsertIntoHiveDirCommand` too. I have verified that this method makes Spark 2.4 work well.
### Why are the changes needed?
Safer use of the SerDe.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #30850 from AngersZhuuuu/SPARK-33844.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the special handling of `CacheTable` in `TestHiveQueryExecution.analyzed` because it prevents supporting `spark_catalog` in Hive table names. `spark_catalog` can be handled by the few lines below:
```scala
case UnresolvedRelation(ident, _, _) =>
if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) {
```
added by https://github.com/apache/spark/pull/30883.
### Why are the changes needed?
1. To have feature parity with v1 In-Memory catalog.
2. To be able to write unified tests for In-Memory and Hive external catalogs.
### Does this PR introduce _any_ user-facing change?
Should not.
### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```
Closes #30997 from MaxGekk/cache-table-spark_catalog.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The JDBC SQL TIME type is represented incorrectly as TimestampType; we change it to a physical Int in millis for now.
### Why are the changes needed?
Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark's TimestampType. It should be represented as a physical int in millis: a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond. It stores the number of milliseconds after midnight, 00:00:00.000.
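A self-contained sketch of the intended mapping (not the exact JDBC-dialect code): a `java.sql.Time` becomes the count of milliseconds after midnight, which fits in an Int.

```scala
import java.sql.Time

def timeToMillisOfDay(t: Time): Int =
  (t.toLocalTime.toNanoOfDay / 1000000L).toInt   // 0 .. 86399999

timeToMillisOfDay(Time.valueOf("13:02:01"))      // 46921000
```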
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #30902 from saikocat/SPARK-33888.
Lead-authored-by: Hoa <hoameomu@gmail.com>
Co-authored-by: Hoa <saikocatz@gmail.com>
Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn>
Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In hive we always use
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
Since in Spark we wrap the script command with `/bin/bash -c`, in this case we will throw `script.py: command not found`.
This PR adds the SparkFiles root dir path to the execution env property `PATH`, so the sub-process will find `script.py` as a program under `PATH`.
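A sketch of the env tweak (the `ProcessBuilder` handle `builder` and the surrounding command setup are assumptions about the enclosing code): files registered via `add file` land under the SparkFiles root, so appending that directory to `PATH` lets the shell resolve `script.py` by name.

```scala
import java.io.File
import org.apache.spark.SparkFiles

val env = builder.environment()   // builder: the ProcessBuilder for the wrapped `/bin/bash -c ...`
env.put("PATH", env.get("PATH") + File.pathSeparator + SparkFiles.getRootDirectory())
```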
### Why are the changes needed?
Support SQL migration from Hive to Spark.
### Does this PR introduce _any_ user-facing change?
Users can directly use the script file name as the program in script transform SQL.
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
### How was this patch tested?
UT
Closes #30973 from AngersZhuuuu/SPARK-33934.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes some operators missing rowCount when CBO is enabled, e.g.:
```scala
spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("set spark.sql.cbo.planStats.enabled=true")
spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
+- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
+- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```
After this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
+- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
+- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```
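A sketch of the kind of propagation involved (illustrative only, not the exact Catalyst visitor code): operators such as repartition do not change the number of rows, so their statistics can simply reuse the child's statistics, including `rowCount`, instead of recomputing only `sizeInBytes`.

```scala
// Illustrative only; the real change lives in Catalyst's stats plan visitors.
def visitRepartitionByExpr(p: RepartitionByExpression): Statistics = p.child.stats
```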
### Why are the changes needed?
[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) need the row count.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30987 from wangyum/SPARK-33954.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use.
### Why are the changes needed?
increase readability and ease of use.
### Does this PR introduce _any_ user-facing change?
'Yes'.
### How was this patch tested?
Jenkins test
Closes #30982 from beliefer/SPARK-33951.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table.
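As an illustration from the user-facing API (per the description above, the command now performs the equivalent of this manual refresh internally):

```scala
spark.sql("ALTER TABLE tbl1 DROP PARTITION (part0=0)")
spark.catalog.refreshTable("tbl1")   // invalidate the stale cache entry for tbl1
spark.table("tbl1").show()           // no longer returns rows from the dropped partition
```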
### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0 0
1 1
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
0 0
1 1
```
The last query must not return `0 0` since it was deleted by previous command.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
1 1
```
### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```
Closes #30983 from MaxGekk/drop-partition-refresh-cache.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added to `NodeData`, which causes a tree model trained in 2.4 to fail to load in 3.0/3.1/master.
The field `rawCount` is only used in training, and not in `transform`/`predict`/`featureImportance`, so I just set it to -1L.
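A sketch of the compatibility fallback (field handling is an assumption about the loader, not the exact MLlib code): when the loaded `NodeData` rows come from a 2.4 model, the `rawCount` column is absent, so default it instead of failing.

```scala
import org.apache.spark.sql.Row

def rawCountOrDefault(row: Row): Long =
  if (row.schema.fieldNames.contains("rawCount")) row.getAs[Long]("rawCount") else -1L
```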
### Why are the changes needed?
To support loading old tree models in 3.0/3.1/master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuites
Closes #30889 from zhengruifeng/fix_tree_load.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>