### What changes were proposed in this pull request?
This PR intends to fix corner-case bugs in `QueryPlan#transformUpWithNewOutput`, which is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule that simply assigns new `ExprId`s in a projection list like this:
```
case class TestRule() extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
    case p @ Project(projList, _) =>
      val newPlan = p.copy(projectList = projList.map { _.transform {
        // Assigns a new `ExprId` for references
        case a: AttributeReference => Alias(a, a.name)()
      }}.asInstanceOf[Seq[NamedExpression]])
      val attrMapping = p.output.zip(newPlan.output)
      newPlan -> attrMapping
  }
}
```
Then, this rule is applied to the plan below:
```
(3) Project [a#5, b#6]
+- (2) Project [a#5, b#6]
+- (1) Project [a#5, b#6]
+- LocalRelation <empty>, [a#5, b#6]
```
In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule first corrects the input references of `(2) Project` by using the attribute mapping given by `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using an incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modifies the code to update the attribute mapping entries that are obsoleted by the entries generated in a given rule (e.g., rewriting a#5->a#7 into a#5->a#9).
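To make the fix concrete, here is a small, hypothetical sketch of the mapping update (plain strings stand in for attributes; the real logic lives inside `QueryPlan#transformUpWithNewOutput`):
```scala
// Hypothetical sketch: compose the accumulated mapping with the entries generated by the
// latest rule invocation so that stale targets are rewritten.
def updateMapping(
    accumulated: Seq[(String, String)], // e.g. a#5 -> a#7, b#6 -> b#8
    generated: Seq[(String, String)]    // e.g. a#7 -> a#9, b#8 -> b#10
): Seq[(String, String)] = {
  val genMap = generated.toMap
  // a#5 -> a#7 becomes a#5 -> a#9; generated entries whose source is already a target above
  // are dropped because they have been folded into the accumulated entries.
  accumulated.map { case (from, to) => from -> genMap.getOrElse(to, to) } ++
    generated.filterNot { case (from, _) => accumulated.exists(_._2 == from) }
}

updateMapping(
  Seq("a#5" -> "a#7", "b#6" -> "b#8"),
  Seq("a#7" -> "a#9", "b#8" -> "b#10"))
// => Seq(a#5 -> a#9, b#6 -> b#10)
```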
### Why are the changes needed?
bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests in `QueryPlanSuite`.
Closes #29911 from maropu/QueryPlanBug.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add checks for the cases when JDBC v2 Table Catalog commands fail.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `JDBCTableCatalogSuite`.
Closes #29945 from MaxGekk/jdbcv2-negative-tests.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency to the forked Hive 1.2.1 in maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.
This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```
For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it.
### Why are the changes needed?
- First, the Apache Spark community should not use an unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released on 2015-06-26, and the forked Hive `1.2.1.spark2` has exposed many bugs that cannot be fixed because the fork is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and has shown far fewer bugs than `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile, and so far new Apache Spark unit tests have been added behind the `HiveUtils.isHive23` condition.
### Does this PR introduce _any_ user-facing change?
No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.
### How was this patch tested?
1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This combination is already unsupported because Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, this combination is now rejected)
Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers:
Before
```scala
AdaptiveSparkPlan isFinalPlan=false
+- == Current Plan ==
SortMergeJoin [key#13], [a#23], Inner
:- Sort [key#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
...
+- == Initial Plan ==
SortMergeJoin [key#13], [a#23], Inner
:- Sort [key#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
...
```
After
```scala
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [key#13], [a#23], Inner
:- Sort [key#13 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
...
```
For SQL `EXPLAIN` output:
Before
```scala
AdaptiveSparkPlan (8)
+- == Current Plan ==
Sort (7)
+- Exchange (6)
...
+- == Initial Plan ==
Sort (7)
+- Exchange (6)
...
```
After
```scala
AdaptiveSparkPlan (8)
+- Sort (7)
+- Exchange (6)
...
```
### Why are the changes needed?
To simplify the AQE plan string by removing the redundant plan information.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Modified the existing unit test.
Closes #29915 from allisonwang-db/aqe-explain.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime.
### Why are the changes needed?
Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156
However, this test case only checks that the value of the conf can be changed at runtime. It does not check that the updated value is actually used by the Optimizer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
unit test
Closes #29919 from yuningzh-db/add_optimizer_test.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to do column pruning for `JsonToStructs` expression if we only require some fields from it.
### Why are the changes needed?
`JsonToStructs` takes a schema parameter that tells `JacksonParser` which fields need to be parsed. If `JsonToStructs` is followed by `GetStructField`, we can prune the schema so that only the required field is parsed.
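A hedged example of the pattern this targets (the column names and schema are made up; assumes an active `SparkSession` named `spark`):
```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import spark.implicits._

// Only field `a` of the parsed struct is consumed, so the parsing schema handed to
// JacksonParser can be pruned down to just `a`.
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")
df.select(from_json($"json", schema).getField("a").as("a")).explain(true)
```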
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #29900 from viirya/SPARK-32958.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
(This is a followup PR of #29585) The PR modified the `RuleExecutor#isPlanIntegral` code to check whether a plan has globally-unique attribute IDs, but this check made the Jenkins maven test jobs much longer (See [the Dongjoon comment](https://github.com/apache/spark/pull/29585#issuecomment-702461314) and thanks, dongjoon-hyun !). To recover the running time of the Jenkins tests, this PR intends to update the code to run the plan integrity check only for effective plans.
### Why are the changes needed?
To recover running time for Jenkins tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29928 from maropu/PR29585-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
At the moment only the built-in JDBC connection providers can be used, but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API named `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider).
The PR contains the following changes:
* Added the connection provider developer API (a rough sketch of an external implementation is shown after this list)
* Made the JDBC connection providers' constructors no-arg, which is needed to load them w/ the service loader
* Connection providers are now loaded w/ service loader
* Added tests to load providers independently
* Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config
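A rough sketch of what an external provider could look like under this API (the two methods mirror the developer API described above, but treat the exact signatures and the registration file path as assumptions to verify against the code and the linked example):
```scala
import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Loaded via the Java service loader, i.e. listed in a resource file such as
// META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider (assumed path).
class MyDbConnectionProvider extends JdbcConnectionProvider {
  // Claim only URLs of the hypothetical "mydb" database.
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("url").exists(_.startsWith("jdbc:mydb:"))

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    // Custom authentication (e.g. token retrieval) would go here.
    val props = new Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    driver.connect(options("url"), props)
  }
}
```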
### Why are the changes needed?
Currently there is no way to plug in custom authentication.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing + additional unit tests
* Docker integration tests
* Tested manually the newly created external JDBC connection provider
Closes #29024 from gaborgsomogyi/SPARK-32001.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for deciding bucketed table scan dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we will use a bucketed table scan (all input files of a bucket are read by the same task). This has the drawback that if the bucketed table scan is not beneficial at all (no join/group-by/etc. in the query), we still use it, even though it restricts the # of tasks to the # of buckets and might hurt parallelism.
The feature is to add a physical plan rule right after `EnsureRequirements`:
The rule traverses the plan nodes. For each operator that has an "interesting partition" (i.e., requires `ClusteredDistribution` or `HashClusteredDistribution`), it checks whether the operator's sub-plan contains an `Exchange` and a bucketed table scan (only certain operators are allowed in the sub-plan, e.g. `Scan/Filter/Project/Sort/PartialAgg/etc.`; see `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition` for details). If so, the bucketed table scan in the sub-plan is disabled. In addition, the bucketed table scan is disabled if there's an operator with an interesting partition along the sub-plan.
The algorithm works because if there's a shuffle between the bucketed table scan and the operator with an interesting partition, the bucketed partitioning will be destroyed by the shuffle operator in the middle, so the bucketed table scan is definitely not needed.
The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .
### Why are the changes needed?
To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).
### Does this PR introduce _any_ user-facing change?
A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries; see the discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in/out by enabling/disabling the config. As we found in prod, some users rely on the assumption of # of tasks == # of buckets when reading a bucketed table to precisely control the # of tasks. This is a bad assumption, but it does happen on our side, so we leave a config here to allow them to opt out of the feature.
### How was this patch tested?
Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`
Closes #29804 from c21/bucket-rule.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add code in `ScalaReflection` to support Scala enumerations and map the enumeration type to string type in Spark.
### Why are the changes needed?
We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.
Here is an example.
```
package test
object TestEnum extends Enumeration {
  type TestEnum = Value
  val E1, E2, E3 = Value
}
import TestEnum._

case class TestClass(i: Int, e: TestEnum)
import test._
Seq(TestClass(1, TestEnum.E1)).toDS
```
Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```
After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`
### Does this PR introduce _any_ user-facing change?
Yes, user can make case class which include scala enumeration field as dataset.
### How was this patch tested?
Add test.
Closes #29403 from ulysses-you/SPARK-32585.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
### What changes were proposed in this pull request?
After SPARK-32851 set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` used by `SharedSparkSessionBase` to construct the test `SparkSession`, the test `SPARK-32459: UDF should not fail on WrappedArray` in s.sql.UDFSuite exposed a codegen fallback issue in Scala 2.13, as follows:
```
- SPARK-32459: UDF should not fail on WrappedArray *** FAILED ***
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)"
```
The root cause is that `WrappedArray` is an alias for `mutable.ArraySeq` in Scala 2.13, which has a different `newBuilder` method signature.
The main change of this PR is to add a Scala 2.13-only code path to handle matching on `WrappedArray` in Scala 2.13.
### Why are the changes needed?
We need to support a Scala 2.13 build
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: All tests passed.
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0
*** 1 TEST FAILED ***
```
**After**
```
Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```
Closes #29903 from LuciferYang/fix-udfsuite.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume that the same `ExprId` refers to a unique attribute. But, `RuleExecutor` does not check this integrity between logical plan transformations. So, this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.
This PR comes from the discussion with cloud-fan and viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278
### Why are the changes needed?
For better logical plan integrity checking.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29585 from maropu/PlanIntegrityTest.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR fixes a statistics estimation issue when a child node reports 0 bytes.
### Why are the changes needed?
The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This can generate an incorrect BroadcastJoin, resulting in a driver OOM. For example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #29894 from wangyum/SPARK-33018.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The UT for SPARK-32019 (#28853) tries to write about 16GB of data to disk. We change the value of `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`.
The other parameters in this UT are also changed to smaller values to keep the behavior the same.
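A hedged sketch of the idea (the values and path below are illustrative, not the ones used in the UT; assumes an active `SparkSession` named `spark`):
```scala
import org.apache.spark.sql.internal.SQLConf

// Lowering spark.sql.files.maxPartitionBytes (default 128MB) changes how Spark splits
// input files into read partitions, so the same behavior can be exercised with only a
// few MB of data instead of ~16GB.
spark.range(0, 1000000).write.mode("overwrite").parquet("/tmp/spark-32970-demo") // hypothetical path
spark.conf.set(SQLConf.FILES_MAX_PARTITION_BYTES.key, (64 * 1024).toString)      // 64KB
println(spark.read.parquet("/tmp/spark-32970-demo").rdd.getNumPartitions)        // many small splits
```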
### Why are the changes needed?
The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is a few seconds.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #29842 from tanelk/SPARK-32970.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to optimize the from_json + to_json expression chain.
### Why are the changes needed?
To optimize JSON expression chains that could be written manually or generated automatically during query optimization.
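A hedged example of the kind of chain this rule can simplify (assumes an active `SparkSession` named `spark`):
```scala
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

// When the parse schema matches the JSON layout, to_json(from_json(c)) is just `c`,
// so the optimizer can drop the redundant parse/serialize round trip.
val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""").toDF("json")
df.select(to_json(from_json($"json", schema)).as("roundtrip")).explain(true)
```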
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #29828 from viirya/SPARK-32948.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)
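A hedged illustration of the documented behavior (assumes an active `SparkSession` named `spark`):
```scala
// Both invocations are folded to the same literal at the start of query evaluation,
// so the comparison is always true within a single query.
spark.sql("SELECT current_timestamp() = current_timestamp() AS same_value").show()
```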
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running `./dev/scalastyle`.
Closes #29892 from MaxGekk/doc-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Compute the current date at the specified time zone using timestamp taken at the start of query evaluation.
### Why are the changes needed?
According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
Closes #29889 from MaxGekk/fix-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.
### Why are the changes needed?
Make the name of `nth_value` correct and show the `ignoreNulls` parameter correctly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test.
Closes #29886 from beliefer/improve-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add canonicalization rules for commutative bitwise operations.
### Why are the changes needed?
Canonical form is used in many other optimization rules. Reduces the number of cases, where plans with identical results are considered to be distinct.
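A hedged illustration (assumes an active `SparkSession` named `spark` and `import spark.implicits._`):
```scala
import org.apache.spark.sql.functions.lit

// With commutative bitwise operations canonicalized, these two semantically identical
// filters share one canonical form, so rules that compare expressions (reuse, dedup, etc.)
// can recognize the plans as equivalent.
val q1 = spark.range(10).filter($"id".bitwiseAND(lit(3)) === 1)
val q2 = spark.range(10).filter(lit(3).bitwiseAND($"id") === 1)
```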
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #29794 from tanelk/SPARK-32927.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.
### Why are the changes needed?
On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error.
Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
### Does this PR introduce _any_ user-facing change?
Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.
### How was this patch tested?
Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.
Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
- JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.
Closes #29875 from rednaxelafx/spark-32999-getsimplename.
Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Unevaluable expressions are not foldable because we don't have an eval for them. This PR is to clean up the code and enforce it.
### Why are the changes needed?
Ensure that we will not hit the weird cases that trigger ConstantFolding.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes #29798 from gatorsmile/refactorUneval.
Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to add a new `table` API in DataStreamReader, which is similar to the `table` API in DataFrameReader.
### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:
Application 1 for initializing and starting the streaming job:
```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"
// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)
// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```
Application 2 for appending new data:
```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```
Check result:
```
// The destination directory should contain all the written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```
### Does this PR introduce _any_ user-facing change?
Yes, a new API added.
### How was this patch tested?
New UT added and integrated testing.
Closes #29756 from xuanyuanking/SPARK-32885.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add test to cover Hive UDF whose input contains complex decimal type.
Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.
### Why are the changes needed?
For better test coverage of the Hive behaviors that we are (or are not) compatible with.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes #29863 from ulysses-you/SPARK-32877-test.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.
### How was this patch tested?
Added a new test
Closes #29866 from imback82/refresh_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This moves and refactors the parallel listing utilities from `InMemoryFileIndex` to Spark core so they can be reused by modules besides SQL. Along the way, this also does some cleanups/refactorings:
- Created a `HadoopFSUtils` class under core
- Moved `InMemoryFileIndex.bulkListLeafFiles` into `HadoopFSUtils.parallelListLeafFiles`. It now depends on a `SparkContext` instead of `SparkSession` in SQL. Also added a few parameters which used to be read from `SparkSession.conf`: `ignoreMissingFiles`, `ignoreLocality`, `parallelismThreshold`, `parallelismMax` and `filterFun` (for additional filtering support, but we may be able to merge this with the `filter` parameter in the future).
- Moved `InMemoryFileIndex.listLeafFiles` into `HadoopFSUtils.listLeafFiles` with similar changes above.
### Why are the changes needed?
Currently the locality-aware parallel listing mechanism only applies to `InMemoryFileIndex`. By moving this to core, we can potentially reuse the same mechanism for other code paths as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Since this is mostly a refactoring, it relies on existing unit tests such as those for `InMemoryFileIndex`.
Closes #29471 from sunchao/SPARK-32381.
Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
The default save mode is always `ErrorIfExists` regardless of the DataSource version. Fixing the JavaDoc to reflect this.
### Why are the changes needed?
To fix documentation
### Does this PR introduce _any_ user-facing change?
Doc change.
### How was this patch tested?
Manual.
Closes #29853 from RussellSpitzer/SPARK-32977.
Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove unnecessary code.
### Why are the changes needed?
General housekeeping. Might be a slight performance improvement, especially on big-endian systems.
There is no need for separate code paths for big- and little-endian platforms in putDoubles and putFloats anymore (since PR #24861). On all platforms values are encoded in native byte order and can just be copied directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29815 from mundaym/clean-putfloats.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
MurmurHash3 and xxHash64 interpret sequences of bytes as integers encoded in little-endian byte order. This requires a byte reversal on big-endian platforms.
I've left the hashInt and hashLong functions as-is for now. My interpretation of these functions is that they perform the hash on the integer value as if it were serialized in little-endian byte order. Therefore no byte reversal is necessary.
### What changes were proposed in this pull request?
Modify hash functions to produce correct results on big-endian platforms.
### Why are the changes needed?
Hash functions produce incorrect results on big-endian platforms which, amongst other potential issues, causes test failures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests run on the IBM Z (s390x) platform which uses a big-endian byte order.
Closes #29762 from mundaym/fix-hashes.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR addresses two issues related to the `Relation: view text` test in `DataSourceV2SQLSuite`.
1. The test has the following block:
```scala
withView("view1") { v1: String =>
sql(...)
}
```
Since `withView`'s signature is `withView(v: String*)(f: => Unit): Unit`, the `f` that will be executed is `v1: String => sql(..)`, which merely defines the anonymous function and does _not_ execute it.
2. Once the test is fixed to run, it actually fails. The reason is that the v2 session catalog implementation used in tests does not correctly handle `V1Table` for views in `loadTable`. And this results in views resolved to `ResolvedTable` instead of `ResolvedView`, causing the test failure: f1dc479d39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L1007-L1011)
### Why are the changes needed?
Fixing a bug in test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes #29811 from imback82/fix_minor_test.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, with an example that uses **multiple values** in the input column.
### Why are the changes needed?
The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`
Closes #29841 from MaxGekk/example-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
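A hedged illustration of the clarified wording (assumes an active `SparkSession` named `spark`):
```scala
// percentile_approx returns one of the input elements (0, 1, 2 or 10 here),
// never an interpolated value such as 1.5.
spark.sql(
  "SELECT percentile_approx(col, 0.5) FROM VALUES (0), (1), (2), (10) AS tab(col)").show()
```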
### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`./dev/scalastyle`
Closes #29835 from MaxGekk/doc-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This is a minor followup of https://github.com/apache/spark/pull/29601 , to preserve the attribute name in `SubqueryBroadcastExec.output`.
### Why are the changes needed?
During explain, it's better to see the original column name instead of always "key".
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests.
Closes #29839 from cloud-fan/followup2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/29475.
This PR updates the code to broadcast the Array instead of the Set, which was the behavior before #29475.
### Why are the changes needed?
The size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the set on the executor side.
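A hedged sketch of the general idea (illustrative RDD code, not the actual change; assumes an active `SparkSession` named `spark`):
```scala
// Broadcast the compact Array and build the Set lazily on the executor side, instead of
// broadcasting the (potentially much larger) Set itself.
val keys = (1 to 1000000).toArray
val bcKeys = spark.sparkContext.broadcast(keys)
val filtered = spark.sparkContext.parallelize(1 to 10000000).mapPartitions { iter =>
  val keySet = bcKeys.value.toSet // materialized per task, from the broadcasted Array
  iter.filter(keySet.contains)
}
filtered.count()
```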
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #29838 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds foldable propagation from `Aggregate` as per: https://github.com/apache/spark/pull/29771#discussion_r490412031
### Why are the changes needed?
This is an improvement as `Aggregate`'s `aggregateExpressions` can contain foldables that can be propagated up.
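A hedged example of the pattern this enables (not taken from the PR's tests; assumes an active `SparkSession` named `spark` and `import spark.implicits._`):
```scala
import org.apache.spark.sql.functions.{count, lit}

// `grp` is a foldable literal produced by the Aggregate, so the reference to it in the
// ORDER BY above can be replaced by the literal itself.
spark.range(5)
  .groupBy()
  .agg(lit("all").as("grp"), count(lit(1)).as("cnt"))
  .orderBy($"grp")
  .explain(true)
```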
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
Closes #29816 from peter-toth/SPARK-32951-foldable-propagation-from-aggregate.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, when explaining a SQL plan that contains a HiveTableRelation, the output shows a lot of information about the relation's pruned partitions, which makes the plan hard to read. This PR makes this information simpler.
Before:
![image](https://user-images.githubusercontent.com/46485123/93012078-aeeca080-f5cf-11ea-9286-f5c15eadbee3.png)
For the UT below:
```
test("Make HiveTableScanExec message simple") {
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
withTable("df") {
spark.range(30)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("hive")
.mode("overwrite")
.saveAsTable("df")
val df = sql("SELECT df.id, df.k FROM df WHERE df.k < 2")
df.explain(true)
}
}
}
```
After this PR, the plan will show:
```
== Parsed Logical Plan ==
'Project ['df.id, 'df.k]
+- 'Filter ('df.k < 2)
+- 'UnresolvedRelation [df], []
== Analyzed Logical Plan ==
id: bigint, k: bigint
Project [id#11L, k#12L]
+- Filter (k#12L < cast(2 as bigint))
+- SubqueryAlias spark_catalog.default.df
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L]]
== Optimized Logical Plan ==
Filter (isnotnull(k#12L) AND (k#12L < 2))
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]]
== Physical Plan ==
Scan hive default.df [id#11L, k#12L], HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]], [isnotnull(k#12L), (k#12L < 2)]
```
In this PR, I implement `HiveTableRelation`'s `simpleString` method to avoid showing too much unnecessary info in the explain plan. Compared to what we had before, I drop the detailed metadata of each partition and retain only the partition spec to show which partitions were pruned. For the detailed information, we don't look at the plan anyway but use the DESC EXTENDED statement.
### Why are the changes needed?
Make plan about HiveTableRelation more readable
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes #29739 from AngersZhuuuu/HiveTableScan-meta-location-info.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add optional `allowMissingColumns` argument to SparkR `unionByName`.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
`unionByName` supports `allowMissingColumns`.
### How was this patch tested?
Existing unit tests. New unit tests targeting this feature.
Closes #29813 from zero323/SPARK-32799.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up PR to https://github.com/apache/spark/pull/29771 and just adds a new test case.
### Why are the changes needed?
To have better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
Closes #29802 from peter-toth/SPARK-32635-fix-foldable-propagation-followup.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/29660 and https://github.com/apache/spark/pull/29689 there are 13 remaining failing test cases in the sql/core module with Scala 2.13.
The reason for the remaining failures is that the optimization result of `CostBasedJoinReorder` may differ for the same input between Scala 2.12 and Scala 2.13 when there is more than one candidate plan with the same cost.
This PR provides a way to make the optimization result as deterministic as possible, to pass all remaining failed cases of the `sql/core` module in Scala 2.13. The main changes of this PR are as follows:
- Change to use `LinkedHashMap` instead of `Map` to store `foundPlans` in the `JoinReorderDP.search` method, to ensure the same iteration order for the same insertion order, because the iteration order of `Map` behaves differently under Scala 2.12 and 2.13 (see the small illustration after this list)
- Fixed `StarJoinCostBasedReorderSuite` affected by the above change
- Regenerate golden files affected by the above change.
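A hedged illustration of the underlying non-determinism (not the `JoinReorderDP` code itself):
```scala
import scala.collection.mutable

// For more than four entries, the default immutable Map is hash-based, so its iteration
// order is unrelated to insertion order and can differ between Scala 2.12 and 2.13.
val plain = Map("e" -> 1, "d" -> 2, "c" -> 3, "b" -> 4, "a" -> 5)
// LinkedHashMap always iterates in insertion order, on both Scala versions.
val linked = mutable.LinkedHashMap("e" -> 1, "d" -> 2, "c" -> 3, "b" -> 4, "a" -> 5)
println(plain.keys.mkString(","))  // order unspecified, may vary across Scala versions
println(linked.keys.mkString(",")) // always: e,d,c,b,a
```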
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: All tests passed.
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8485, failed 13, canceled 1, ignored 52, pending 0
*** 13 TESTS FAILED ***
```
**After**
```
Tests: succeeded 8498, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```
Closes #29711 from LuciferYang/SPARK-32808-3.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Found via the discussion in https://github.com/apache/spark/pull/29746#issuecomment-694726504; the root cause is that hive-1.2 does not recognize NULL:
```scala
sbt.ForkMain$ForkError: java.sql.SQLException: Unrecognized column type: NULL
at org.apache.hive.jdbc.JdbcColumn.typeStringToHiveType(JdbcColumn.java:160)
at org.apache.hive.jdbc.HiveResultSetMetaData.getHiveType(HiveResultSetMetaData.java:48)
at org.apache.hive.jdbc.HiveResultSetMetaData.getPrecision(HiveResultSetMetaData.java:86)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$35(SparkThriftServerProtocolVersionsSuite.scala:358)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$35$adapted(SparkThriftServerProtocolVersionsSuite.scala:351)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:66)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$34(SparkThriftServerProtocolVersionsSuite.scala:351)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
at org.scalatest.Suite.run(Suite.scala:1112)
at org.scalatest.Suite.run$(Suite.scala:1094)
at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
In this PR, we simply skip these checks for Hive 1.2.
### Why are the changes needed?
Fix Jenkins.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
test itself.
Closes #29803 from yaooqinn/SPARK-32874-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to replace deprecated `isFile` and `isDirectory` methods.
```diff
- fs.isDirectory(hadoopPath)
+ fs.getFileStatus(hadoopPath).isDirectory
```
```diff
- fs.isFile(new Path(inProgressLog))
+ fs.getFileStatus(new Path(inProgressLog)).isFile
```
### Why are the changes needed?
It shows deprecation warnings.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull
```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn] if (!fs.isFile(new Path(inProgressLog))) {
```
```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn] if (fs.isDirectory(hadoopPath)) {
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins.
Closes #29796 from williamhyun/filesystem.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error `e = max(gi, deltai)/2` (see page 59). This is also clearly explained in [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1). A short recap of the bound is included after this list.
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.
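For reference, a hedged recap of the bound, following the cited Greenwald–Khanna paper and notes rather than Spark's exact code:
```latex
% Each stored sample $s_i$ carries counts $g_i$ and $\Delta_i$, and its true rank is only
% known up to an interval:
\sum_{j \le i} g_j \;\le\; \operatorname{rank}(s_i) \;\le\; \sum_{j \le i} g_j + \Delta_i ,
% and the cited notes bound the worst-case error of answering a quantile query from such a
% summary by
e_i \;=\; \frac{\max(g_i, \Delta_i)}{2} ,
% which is the target error this PR now uses.
```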
### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.
Closes #29784 from MaxGekk/fix-percentile_approx-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Recently, we added code to log plan changes in the preparation phase in `QueryExecution` for execution (https://github.com/apache/spark/pull/29544). This PR intends to apply the same fix for logging plan changes in AQE.
### Why are the changes needed?
Easy debugging for AQE plans
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes #29774 from maropu/PlanChangeLogForAQE.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR rewrites the `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.
Before this PR, in the case of this example (with `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation` set):
```scala
val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
val aub = a.union(b)
val c = aub.filter($"col1" === "2").cache()
val d = Seq("2").toDF( "col4")
val r = d.join(aub, $"col2" === $"col4").select("col4")
val l = c.select("col2")
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
df.show()
```
foldable propagation happens incorrectly:
```
Join LeftOuter, (col2#6 = col4#34) Join LeftOuter, (col2#6 = col4#34)
!:- Project [col2#6] :- Project [1 AS col2#6]
: +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas) : +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- Union : +- Union
: :- *(1) Project [value#1 AS col1#4, 1 AS col2#6] : :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
: : +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2)) : : +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
: : +- *(1) LocalTableScan [value#1] : : +- *(1) LocalTableScan [value#1]
: +- *(2) Project [value#10 AS col1#13, 2 AS col2#15] : +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
: +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2)) : +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
: +- *(2) LocalTableScan [value#10] : +- *(2) LocalTableScan [value#10]
+- Project [col4#34] +- Project [col4#34]
+- Join Inner, (col2#6 = col4#34) +- Join Inner, (col2#6 = col4#34)
:- Project [value#31 AS col4#34] :- Project [value#31 AS col4#34]
: +- LocalRelation [value#31] : +- LocalRelation [value#31]
+- Project [col2#6] +- Project [col2#6]
+- Union false, false +- Union false, false
:- Project [1 AS col2#6] :- Project [1 AS col2#6]
: +- LocalRelation [value#1] : +- LocalRelation [value#1]
+- Project [2 AS col2#15] +- Project [2 AS col2#15]
+- LocalRelation [value#10] +- LocalRelation [value#10]
```
and so the result is wrong:
```
+----+----+
|col2|col4|
+----+----+
| 1|null|
+----+----+
```
After this PR foldable propagation will not happen incorrectly and the result is correct:
```
+----+----+
|col2|col4|
+----+----+
| 2| 2|
+----+----+
```
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Existing and new UTs.
Closes #29771 from peter-toth/SPARK-32635-fix-foldable-propagation.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>