### What changes were proposed in this pull request?
Adds `nth_value` function to SparkR.
### Why are the changes needed?
Feature parity. The function has already been added to [Scala](https://issues.apache.org/jira/browse/SPARK-27951) and [Python](https://issues.apache.org/jira/browse/SPARK-33020).
### Does this PR introduce _any_ user-facing change?
Yes. New function is exposed to R users.
### How was this patch tested?
New unit tests.
Closes #29905 from zero323/SPARK-33030.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache ORC to 1.5.12.
### Why are the changes needed?
This brings us the latest bug fixes, such as the following:
- ORC-644 nested struct evolution does not respect to orc.force.positional.evolution
- ORC-667 Positional mapping for nested struct types should not applied by default
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI.
Closes #29930 from dongjoon-hyun/SPARK-ORC-1.5.12.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes `SparkBuild.scala` to recognize build settings for Scala 2.13.
In `SparkBuild.scala`, the variable `scalaBinaryVersion` is hardcoded as `2.12`, so the environment variable `SPARK_SCALA_VERSION` is also set to `2.12`.
This causes some test suites (e.g. `SparkSubmitSuite`) to fail.
```
===== TEST OUTPUT FOR o.a.s.deploy.SparkSubmitSuite: 'user classpath first in driver' =====
20/10/02 08:55:30.234 redirect stderr for command /home/kou/work/oss/spark-scala-2.13/bin/spark-submit INFO Utils: Error: Could not find or load m
ain class org.apache.spark.launcher.Main
20/10/02 08:55:30.235 redirect stderr for command /home/kou/work/oss/spark-scala-2.13/bin/spark-submit INFO Utils: /home/kou/work/oss/spark-scala-
2.13/bin/spark-class: line 96: CMD: bad array subscript
```
The reason for this error is that the environment variables `SPARK_JARS_DIR` and `LAUNCH_CLASSPATH` are defined in `bin/spark-class` as follows.
```
SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
```
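The idea of the fix can be sketched as follows. This is a hypothetical illustration only, not the actual patch: derive the binary version from the Scala version selected by the build instead of hardcoding `2.12`, so that `SPARK_SCALA_VERSION` and the target directories match the active profile.
```scala
// Hypothetical sketch (illustrative, not the actual SparkBuild.scala change):
// derive the binary version from the full Scala version instead of hardcoding "2.12".
val scalaFullVersion = sys.props.getOrElse("scala.version", "2.12.10")
val scalaBinaryVersion = scalaFullVersion.split('.').take(2).mkString(".")
```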
### Why are the changes needed?
To build for Scala 2.13 successfully.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests for `core` module finish successfully.
```
build/sbt -Pscala-2.13 clean "core/test"
```
Closes #29927 from sarutak/fix-sparkbuild-for-scala-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In SPARK-32493, the R installation was switched to a manual installation because setup-r was broken. This seems to be fixed upstream, so we should switch it back.
### Why are the changes needed?
To avoid maintaining the installation steps ourselves.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions build in this PR should test it.
Closes #29931 from HyukjinKwon/recover-r-build.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
At the moment only the baked-in JDBC connection providers can be used, but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API named `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider); a rough sketch is also included after the list below.
The PR contains the following changes:
* Added connection provider developer API
* Made the JDBC connection providers' constructors no-arg, which is needed to load them with the service loader
* Connection providers are now loaded with the service loader
* Added tests to load providers independently
* Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config
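As a rough sketch, an external provider might look like the following. The method names and signatures here are assumptions for illustration and may differ from the actual developer API:
```scala
import java.sql.{Connection, Driver}

// Hypothetical shape of the developer API (names/signatures are assumptions).
abstract class JdbcConnectionProvider {
  // Whether this provider can handle the given driver/options combination.
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  // Open a connection, applying any provider-specific authentication.
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

// Example external provider registered via META-INF/services (service loader),
// hence the no-arg constructor mentioned above.
class MyDatabaseConnectionProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("url").exists(_.startsWith("jdbc:mydb:"))

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new java.util.Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    driver.connect(options("url"), props)
  }
}
```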
### Why are the changes needed?
Currently there is no way to plug in custom authentication.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing + additional unit tests
* Docker integration tests
* Tested manually the newly created external JDBC connection provider
Closes #29024 from gaborgsomogyi/SPARK-32001.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for deciding bucketed table scan dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan we use a bucketed table scan (all input files for a bucket are read by the same task). This has the drawback that if the bucketed table scan is not beneficial at all (no join/group-by/etc. in the query), we still use it, which restricts the number of tasks to the number of buckets and can hurt parallelism.
The feature is to add a physical plan rule right after `EnsureRequirements`:
The rule goes through plan nodes. For each operator that has an "interesting partition" (i.e., requires `ClusteredDistribution` or `HashClusteredDistribution`), check whether the sub-plan for the operator has an `Exchange` and a bucketed table scan (and only allow certain operators in the plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc.`), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If so, disable the bucketed table scan in the sub-plan. In addition, disable the bucketed table scan if there is no operator with an interesting partition along the sub-plan.
The reason the algorithm works is that if there is a shuffle between the bucketed table scan and the operator with the interesting partition, the bucketed table scan's partitioning will be destroyed by the shuffle operator in the middle, so we certainly don't need the bucketed table scan.
The idea of "interesting partition" is inspired by "interesting order" in "Access Path Selection in a Relational Database Management System" (http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), following a discussion with cloud-fan.
### Why are the changes needed?
To avoid unnecessary bucketed scans in the query; this is also a prerequisite for https://github.com/apache/spark/pull/29625 (deciding bucketed sorted scan dynamically will be added later in that PR).
### Does this PR introduce _any_ user-facing change?
A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in or out by enabling/disabling the config. As we found in production, some users rely on the assumption that the number of tasks equals the number of buckets when reading a bucketed table, in order to precisely control the number of tasks. This is a fragile assumption, but it does happen on our side, so we leave a config here to allow them to opt out of the feature.
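A minimal sketch of how a user might opt in, assuming a hypothetical bucketed table named `bucketed_table`:
```scala
// Spark-shell style snippet; "bucketed_table" is an illustrative table name.
// Opt in to the new rule (it is disabled by default, per the description above):
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")

// For a query with no join/aggregation on the bucketing column, the rule can fall
// back to a regular (non-bucketed) scan so parallelism is not limited to the bucket count.
spark.table("bucketed_table").filter("value > 0").show()
```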
### How was this patch tested?
Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`
Closes #29804 from c21/bucket-rule.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This change updates the test file location from #29872 to the proper path.
### Why are the changes needed?
ExecutorSummarySuite.scala should be in core/src/test/scala instead of core/src/test/java.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes #29926 from shrutig/SPARK-32996.
Authored-by: Shruti Gumma <shruti_gumma@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR fixes the description of how to build Spark for Scala 2.13 with sbt.
In the current doc, building Spark for Scala 2.13 with sbt is described as follows:
![scala-2 13-build-before](https://user-images.githubusercontent.com/4736016/94816248-80c3e900-0436-11eb-9bc2-99af5786971a.png)
But the build fails with this command because the scala-2.13 profile is not enabled and scala-parallel-collections is absent.
```
[error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:23: object parallel is not a member of package collection
```
The correct command should be:
```
build/sbt -Pscala-2.13 compile
```
### Why are the changes needed?
The build command is wrong.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I checked that `sbt -Pscala-2.13` is correct with the following command:
```
build/sbt -Dscala.version=2.13.3 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile
```
I also built the modified doc and checked the generated HTML:
![spark-scala-2 13-build-doc-after](https://user-images.githubusercontent.com/4736016/94869259-f2745500-047f-11eb-89e5-20816f3ed24d.png)
Closes #29921 from sarutak/fix-scala-2.13-build-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add code in `ScalaReflection` to support Scala enumerations, mapping the enumeration type to string type in Spark.
### Why are the changes needed?
We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.
Here is an example.
```
package test

object TestEnum extends Enumeration {
  type TestEnum = Value
  val E1, E2, E3 = Value
}

import TestEnum._

case class TestClass(i: Int, e: TestEnum) {
}

import test._

Seq(TestClass(1, TestEnum.E1)).toDS
```
Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```
After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`
### Does this PR introduce _any_ user-facing change?
Yes, users can now create a Dataset from a case class that includes a Scala enumeration field.
### How was this patch tested?
Add test.
Closes #29403 from ulysses-you/SPARK-32585.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache Hive `hive-storage-api` library from 2.7.1 to 2.7.2.
### Why are the changes needed?
[storage-api 2.7.2](https://github.com/apache/hive/commits/rel/storage-release-2.7.2/storage-api) has the following extensions and can be used when users use a provided ORC dependency.
[HIVE-22959](dade9919d9 (diff-ccfc9dd7584117f531322cda3a29f3c3)) : Extend storage-api to expose FilterContext
[HIVE-23215](361925d2f3 (diff-ccfc9dd7584117f531322cda3a29f3c3)) : Make FilterContext and MutableFilterContext interfaces
### Does this PR introduce _any_ user-facing change?
Yes. This is a dependency change.
### How was this patch tested?
Pass the existing tests.
Closes #29923 from dongjoon-hyun/SPARK-33047.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` that `SharedSparkSessionBase` uses to construct the test `SparkSession`, the test `SPARK-32459: UDF should not fail on WrappedArray` in `sql.UDFSuite` exposed a codegen fallback issue in Scala 2.13, as follows:
```
- SPARK-32459: UDF should not fail on WrappedArray *** FAILED ***
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)"
```
The root cause is that `WrappedArray` is an alias for `mutable.ArraySeq` in Scala 2.13, whose `newBuilder` method has a different signature.
The main change of this PR is to add a Scala 2.13-only code path to handle the `WrappedArray` case match in Scala 2.13.
### Why are the changes needed?
We need to support a Scala 2.13 build
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: All tests passed.
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0
*** 1 TEST FAILED ***
```
**After**
```
Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```
Closes #29903 from LuciferYang/fix-udfsuite.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix Typo
### Why are the changes needed?
To maintain consistency; the correct table name should be used in the SELECT command.
### Does this PR introduce _any_ user-facing change?
Yes. The CREATE FUNCTION doc now shows the correct table name.
### How was this patch tested?
Manually. Doc changes.
Closes #29920 from iRakson/fixTypo.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Upgrade to the latest available version of jQuery (3.5.1).
### Why are the changes needed?
There are some reported CVEs (CVE-2020-11022, CVE-2020-11023) affecting older versions of jQuery. Although the Spark UI is read-only and those CVEs don't seem to affect Spark, using the latest version of this library helps handle vulnerability reports from security scans.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual tests and checked the jQuery 3.5 upgrade guide.
Closes #29902 from peter-toth/SPARK-32723-upgrade-to-jquery-3.5.1.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/29881#discussion_r496648397,
we need to add the condition `Utils.isWindows`.
### Why are the changes needed?
To add a stricter condition for judging whether a path is a Windows path.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes #29909 from AngersZhuuuu/SPARK-33023.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `kubernetes-client` library to track fabric8's declared compatibility for k8s 1.18.0:
https://github.com/fabric8io/kubernetes-client#compatibility-matrix
### Why are the changes needed?
According to fabric8, 4.9.2 is incompatible with k8s 1.18.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not tested yet.
Closes #29888 from laflechejonathan/jlf/fabric8Ugprade.
Authored-by: jlafleche <jlafleche@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume that the same `ExprId` refers to a unique attribute, but `RuleExecutor` does not check this integrity between logical plan transformations. So this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.
This PR comes from the discussion with cloud-fan and viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278
### Why are the changes needed?
For better logical plan integrity checking.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29585 from maropu/PlanIntegrityTest.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of adding a warning to the documentation, this PR aims to use the consistent and safer version of the Apache Hadoop file output committer algorithm, `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will still be used.
### Why are the changes needed?
Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from `v1` to `v2` and now there exists a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across various Spark distributions.
- [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default
### Does this PR introduce _any_ user-facing change?
Yes. This changes the default behavior. Users can override this conf.
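As a quick illustration of the override, mirroring the spark-shell sessions shown below (for new applications, passing `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` at submit time is the equivalent):
```scala
// Spark-shell style check/override of the committer algorithm for the current context.
sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version") // "1" after this PR
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") // opt back into v2
```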
### How was this patch tested?
Manual.
**BEFORE (spark-3.0.1-bin-hadoop3.2)**
```scala
scala> sc.version
res0: String = 3.0.1
scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res1: String = 2
```
**AFTER**
```scala
scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res0: String = 1
```
Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes a statistics estimation issue when a child plan reports 0 bytes.
### Why are the changes needed?
The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This can produce an incorrect BroadcastJoin, resulting in a driver OOM. For example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #29894 from wangyum/SPARK-33018.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes `UnsafeExternalSorter` to no longer allocate any memory while spilling. In particular it removes the allocation of a new pointer array in `UnsafeInMemorySorter`. Instead the new pointer array is allocated whenever the next record is inserted into the sorter.
### Why are the changes needed?
Without this change the `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would have triggered an OOM:
1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array.
5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array.
6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill.
7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail.
With the changes in the PR the following will happen instead:
1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
4. `UnsafeInMemorySorter` frees the old pointer array.
5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the new large array or by throwing a `SparkOutOfMemoryError`).
6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step.
7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`.
Closes #29785 from tomvanbussel/SPARK-32901.
Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
The UT for SPARK-32019 (#28853) tries to write about 16GB of data to disk. We must change the value of `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`.
The other parameters in this UT are also changed to smaller values to keep the behavior the same.
### Why are the changes needed?
The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is a few seconds.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #29842 from tanelk/SPARK-32970.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to optimize from_json + to_json expression chain.
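A minimal sketch of the kind of chain this optimization targets (spark-shell style; the column name and schema below are illustrative):
```scala
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("a", StringType)))
val df = Seq("""{"a":"x"}""").toDF("value")
// from_json immediately followed by to_json can be collapsed instead of
// parsing and re-serializing the JSON string.
df.select(to_json(from_json(col("value"), schema))).show(false)
```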
### Why are the changes needed?
To optimize JSON expression chains that could be written manually or generated automatically during query optimization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #29828 from viirya/SPARK-32948.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)
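A quick way to observe the documented behavior in a spark-shell session:
```scala
// Both calls are folded to the same literal at the start of query evaluation,
// so the two columns are equal within this single query.
spark.sql("SELECT current_timestamp() AS t1, current_timestamp() AS t2").show(false)
```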
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running `./dev/scalastyle`.
Closes #29892 from MaxGekk/doc-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.
### Why are the changes needed?
To keep the APIs consistent across languages.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new PySpark function API.
### How was this patch tested?
Unittest was added.
Closes #29899 from HyukjinKwon/SPARK-33020.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Compute the current date at the specified time zone using timestamp taken at the start of query evaluation.
### Why are the changes needed?
According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
Closes #29889 from MaxGekk/fix-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move functions related test cases from `test_context.py` to `test_functions.py`.
### Why are the changes needed?
To group the similar test cases.
### Does this PR introduce _any_ user-facing change?
Nope, test-only.
### How was this patch tested?
Jenkins and GitHub Actions should test.
Closes #29898 from HyukjinKwon/SPARK-33021.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.
### Why are the changes needed?
Make the name of `nth_value` correct and show the `ignoreNulls` parameter correctly in the SQL representation.
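For reference, an illustrative use of the window function whose name/SQL output this PR fixes (the `employees` table and columns are hypothetical):
```scala
spark.sql("""
  SELECT name, dept, salary,
         nth_value(salary, 2) OVER (PARTITION BY dept ORDER BY salary) AS second_lowest
  FROM employees
""").show()
```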
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test.
Closes #29886 from beliefer/improve-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When `peakMemoryMetrics` in `ExecutorSummary` is `Option.empty`, the `ExecutorMetricsJsonSerializer#serialize` method does not call `jsonGenerator.writeObject`. This causes the JSON to be generated with the `peakMemoryMetrics` key added to the serialized string but no corresponding value, which in turn causes an error to be thrown when the next key, `attributes`, is added to the JSON:
`com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
`
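A hedged sketch of the idea behind the fix (not the exact Spark serializer): always emit a value, writing an explicit null when the `Option` is empty, so the generator never ends up with a field name that has no value.
```scala
import com.fasterxml.jackson.core.JsonGenerator
import com.fasterxml.jackson.databind.{JsonSerializer, SerializerProvider}

class OptionalMetricsSerializer extends JsonSerializer[Option[AnyRef]] {
  override def serialize(
      value: Option[AnyRef],
      gen: JsonGenerator,
      serializers: SerializerProvider): Unit = value match {
    case Some(metrics) => gen.writeObject(metrics)
    case None => gen.writeNull() // instead of skipping the value entirely
  }
}
```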
### Why are the changes needed?
At the start of the Spark job, if `peakMemoryMetrics` is `Option.empty`, then it causes
a `com.fasterxml.jackson.core.JsonGenerationException` to be thrown when we navigate to the Executors tab in Spark UI.
Complete stacktrace:
> com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
> at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2080)
> at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:161)
> at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725)
> at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:721)
> at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:166)
> at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:145)
> at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:26)
> at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents$(IterableSerializerModule.scala:25)
> at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
> at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
> at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:250)
> at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
> at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
> at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094)
> at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404)
> at org.apache.spark.ui.exec.ExecutorsPage.allExecutorsDataScript$1(ExecutorsTab.scala:64)
> at org.apache.spark.ui.exec.ExecutorsPage.render(ExecutorsTab.scala:76)
> at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
> at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
> at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
> at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
> at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
> at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
> at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
> at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
> at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.sparkproject.jetty.server.Server.handle(Server.java:505)
> at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370)
> at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
> at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
> at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103)
> at org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
> at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
> at java.base/java.lang.Thread.run(Thread.java:834)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #29872 from shrutig/SPARK-32996.
Authored-by: Shruti Gumma <shruti_gumma@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to promote the stability annotation to `Evolving` for MLEvent traits/classes.
### Why are the changes needed?
The feature was released in Spark 3.0.0, with SPARK-26818 as the last change in Feb. 2020, and hasn't changed in Spark 3.0.1 (no change for more than half a year).
While we'd better wait for a few more minor releases before considering the API stable, it is worth promoting it to Evolving so that we clearly state that we support the API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Just changed the annotation, no tests required.
Closes #29887 from HeartSaVioR/SPARK-33011.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document PySpark specific packaging guidelines.
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
```
cd python/docs
make clean html
```
Closes #29806 from fhoering/add_doc_python_packaging.
Lead-authored-by: Fabian Höring <f.horing@criteo.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add canonicalization rules for commutative bitwise operations.
### Why are the changes needed?
Canonical form is used in many other optimization rules. This reduces the number of cases where plans with identical results are considered distinct.
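A minimal sketch of the idea (not the actual Spark rule): order the operands of commutative bitwise operators deterministically so that, e.g., `(a & b)` and `(b & a)` canonicalize to the same expression and compare as semantically equal.
```scala
import org.apache.spark.sql.catalyst.expressions.{BitwiseAnd, BitwiseOr, BitwiseXor, Expression}

// Illustrative canonicalization: swap operands into a deterministic (hashCode) order.
def canonicalizeBitwise(e: Expression): Expression = e match {
  case BitwiseAnd(l, r) if l.hashCode() > r.hashCode() => BitwiseAnd(r, l)
  case BitwiseOr(l, r) if l.hashCode() > r.hashCode() => BitwiseOr(r, l)
  case BitwiseXor(l, r) if l.hashCode() > r.hashCode() => BitwiseXor(r, l)
  case other => other
}
```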
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #29794 from tanelk/SPARK-32927.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32972; a total of 51 failed Scala test cases and 3 failed Java test cases were fixed. The main changes of this PR are as follows:
- Specified `Seq` as `scala.collection.Seq` in `Seq` pattern-match cases and `x.asInstanceOf[Seq[T]]` casts (see the sketch after this list)
- Use `Row.getSeq[T]` instead of `Row.getAs[Seq]`
- Manually call the `toMap` method to convert `MapView` to `Map` in Scala 2.13
- Change the tol in the last test to 0.75 to pass `RandomForestRegressorSuite#training with sample weights` in Scala 2.13
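An illustrative sketch of the first item, under the assumption of a simple helper (not the actual MLlib code): in Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, so pattern matches that must also accept mutable sequences need to spell out `scala.collection.Seq` explicitly (in Scala 2.12 the two were the same type).
```scala
def toDoubles(value: Any): Seq[Double] = value match {
  // Matching on scala.collection.Seq keeps the same behavior on 2.12 and 2.13.
  case s: scala.collection.Seq[_] => s.map(_.toString.toDouble).toSeq
  case other => throw new IllegalArgumentException(s"Unexpected value: $other")
}
```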
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn
```
**Before**
```
[ERROR] Errors:
[ERROR] JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR] JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR] JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...
Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***
```
**After**
```
[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.
```
Closes #29857 from LuciferYang/fix-mllib-2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Update the comment: `Note, the relevant columns must also be set in inputCols` -> `Note, the relevant columns should also be set in inputCols`;
2. Add a check: if there are `categoricalCols` not set in `inputCols`, log a warning (see the sketch below).
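A rough sketch of the added check; the real code would use the ML logger rather than `println`, and the column names here are illustrative:
```scala
val inputCols = Seq("gender", "age")
val categoricalCols = Seq("gender", "country")

// Warn about categorical columns that are not listed in inputCols instead of silently ignoring them.
categoricalCols.filterNot(inputCols.contains).foreach { c =>
  println(s"WARN: categorical column '$c' is not set in inputCols and will be ignored")
}
```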
### Why are the changes needed?
There is no check to make sure `categoricalCols` are all set in `inputCols`; to keep the existing behavior, the comment is updated and a warning is logged instead.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested in the REPL.
Closes #29868 from zhengruifeng/feature_hash_cat_doc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds two `type: ignores`, one in `pyspark.install` and one in related tests.
### Why are the changes needed?
To satisfy MyPy type checks. It seems we originally missed some changes that happened around the merge of 31a16fbb40:
```
python/pyspark/install.py:30: error: Need type annotation for 'UNSUPPORTED_COMBINATIONS' (hint: "UNSUPPORTED_COMBINATIONS: List[<type>] = ...") [var-annotated]
python/pyspark/tests/test_install_spark.py:105: error: Cannot find implementation or library stub for module named 'xmlrunner' [import]
python/pyspark/tests/test_install_spark.py:105: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Existing tests.
- MyPy tests
```
mypy --show-error-code --no-incremental --config python/mypy.ini python/pyspark
```
Closes #29878 from zero323/SPARK-32714-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
pre-compute the output indices of numerical columns, instead of computing them on each row.
### Why are the changes needed?
For a numerical column, its output index is a hash of its `col_name`, so we can pre-compute it once instead of computing it on each row.
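A minimal sketch of the optimization, assuming a placeholder `hash` of the column name (the real implementation uses the feature hashing function): compute the output index of each numeric column once, then reuse it for every row.
```scala
def hash(colName: String): Int = math.abs(colName.hashCode) % 1024 // placeholder hash

val numericCols = Seq("age", "height", "weight")
val precomputedIndices: Map[String, Int] = numericCols.map(c => c -> hash(c)).toMap

val rows = Seq(Map("age" -> 33.0, "height" -> 180.0, "weight" -> 75.0))
rows.foreach { row =>
  row.foreach { case (colName, value) =>
    val idx = precomputedIndices(colName) // no per-row hashing of the column name
    println(s"output($idx) += $value")
  }
}
```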
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes #29850 from zhengruifeng/hash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.
### Why are the changes needed?
On older JDK versions (e.g. JDK8u), nested Scala classes may cause `java.lang.Class.getSimpleName` to throw a `java.lang.InternalError: Malformed class name` error.
Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
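A rough sketch of the workaround (Spark's actual `Utils.getSimpleName` does more cleanup than this): fall back to deriving a name from `getName` when `getSimpleName` throws the JDK8 "Malformed class name" `InternalError` for deeply nested classes.
```scala
def safeSimpleName(cls: Class[_]): String =
  try {
    cls.getSimpleName
  } catch {
    case _: InternalError =>
      // Fall back to stripping the package prefix from the fully-qualified name.
      val fullName = cls.getName
      fullName.substring(fullName.lastIndexOf('.') + 1)
  }
```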
### Does this PR introduce _any_ user-facing change?
Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.
### How was this patch tested?
Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.
Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
- JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.
Closes #29875 from rednaxelafx/spark-32999-getsimplename.
Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make `HashingTF` use `util.collection.OpenHashMap` instead of `mutable.HashMap`.
### Why are the changes needed?
According to `util.collection.OpenHashMap`'s doc:
> This map is about 5X faster than java.util.HashMap, while using much less space overhead.
Also, according to performance tests such as [Simple microbenchmarks comparing Scala vs Java mutable map performance](https://gist.github.com/pchiusano/1423303), `mutable.HashMap` may be even less efficient than `java.util.HashMap`.
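A sketch of the intended change inside Spark's own code (`OpenHashMap` is a Spark-internal collection, so this only compiles within the Spark source tree): count term frequencies with `changeValue` instead of a `mutable.HashMap`.
```scala
import org.apache.spark.util.collection.OpenHashMap

val termFrequencies = new OpenHashMap[Int, Double]()
val hashedIndices = Seq(3, 7, 3, 11) // hashed term indices for one document
hashedIndices.foreach { i =>
  termFrequencies.changeValue(i, 1.0, _ + 1.0) // insert 1.0 or increment the existing count
}
```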
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes #29852 from zhengruifeng/hashingtf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to support dynamic PVC creation and deletion in K8s driver.
**Configuration**
This PR reuses the existing PVC volume configs.
```
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp2
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=200Gi
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
```
**PVC**
```
$ kubectl get pvc | grep driver
tpcds-d6087874c6705564-driver-pvc-0 Bound pvc-fae914a2-ca5c-4e1e-8aba-54a35357d072 200Gi RWO gp2 12m
```
**Disk**
```
$ k exec -it tpcds-d6087874c6705564-driver -- df -h | grep data
/dev/nvme5n1 197G 61M 197G 1% /data
```
```
$ k exec -it tpcds-d6087874c6705564-driver -- ls -al /data
total 28
drwxr-xr-x 5 root root 4096 Sep 25 18:06 .
drwxr-xr-x 1 root root 63 Sep 25 18:06 ..
drwxr-xr-x 66 root root 4096 Sep 25 18:09 blockmgr-2c9a8cc5-a05c-45fe-a58e-b8f42da88a57
drwx------ 2 root root 16384 Sep 25 18:06 lost+found
drwx------ 4 root root 4096 Sep 25 18:07 spark-0448efe7-da2c-4f3a-bd3c-769aadb11dd6
```
**NOTE**
This should be used carefully because Apache Spark doesn't delete driver pod automatically. Since the driver PVC shares the lifecycle of driver pod, it will exist after the job completion until the pod deletion. However, if the users are already using pre-populated PVCs, this isn't a regression at all in terms of the cost.
```
$ k get pod -l spark-role=driver
NAME READY STATUS RESTARTS AGE
tpcds-d6087874c6705564-driver 0/1 Completed 0 35m
```
### Why are the changes needed?
Like executors, the driver also needs larger PVCs.
### Does this PR introduce _any_ user-facing change?
Yes. This is a new feature.
### How was this patch tested?
Pass the newly added test case.
Closes #29873 from dongjoon-hyun/SPARK-32997.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Unevaluable expressions are not foldable because we don't have an eval implementation for them. This PR is to clean up the code and enforce it.
### Why are the changes needed?
Ensure that we will not hit the weird cases that trigger ConstantFolding.
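A simplified model of the enforcement, assuming a toy `Expression` hierarchy rather than Spark's actual classes: an `Unevaluable` mixin can never be constant-folded because it has no eval.
```scala
trait Expression {
  def foldable: Boolean
  def eval(): Any
}

trait Unevaluable extends Expression {
  final override def foldable: Boolean = false // never considered by ConstantFolding
  final override def eval(): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
}
```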
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes #29798 from gatorsmile/refactorUneval.
Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to add a new `table` API in `DataStreamReader`, which is similar to the `table` API in `DataFrameReader`.
### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:
Application 1 for initializing and starting the streaming job:
```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"
// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)
// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```
Application 2 for appending new data:
```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```
Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```
### Does this PR introduce _any_ user-facing change?
Yes, a new API added.
### How was this patch tested?
New UT added and integrated testing.
Closes #29756 from xuanyuanking/SPARK-32885.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add test to cover Hive UDF whose input contains complex decimal type.
Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.
### Why are the changes needed?
For better test coverage of Hive behavior, whether or not we are compatible with it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes #29863 from ulysses-you/SPARK-32877-test.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.
### How was this patch tested?
Added a new test
Closes #29866 from imback82/refresh_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This moves and refactors the parallel listing utilities from `InMemoryFileIndex` to Spark core so it can be reused by modules beside SQL. Along the process this also did some cleanups/refactorings:
- Created a `HadoopFSUtils` class under core
- Moved `InMemoryFileIndex.bulkListLeafFiles` into `HadoopFSUtils.parallelListLeafFiles`. It now depends on a `SparkContext` instead of `SparkSession` in SQL. Also added a few parameters which used to be read from `SparkSession.conf`: `ignoreMissingFiles`, `ignoreLocality`, `parallelismThreshold`, `parallelismMax ` and `filterFun` (for additional filtering support but we may be able to merge this with `filter` parameter in future).
- Moved `InMemoryFileIndex.listLeafFiles` into `HadoopFSUtils.listLeafFiles` with similar changes above.
### Why are the changes needed?
Currently the locality-aware parallel listing mechanism only applies to `InMemoryFileIndex`. By moving this to core, we can potentially reuse the same mechanism for other code paths as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Since this is mostly a refactoring, it relies on existing unit tests such as those for `InMemoryFileIndex`.
Closes #29471 from sunchao/SPARK-32381.
Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
When I tried to verify that the `resource-managers/yarn` module passes all UTs in Scala 2.13, I found an issue related to classpath order that may block the UTs, because there is more than one `servlet-api` dependency in Spark now:
- One is `javax.servlet:javax.servlet-api:3.1.0:compile`, configured in core/pom.xml,
- The other is `jakarta.servlet:jakarta.servlet-api:4.0.3:test`, cascaded by `org.glassfish.jersey.test-framework.providers`.
We can use `mvn dependency:tree` to check this.
So when we run the `resource-managers/yarn` module tests with
```
mvn clean test -pl resource-managers/yarn -Pyarn
```
or
```
mvn clean test -pl resource-managers/yarn -Pyarn -Pscala-2.13
```
and if `javax.servlet-api` comes before `jakarta.servlet-api` on the classpath, some cases fail in `YarnClusterSuite`, `YarnShuffleIntegrationSuite` and `YarnShuffleAuthSuite`.
The failure reason is as follows:
```
20/09/18 19:14:07.486 launcher-proc-1 INFO YarnClusterDriver: Exception in thread "main" java.lang.ExceptionInInitializerError
...
20/09/18 19:14:07.486 launcher-proc-1 INFO YarnClusterDriver: Caused by: java.lang.SecurityException: class "javax.servlet.http.HttpSessionIdListener"'s signer information does not match signer information of other classes in the same package
...
```
### Why are the changes needed?
Avoid UT errors caused by classpath order.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass 2.13 Build GitHub Action and do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Pscala-2.13 -am
mvn clean test -pl resource-managers/yarn -Pyarn -Pscala-2.13
```
```
Tests: succeeded 136, failed 0, canceled 1, ignored 0, pending 0
All tests passed.
```
Closes #29824 from LuciferYang/yarn-tests-deps.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The main change of this PR is to add a manual sort of `defaultConf ++ driverConf` before constructing the `--conf` options, to ensure the options have the same order in Scala 2.12 and Scala 2.13; a rough sketch of the idea follows.
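An illustrative sketch (not the exact Mesos scheduler code): sort the merged configs by key so the generated `--conf` options have a deterministic order regardless of the Scala version's map iteration order.
```scala
val defaultConf = Map("spark.executor.memory" -> "2g", "spark.app.name" -> "demo")
val driverConf = Map("spark.driver.memory" -> "1g")

val confOptions = (defaultConf ++ driverConf).toSeq
  .sortBy(_._1) // the manual sort added by this PR
  .map { case (k, v) => s"--conf $k=$v" }

confOptions.foreach(println)
```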
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl resource-managers/mesos -Pscala-2.13 -Pmesos -am
mvn test -pl resource-managers/mesos -Pscala-2.13 -Pmesos
```
**Before**
```
Tests: succeeded 106, failed 1, canceled 0, ignored 0, pending 0
*** 1 TESTS FAILED ***
```
**After**
```
Tests: succeeded 107, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes #29865 from LuciferYang/SPARK-32987-2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>