### What changes were proposed in this pull request?
Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.
### Why are the changes needed?
As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`.
Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA tests.
Closes #34059 from gengliangwang/removeDeps.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.
### Why are the changes needed?
Scala 2.12.15 improves compatibility with JDK 17 and 18:
https://github.com/scala/scala/releases/tag/v2.12.15
- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer
### Does this PR introduce _any_ user-facing change?
Yes, this is a Scala version change.
### How was this patch tested?
Pass the CIs
Closes #33999 from dongjoon-hyun/SPARK-36759.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This patch mainly proposes to add some e2e test cases in Spark for codec used by main datasources.
### Why are the changes needed?
We found there are no e2e test cases available for main datasources like Parquet and Orc. This makes it harder for developers to identify possible bugs early. We should add such tests in Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes #33912 from viirya/SPARK-36670.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #33824 from gengliangwang/auditDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
As a followup, this raises memory settings in two places that were previously missed.
### Why are the changes needed?
Raise memory for GHA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #33658 from viirya/increasing-mem-ga-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
In stage-level resource scheduling, the allocated 3rd party resources can be obtained in TaskContext using the resources() interface; however, there is no API to get how many CPUs are allocated for the task. This adds a cpus() interface to TaskContext to complement resources(). Although the task CPU requests can be obtained from the resource profile, it's more convenient to get them inside the task code without the need to pass the profile from the driver side to the executor side.
### What changes were proposed in this pull request?
Add cpus() interface in TaskContext and modify relevant code.
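As a hedged illustration of the new API (the job body below is illustrative, not from this PR), task code can now read its CPU allocation directly:
```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cpus-demo").getOrCreate()
spark.sparkContext.parallelize(1 to 10).foreachPartition { _ =>
  val ctx = TaskContext.get()
  // cpus() complements resources(): there is no need to ship the
  // ResourceProfile from the driver just to learn the task's CPU allocation.
  println(s"task cpus = ${ctx.cpus()}, resources = ${ctx.resources().keys}")
}
```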
### Why are the changes needed?
TaskContext has resources() to get the allocated 3rd party resources, but there is no API to get the CPUs allocated for the task.
### Does this PR introduce _any_ user-facing change?
Add cpus() interface for TaskContext
### How was this patch tested?
Unit tests
Closes #33385 from xwu99/taskcontext-cpus.
Lead-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Trying to adjust build memory settings and serial execution to re-enable GA.
### Why are the changes needed?
GA tests have been failing recently due to return code 137. We need to adjust build settings to make GA work.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
GA
Closes #33447 from viirya/test-ga.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to set `MaxMetaspaceSize` to `2g` because, by default, Metaspace grows without bound and keeps increasing native memory consumption. That unbounded growth causes GitHub Action flakiness. The value I observed during the `hive` module test was over 1.8G and growing.
- https://docs.oracle.com/javase/10/gctuning/other-considerations.htm#JSGCT-GUID-BFB89453-60C0-42AC-81CA-87D59B0ACE2E
> Starting with JDK 8, the permanent generation was removed and the class metadata is allocated in native memory. The amount of native memory that can be used for class metadata is by default unlimited. Use the option -XX:MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.
In addition, I increased the following memory limit to 4g consistently in two places.
```xml
- <jvmArg>-Xms2048m</jvmArg>
- <jvmArg>-Xmx2048m</jvmArg>
+ <jvmArg>-Xms4g</jvmArg>
+ <jvmArg>-Xmx4g</jvmArg>
```
```scala
- javaOptions += "-Xmx3g",
+ javaOptions ++= "-Xmx4g -XX:MaxMetaspaceSize=2g".split(" ").toSeq,
```
### Why are the changes needed?
This will reduce the flakiness in CI environment by limiting the memory usage explicitly.
When we limit it to `1g`, the Hive module fails with an `OOM` like the following.
```
java.lang.OutOfMemoryError: Metaspace
Error: Exception in thread "dispatcher-event-loop-110" java.lang.OutOfMemoryError: Metaspace
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33405 from dongjoon-hyun/SPARK-36195.
Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Kyle Bendickson <kbendickson@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is the initial work to add checksum support for shuffle. It is a piece of https://github.com/apache/spark/pull/32385, and this PR only adds checksum functionality at the shuffle writer side.
Basically, the idea is to wrap a `MutableCheckedOutputStream`* around the `FileOutputStream` while the shuffle writer generates the shuffle data. But the specific wrapping places differ among the shuffle writers due to their different implementations:
* `BypassMergeSortShuffleWriter` - wraps each partition file
* `UnsafeShuffleWriter` - wraps each spill file directly, since spills don't require aggregation or sorting
* `SortShuffleWriter` - wraps the `ShufflePartitionPairsWriter` after spill files are merged, since merging might require aggregation or sorting
\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime.
And we use `Adler32`, which is similar to CRC-32 but much faster, to calculate the checksum, the same as `Broadcast`'s checksum.
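A minimal sketch of the idea (class and method names are illustrative, not Spark's exact implementation): a `CheckedOutputStream`-style wrapper whose checksum calculator can be swapped at runtime.
```scala
import java.io.OutputStream
import java.util.zip.{Adler32, Checksum}

// Updates a (replaceable) Checksum with every byte before passing it through.
class MutableCheckedOutputStreamSketch(out: OutputStream) extends OutputStream {
  private var checksum: Checksum = new Adler32 // default calculator

  def setChecksum(c: Checksum): Unit = { checksum = c } // swappable at runtime
  def getValue: Long = checksum.getValue

  override def write(b: Int): Unit = {
    checksum.update(b)
    out.write(b)
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    checksum.update(b, off, len)
    out.write(b, off, len)
  }
}
```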
### Why are the changes needed?
Shuffle data corruption is otherwise hard to detect and diagnose; writer-side checksums lay the groundwork for the broader corruption-diagnosis work in https://github.com/apache/spark/pull/32385.
### Does this PR introduce _any_ user-facing change?
Yes, added a new conf: `spark.shuffle.checksum`.
### How was this patch tested?
Added unit tests.
Closes #32401 from Ngone51/add-checksum-files.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR upgrades `GenJavadoc` plugin from `0.17` to `0.18`.
### Why are the changes needed?
`0.18` includes a bug fix for `Scala 2.13`.
```
This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.
```
https://github.com/lightbend/genjavadoc/releases/tag/v0.18
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Built the doc for Scala 2.13.
```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```
Closes #33383 from sarutak/upgrade-genjavadoc-0.18.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR upgrades SBT to `1.5.5`.
### Why are the changes needed?
SBT `1.5.5` was released, which includes 16 improvements/bug fixes.
https://github.com/sbt/sbt/releases/tag/v1.5.5
* Fixes remote caching not managing resource files
* Fixes launcher causing NoClassDefFoundError when launching sbt 1.4.0 - 1.4.2
* Fixes cross-Scala suffix conflict warning involving _3
* Fixes binaryScalaVersion of 3.0.1-SNAPSHOT
* Fixes carriage return in supershell progress state
* Fixes IntegrationTest configuration not tagged as test in BSP
* Fixes BSP task error handling
* Fixes handling of invalid range positions returned by Javac
* Fixes local class analysis
* Adds buildTarget/resources support for BSP
* Adds build.sbt support for BSP import
* Tracks source dependencies using OriginalTreeAttachments in Scala 2.13
* Reduces overhead in Analysis protobuf deserialization
* Minimizes unnecessary information in signature analysis
* Enables compile-to-jar for local Javac
* Enables Zinc cycle reporting when Scalac is not invoked
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #33312 from sarutak/upgrade-sbt-1.5.5.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update MiMa based on Apache Spark 3.1.1 (the first release on 3.1 line) for Apache Spark 3.2.0 release.
### Why are the changes needed?
Old MiMa rules hide the breaking changes in Apache Spark 3.2.0. We need to audit and document them correctly in the MiMa exclusion file. This issue was originally discussed here.
- https://github.com/apache/spark/pull/33196#issuecomment-873249068
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the CIs
Closes #33199 from dongjoon-hyun/SPARK-36004.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/31019, which forgot to update the SBT side to match.
### Why are the changes needed?
To use the same version in both Maven and SBT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should test them.
Closes #33207 from HyukjinKwon/SPARK-33996.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to upgrade sbt-antlr4 from 0.8.2 to 0.8.3 per the guides at https://github.com/ihji/sbt-antlr4
I can't find proper official docs for this.
### Why are the changes needed?
To stick to the guides in https://github.com/ihji/sbt-antlr4, and leverage the fixes included.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI in this PR should test it out.
Closes #33208 from HyukjinKwon/SPARK-36010.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.3.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.3.0 and the published snapshot version should not conflict with `branch-3.2`.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes #33196 from dongjoon-hyun/SPARK-35996.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR removes sbt-avro plugin dependency.
In the current master, the SBT build depends on the plugin, but it seems to be unused.
Originally, the plugin was introduced for `flume-sink` in SPARK-1729 (#807) but `flume-sink` is no longer in Spark repository.
After SBT was upgraded to 1.x in SPARK-21708 (#29286), `avroGenerate` part was introduced in `object SQL` in `SparkBuild.scala`.
It's confusing, but I understand `Test / avroGenerate := (Compile / avroGenerate).value` is for suppressing sbt-avro for the `sql` sub-module.
In fact, Test/compile will fail if `Test / avroGenerate := (Compile / avroGenerate).value` is commented out.
`sql` sub-module contains `parquet-compat.avpr` and `parquet-compat.avdl` but according to `sql/core/src/test/README.md`, they are intended to be handled by `gen-avro.sh`.
Also, in terms of Maven build, there seems to be no definition to handle `*.avpr` or `*.avdl`.
Based on the above, I think we can remove `sbt-avro`.
### Why are the changes needed?
If `sbt-avro` is really no longer used, it's confusing that `sbt-avro` related configurations remain in `SparkBuild.scala`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #33190 from sarutak/remove-avro-from-sbt.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR changes `SparkBuild.scala` to edit `config.properties` in the `yarn` sub-module in the SBT build, as the Maven build does.
### Why are the changes needed?
The `yarn` sub-module contains `config.properties`.
```
spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
```
The `${spark.yarn.isHadoopProvided}` part is replaced with `true` or `false` at build time depending on whether Hadoop is provided or not (specified by `-Phadoop-provided`).
The edited config.properties will be loaded at runtime to control how to populate Hadoop-related classpath.
This process works when building with Maven, but not with SBT.
If we build with SBT and deploy apps on YARN, the following warning appears and classpath is not populated correctly.
```
21/06/29 10:51:20 WARN config.package: Can not load the default value of `spark.yarn.isHadoopProvided` from `org/apache/spark/deploy/yarn/config.properties` with error, java.lang.IllegalArgumentException: For input string: "${spark.yarn.isHadoopProvided}". Using `false` as a default value.
```
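A hedged sketch of the kind of SBT change involved (the task wiring and property lookup are illustrative, not the exact `SparkBuild.scala` code): a resource generator that substitutes the Maven-style placeholder while copying `config.properties`.
```scala
// Sketch: replace ${spark.yarn.isHadoopProvided} the way Maven's
// resource filtering would, then emit the edited file as a managed resource.
Compile / resourceGenerators += Def.task {
  val in  = (Compile / resourceDirectory).value / "org" / "apache" / "spark" /
    "deploy" / "yarn" / "config.properties"
  val out = (Compile / resourceManaged).value / "org" / "apache" / "spark" /
    "deploy" / "yarn" / "config.properties"
  // "true" when the build enables the hadoop-provided profile (illustrative lookup).
  val isHadoopProvided = sys.props.getOrElse("spark.yarn.isHadoopProvided", "false")
  IO.write(out, IO.read(in).replace("${spark.yarn.isHadoopProvided}", isHadoopProvided))
  Seq(out)
}.taskValue
```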
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Built with SBT, extracted `config.properties` from the build artifact, and confirmed `${spark.yarn.isHadoopProvided}` was correctly replaced with `true` or `false`.
```
cat org/apache/spark/deploy/yarn/config.properties
spark.yarn.isHadoopProvided = false # In case build with -Pyarn and without -Phadoop-provided
spark.yarn.isHadoopProvided = true # In case build with -Pyarn and -Phadoop-provided
```
I also confirmed the warning message shown above no longer appears.
Closes #33121 from sarutak/sbt-yarn-config-properties.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 9.1
### Why are the changes needed?
The latest `xbean-asm9-shaded` is built with ASM 9.1.
- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33130 from dongjoon-hyun/SPARK-35928.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `sbt-mima-plugin` to 0.9.2 for Apache Spark 3.2.0.
### Why are the changes needed?
`sbt-mima-plugin` 0.9.2 has the following updates including `Scala 3 initial support`.
- https://github.com/lightbend/mima/releases/tag/0.9.2
- https://github.com/lightbend/mima/releases/tag/0.9.1
- https://github.com/lightbend/mima/releases/tag/0.9.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs. Also, I manually deleted some lines from the MiMa exclusion file and verified that it's detected correctly.
Closes #32981 from dongjoon-hyun/SPARK-35830.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to run `KubernetesLocalDiskShuffleDataIOSuite` on a dedicated JVM.
### Why are the changes needed?
In Jenkins environment, `KubernetesLocalDiskShuffleDataIOSuite` and `ExternalShuffleServiceSuite` currently hit issues.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140019/
![Screen Shot 2021-06-19 at 10 33 20 AM](https://user-images.githubusercontent.com/9700541/122650832-d9810200-d0e9-11eb-9f2a-4fb44bb874f3.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins.
Closes #32976 from dongjoon-hyun/SPARK-35593-3.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.4.
### Why are the changes needed?
SBT 1.5.4 was released 5 days ago.
- https://github.com/sbt/sbt/releases/tag/v1.5.4
This will bring the latest bug fixes like the following.
- Fixes BSP on ARM Macs by keeping JNI server socket to keep using JNI
- Fixes compiler ClassLoader list to use compilerJars.toList (For Scala 3, this drops support for 3.0.0-M2)
- Fixes undercompilation of package object causing "Symbol 'type X' is missing from the classpath"
- Fixes overcompilation with scalac -release flag
- Fixes build/exit notification not closing BSP channel
- Fixes POM file's Maven repository ID character restriction to match that of Maven
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32966 from dongjoon-hyun/SPARK-35818.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This change is for [SPARK-35757](https://issues.apache.org/jira/browse/SPARK-35757) and does the following:
1. adds a bitwise AND operation to `BitArray` (similar to the existing `putAll` method)
2. adds an intersect operation for combining bloom filters using the bitwise AND operation (similar to the existing `mergeInPlace` method).
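A minimal sketch of the idea under simplified types (not Spark's exact `BitArray`/`BloomFilterImpl` code): intersecting two filters ANDs their bit words, so an item still tests positive only if it was (probably) inserted into both.
```scala
class SimpleBitArray(val words: Array[Long]) {
  // Bitwise AND counterpart of putAll's bitwise OR.
  def and(other: SimpleBitArray): Unit = {
    require(words.length == other.words.length, "bit arrays must be the same size")
    var i = 0
    while (i < words.length) {
      words(i) &= other.words(i)
      i += 1
    }
  }
}
```
Like `mergeInPlace`, an intersect built on this must first check that both filters share the same bit size and number of hash functions; otherwise the bit positions are not comparable.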
### Why are the changes needed?
The current bloom filter library only allows combining two bloom filters using an OR operation. It is useful to have an AND operation as well.
### Does this PR introduce _any_ user-facing change?
No, just adds new methods.
### How was this patch tested?
Just the existing tests.
Closes #32907 from kudhru/master.
Lead-authored-by: kudhru <gargdhruv36@gmail.com>
Co-authored-by: Dhruv Kumar <kudhru@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to upgrade SBT to 1.5.3.
### Why are the changes needed?
This release seems to include a bug fix for Scala 2.13.6+ and Scala 3.
https://github.com/sbt/sbt/releases/tag/v1.5.3
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #32792 from sarutak/upgrade-sbt-1.5.3.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There have been several PRs to fix compilation warnings related to `procedure syntax`, like SPARK-29291, SPARK-33352 and SPARK-35526. In order to prevent the recurrence of similar problems, this PR adds a compile arg that converts `procedure syntax` related compilation warnings into compilation errors in Scala 2.13.
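A hedged sketch of the kind of compile arg involved (the exact filter string in `SparkBuild.scala` may differ): Scala 2.13's `-Wconf` can escalate a specific warning message to an error.
```scala
// Added to the Scala 2.13 scalacOptions (sketch): any "procedure syntax is
// deprecated" warning now fails the compilation instead of just warning.
Compile / scalacOptions ++= Seq(
  "-Wconf:cat=deprecation&msg=procedure syntax is deprecated:e"
)
```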
### Why are the changes needed?
Prevent the recurrence of compilation warnings related to `procedure syntax is deprecated`
### Does this PR introduce _any_ user-facing change?
`procedure syntax` is no longer allowed in Spark code with Scala 2.13: constructors must be defined as `this(...) = { }`, not `this(...) { }`, and methods without an explicit return type must be defined as `def methodName(...): Unit = {}`, not `def methodName(...) {}`.
### How was this patch tested?
- Pass the GitHub Action Scala 2.13 job
- Manual test:
Make a code change like:
```
Index: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
===================================================================
@@ -67,7 +67,7 @@
private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
extends SparkListener with ThreadSafeRpcEndpoint with Logging {
- def this(sc: SparkContext) = {
+ def this(sc: SparkContext) {
this(sc, new SystemClock)
}
Index: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
===================================================================
@@ -720,7 +720,7 @@
}
}
- def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus): Unit = {
+ def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) {
shuffleStatuses(shuffleId).addMergeResult(reduceId, status)
}
```
**sbt with Scala 2.13 profile compile failed as follows:**
```
[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70:29: procedure syntax is deprecated for constructors: add `=`, as in method definition
[error] def this(sc: SparkContext) {
[error] ^
[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723:79: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[error] def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) {
[error] ^
[error] two errors found
[error] (core / Compile / compileIncremental) Compilation failed
[error] Total time: 136 s (02:16), completed May 31, 2021 10:06:50 AM
Error: Process completed with exit code 1.
```
**maven with Scala 2.13 profile compile failed as follows:**
```
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[ERROR] two errors found
```
Closes #32710 from LuciferYang/SPARK-35574.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 7.3.1.
- https://issues.apache.org/jira/browse/XBEAN-323
- https://asm.ow2.io/versions.html
### Why are the changes needed?
ASM 7.3.1 brings the following changes:
- new V15 constant
- experimental support for PermittedSubtypes and RecordComponent
- bug fixes
  - 317885: SKIP_DEBUG now skips MethodParameters attributes
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran with the existing UTs
Closes #32634 from vinodkc/br_build_upgrade_asm.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.2 for better Scala 2.13.x support.
### Why are the changes needed?
SBT 1.5.2 Release Note: https://github.com/sbt/sbt/releases/tag/v1.5.2
- Fixes ConcurrentModificationException while compiling Scala 2.13.4 and Java sources zinc
- Uses -Duser.home instead of $HOME to download launcher JAR
- Fixes -client by making it the same as --client
- Fixes metabuild ClassLoader missing util-interface
- Fixes sbt new leaving behind target directory
- Fixes "zip END header not found" error during pushRemoteCache
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32565 from dongjoon-hyun/SPARK-35417.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades `GenJavadoc` to `0.17`.
### Why are the changes needed?
This version seems to include a fix for an issue which can happen with Scala 2.13.5.
https://github.com/lightbend/genjavadoc/releases/tag/v0.17
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed build succeed with the following commands.
```
# For Scala 2.12
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc
# For Scala 2.13
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```
Closes #32392 from sarutak/upgrade-genjavadoc-0.17.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.1.
### Why are the changes needed?
https://github.com/sbt/sbt/releases/tag/v1.5.1
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the SBT CIs (Build/Test/Docs/Plugins).
Closes #32382 from lipzhu/SPARK-35254.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`WritablePartitionedIterator` is defined in `WritablePartitionedPairCollection.scala`, and there are two implementations of this trait, but the code of the two implementations is duplicated.
The main change of this PR is to turn `WritablePartitionedIterator` from a trait into a concrete class, because there is effectively only one implementation now.
### Why are the changes needed?
Cleanup duplicate code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #32232 from LuciferYang/writable-partitioned-iterator.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
### What changes were proposed in this pull request?
This PR proposes a change that allows us to build SparkR with SBT.
### Why are the changes needed?
In the current master, SparkR can be built only with Maven.
It's helpful if we can also build it with SBT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
```
build/sbt -Psparkr package
```
Closes #32285 from sarutak/sbt-sparkr.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The following logs are printed when Jenkins executes [PySpark pip packaging tests](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137500/console):
```
copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```
Two different versions of netty4 jars are copied to the jars directory, even though the `netty-xxx-4.1.50.Final.jar` artifacts do not appear in Maven's `dependency:tree` and Spark only needs to rely on `netty-all-xxx.jar`.
So this PR adds new `ExclusionRule`s to `SparkBuild.scala` to exclude the unnecessary netty 4 dependencies.
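A hedged sketch of the kind of rule involved (the module list and wiring are illustrative, not the exact `SparkBuild.scala` change): sbt `ExclusionRule`s drop the per-module netty 4 artifacts so that only `netty-all` remains.
```scala
// Sketch: exclude the split netty modules pulled in transitively,
// keeping only netty-all on the classpath.
libraryDependencies ~= (_.map(_.excludeAll(
  ExclusionRule("io.netty", "netty-buffer"),
  ExclusionRule("io.netty", "netty-codec"),
  ExclusionRule("io.netty", "netty-common"),
  ExclusionRule("io.netty", "netty-handler"),
  ExclusionRule("io.netty", "netty-resolver"),
  ExclusionRule("io.netty", "netty-transport")
)))
```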
### Why are the changes needed?
Make sure that only `netty-all-xxx.jar` is used in the test to avoid possible jar conflicts.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Check Jenkins log manually, there should be only
`copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars`
and there should be no such logs as
```
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```
Closes #32230 from LuciferYang/SPARK-35134.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch introduces a VectorizedBLAS class which implements hardware-accelerated BLAS operations using the Vector API. This feature is hidden behind the "vectorized" profile that you can enable by passing "-Pvectorized" to sbt or maven.
The Vector API was introduced (as an incubator module) in JDK 16. Following discussion on the mailing list, this API is introduced transparently and needs to be enabled explicitly.
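A hedged sketch of the technique (method shapes are illustrative, not the actual `VectorizedBLAS` code; requires JDK 16+ with `--add-modules=jdk.incubator.vector`): `daxpy` computes `y := alpha * x + y` over SIMD lanes with a scalar tail loop.
```scala
import jdk.incubator.vector.{DoubleVector, VectorSpecies}

object DaxpySketch {
  private val SPECIES: VectorSpecies[java.lang.Double] = DoubleVector.SPECIES_PREFERRED

  def daxpy(n: Int, alpha: Double, x: Array[Double], y: Array[Double]): Unit = {
    var i = 0
    val bound = SPECIES.loopBound(n)
    while (i < bound) {                     // vectorized main loop
      val vx = DoubleVector.fromArray(SPECIES, x, i)
      val vy = DoubleVector.fromArray(SPECIES, y, i)
      vx.mul(alpha).add(vy).intoArray(y, i) // y = alpha * x + y
      i += SPECIES.length()
    }
    while (i < n) {                         // scalar tail for leftover lanes
      y(i) += alpha * x(i)
      i += 1
    }
  }
}
```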
### Why are the changes needed?
Whenever a native BLAS implementation isn't available on the system, Spark automatically falls back onto a Java implementation. With the recent release of the Vector API in the OpenJDK [1], we can use hardware acceleration for such operations.
This change was also discussed on the mailing list. [2]
### Does this PR introduce _any_ user-facing change?
It introduces a build-time profile called `vectorized`. You can pass it to sbt and mvn with `-Pvectorized`. There is no change for end-users of Spark; it should only impact Spark developers. It is also disabled by default.
### How was this patch tested?
It passes `build/sbt mllib-local/test` with and without `-Pvectorized` with JDK 16. This patch also introduces benchmarks for BLAS.
The benchmark results are as follows:
```
[info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 37 37 0 271.5 3.7 1.0X
[info] vector 24 25 4 416.1 2.4 1.5X
[info]
[info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 70 70 0 143.2 7.0 1.0X
[info] vector 35 35 2 288.7 3.5 2.0X
[info]
[info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 50 51 1 199.8 5.0 1.0X
[info] vector 15 15 0 648.7 1.5 3.2X
[info]
[info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 34 34 0 295.6 3.4 1.0X
[info] vector 19 19 0 531.2 1.9 1.8X
[info]
[info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 25 25 1 399.0 2.5 1.0X
[info] vector 8 9 1 1177.3 0.8 3.0X
[info]
[info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 27 27 0 0.0 26651.5 1.0X
[info] vector 21 21 0 0.0 20646.3 1.3X
[info]
[info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 36 36 0 0.0 35501.4 1.0X
[info] vector 22 22 0 0.0 21930.3 1.6X
[info]
[info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 20 20 0 0.0 20283.3 1.0X
[info] vector 9 9 0 0.1 8657.7 2.3X
[info]
[info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 30 30 0 0.0 29845.8 1.0X
[info] vector 10 10 1 0.1 9695.4 3.1X
[info]
[info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 182 182 0 0.5 1820.0 1.0X
[info] vector 160 160 1 0.6 1597.6 1.1X
[info]
[info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 211 211 1 0.5 2106.2 1.0X
[info] vector 156 157 0 0.6 1564.4 1.3X
[info]
[info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 276 276 0 0.4 2757.8 1.0X
[info] vector 137 137 0 0.7 1365.1 2.0X
```
/cc srowen xkrogen
[1] https://openjdk.java.net/jeps/338
[2] https://mail-archives.apache.org/mod_mbox/spark-dev/202012.mbox/%3cDM5PR2101MB11106162BB3AF32AD29C6C79B0C69DM5PR2101MB1110.namprd21.prod.outlook.com%3e
Closes #30810 from luhenry/master.
Lead-authored-by: Ludovic Henry <luhenry@microsoft.com>
Co-authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
SBT 1.5.0 deprecates the `in` syntax from 0.13.x, so adjusting the build files is recommended.
See https://www.scala-sbt.org/1.x/docs/Migrating-from-sbt-013x.html#Migrating+to+slash+syntax
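For illustration (the keys are generic, not Spark's actual settings), the migration looks like this:
```scala
// sbt 0.13-style `in` scoping, deprecated since sbt 1.5.0:
//   parallelExecution in Test := false
//   scalacOptions in (Compile, console) += "-deprecation"
// Equivalent slash syntax:
Test / parallelExecution := false
Compile / console / scalacOptions += "-deprecation"
```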
### Why are the changes needed?
Removes a significant amount of deprecation warnings and prepares for syntax removal in future versions of SBT.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build should pass on GH Actions.
Closes #32115 from gemelen/feature/sbt-1.5-fixes.
Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.0.
### Why are the changes needed?
SBT 1.5.0 was released yesterday with built-in Scala 3 support.
- https://github.com/sbt/sbt/releases/tag/v1.5.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the SBT CIs (Build/Test/Docs/Plugins).
Closes #32055 from dongjoon-hyun/SPARK-34959.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is to support nested column type in Spark ORC vectorized reader. Currently ORC vectorized reader [does not support nested column type (struct, array and map)](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138). We implemented nested column vectorized reader for FB-ORC in our internal fork of Spark. We are seeing performance improvement compared to non-vectorized reader when reading nested columns. In addition, this can also help improve the non-nested column performance when reading non-nested and nested columns together in one query.
Before this PR:
* `OrcColumnVector` is the implementation class for Spark's `ColumnVector` to wrap Hive's/ORC's `ColumnVector` to read `AtomicType` data.
After this PR:
* `OrcColumnVector` is an abstract class to keep interface being shared between multiple implementation class of orc column vectors, namely `OrcAtomicColumnVector` (for `AtomicType`), `OrcArrayColumnVector` (for `ArrayType`), `OrcMapColumnVector` (for `MapType`), `OrcStructColumnVector` (for `StructType`). So the original logic to read `AtomicType` data is moved from `OrcColumnVector` to `OrcAtomicColumnVector`. The abstract class of `OrcColumnVector` is needed here because of supporting nested column (i.e. nested column vectors).
* A utility method `OrcColumnVectorUtils.toOrcColumnVector` is added to create Spark's `OrcColumnVector` from Hive's/ORC's `ColumnVector`.
* A new user-facing config `spark.sql.orc.enableNestedColumnVectorizedReader` is added to control enabling/disabling the vectorized reader for nested columns. The default value is false (i.e. disabled by default). For certain tables having deeply nested columns, the vectorized reader might take too much memory for the per-sub-column vectors compared to the non-vectorized reader, so the config provides a way to work around OOM for queries reading wide and deeply nested columns. We plan to enable it by default in 3.3 and leave it disabled in 3.2 in case of any unknown bugs. A usage sketch follows this list.
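A hedged usage sketch (the path is illustrative; the config name comes from this PR):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-nested-demo").getOrCreate()
// Off by default in 3.2; opt in to vectorized reads of struct/array/map columns.
spark.conf.set("spark.sql.orc.enableNestedColumnVectorizedReader", "true")
spark.read.orc("/path/to/nested.orc").show()
```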
### Why are the changes needed?
Improve query performance when reading nested columns from ORC file format.
Tested by locally adding a small benchmark in `OrcReadBenchmark.scala`. Seeing a more than 2x run time improvement.
```
Running benchmark: SQL Nested Column Scan
Running case: Native ORC MR
Stopped after 2 iterations, 37850 ms
Running case: Native ORC Vectorized (Enabled Nested Column)
Stopped after 2 iterations, 15892 ms
Running case: Native ORC Vectorized (Disabled Nested Column)
Stopped after 2 iterations, 37954 ms
Running case: Hive built-in ORC
Stopped after 2 iterations, 35118 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
SQL Nested Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------
Native ORC MR 18706 18925 310 0.1 17839.6 1.0X
Native ORC Vectorized (Enabled Nested Column) 7625 7946 455 0.1 7271.6 2.5X
Native ORC Vectorized (Disabled Nested Column) 18415 18977 796 0.1 17561.5 1.0X
Hive built-in ORC 17469 17559 127 0.1 16660.1 1.1X
```
Benchmark:
```
nestedColumnScanBenchmark(1024 * 1024)
def nestedColumnScanBenchmark(values: Int): Unit = {
val benchmark = new Benchmark(s"SQL Nested Column Scan", values, output = output)
withTempPath { dir =>
withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
import spark.implicits._
spark.range(values).map(_ => Random.nextLong).map { x =>
val arrayOfStructColumn = (0 until 5).map(i => (x + i, s"$x" * 5))
val mapOfStructColumn = Map(
s"$x" -> (x * 0.1, (x, s"$x" * 100)),
(s"$x" * 2) -> (x * 0.2, (x, s"$x" * 200)),
(s"$x" * 3) -> (x * 0.3, (x, s"$x" * 300)))
(arrayOfStructColumn, mapOfStructColumn)
}.toDF("col1", "col2")
.createOrReplaceTempView("t1")
prepareTable(dir, spark.sql(s"SELECT * FROM t1"))
benchmark.addCase("Native ORC MR") { _ =>
withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false") {
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
}
benchmark.addCase("Native ORC Vectorized (Enabled Nested Column)") { _ =>
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
benchmark.addCase("Native ORC Vectorized (Disabled Nested Column)") { _ =>
withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "false") {
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
}
benchmark.addCase("Hive built-in ORC") { _ =>
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM hiveOrcTable").noop()
}
benchmark.run()
}
}
}
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added one simple test in `OrcSourceSuite.scala` to verify correctness.
We definitely need more unit tests and to add the benchmark here, but I want to collect feedback first before crafting more tests.
Closes #31958 from c21/orc-vector.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Update the Avro version to 1.10.2
### Why are the changes needed?
To stay up to date with upstream and catch compatibility issues with zstd
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes #31866 from iemejia/SPARK-27733-upgrade-avro-1.10.2.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.7 to 1.4.9.
### Why are the changes needed?
This will bring the following bug fixes and improvements.
- https://github.com/sbt/sbt/releases/tag/v1.4.9
- https://github.com/sbt/sbt/releases/tag/v1.4.8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #31828 from williamhyun/sbt149.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes a new feature that allows developers to debug test code using JDWP with sbt and Maven.
More specifically, this PR introduces the following profile and options.
* `jdwp-test-debug`: A profile which enables/disables JDWP debugging
* `test.jdwp.address`: An option which corresponds to `address` option in JDWP
* `test.jdwp.suspend`: An option which corresponds to `suspend` option in JDWP
* `test.jdwp.server`: An option which corresponds to `server` option in JDWP
* `test.debug.suite`: An option which controls whether to debug ScalaTest suites (Maven only)
For `sbt`, this feature can be used like `build/sbt -Pjdwp-test-debug -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y` and can be used for both JUnit tests and ScalaTest tests.
For `Maven`, this feature can be used like as follows:
(For JUnit tests) `build/mvn -Pjdwp-test-debug -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y`
(For ScalaTest suites) `build/mvn -Pjdwp-test-debug -Dtest.debug.suite=true -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y` (It might be useful to specify specific sub-modules like `-pl sql/core,sql/catalyst`).
### Why are the changes needed?
It's useful to debug test code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed the following things.
* `jdwp-test-debug` can switch JDWP enabled/disabled
* `test.jdwp.address` can change address and port.
* `test.jdwp.suspend` can change the behavior that the target debugee suspends or not.
* `test.jdwp.server` can change the behavior that the JDWP debugger run as a server or client.
* ScalaTest suites can be debugged with Maven with setting `test.debug.suite` to `true`.
Closes #31706 from sarutak/sbt-jdwp.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. This PR aims to ignore ORC encryption tests when the ORC shim has been loaded with old Hadoop libraries by some other tests. The test coverage is preserved by Jenkins SBT runs and GitHub Action jobs. This PR only aims to recover the Maven Jenkins jobs.
2. In addition, this PR simplifies SBT testing by refactoring the test config into `SparkBuild.scala`/`pom.xml` and removing `DedicatedJVMTest`. This will remove one GitHub Action job which was recently added for the `DedicatedJVMTest` tag.
### Why are the changes needed?
Currently, Maven test fails when it runs in a batch mode because `HadoopShimsPre2_3$NullKeyProvider` is loaded.
**MVN COMMAND**
```
$ mvn test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite,org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite
```
**BEFORE**
```
- Write and read an encrypted table *** FAILED ***
...
Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii
at org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider.getCurrentKeyVersion(HadoopShimsPre2_3.java:71)
at org.apache.orc.impl.WriterImpl.getKey(WriterImpl.java:871)
```
**AFTER**
```
OrcV1QuerySuite
...
OrcEncryptionSuite:
- Write and read an encrypted file !!! CANCELED !!!
[] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider1b705f65 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:39)
- Write and read an encrypted table !!! CANCELED !!!
[] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider22adeee1 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:67)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins Maven tests.
For SBT command,
- the test suite required a dedicated JVM (Before)
- the test suite doesn't require a dedicated JVM (After)
```
$ build/sbt "sql/testOnly *.OrcV1QuerySuite *.OrcEncryptionSuite"
...
[info] OrcV1QuerySuite
...
[info] - SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core (26 milliseconds)
[info] OrcEncryptionSuite:
[info] - Write and read an encrypted file (431 milliseconds)
[info] - Write and read an encrypted table (359 milliseconds)
[info] All tests passed.
[info] Passed: Total 35, Failed 0, Errors 0, Passed 35
```
Closes #31697 from dongjoon-hyun/SPARK-34578-TEST.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a retry of #31065. Last time, the newly added test cases passed in Jenkins and individually, but the PR was reverted because they failed when `GitHub Action` ran with `SERIAL_SBT_TESTS=1`.
In this PR, `SecurityTest` tag is used to isolate `KeyProvider`.
This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`.
Please note that we will improve this further in both Apache Spark and Apache ORC in the Apache Spark 3.2.0 timeframe.
### Why are the changes needed?
Apache ORC 1.6 supports columnar encryption.
### Does this PR introduce _any_ user-facing change?
No. This is for a test case.
### How was this patch tested?
Pass the newly added test suite.
Closes #31603 from dongjoon-hyun/SPARK-34486-RETRY.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to update the sbt version to 1.4.7.
### Why are the changes needed?
This will bring the latest bug fixes and improvements.
- https://github.com/sbt/sbt/releases/tag/v1.4.7
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #31555 from williamhyun/sbt147.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983:
- Remove the API tag `Unstable` for `HiveSessionStateBuilder`
- Add documentation for the `spark.sql.hive` package to emphasize that it's a private package
### Why are the changes needed?
Follow the rule for a private package.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #31321 from xuanyuanking/SPARK-34185-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update Avro dependency to version 1.10.1
### Why are the changes needed?
To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Since no API changes were required, we just ran the existing tests.
Closes #31232 from iemejia/SPARK-27733-avro-upgrade.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is the 3rd attempt to upgrade Scala 2.12.x, in order to assess feasibility.
- https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum )
- https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya )
`silencer` library is updated accordingly. And, Kafka version upgrade is required because it fails like the following.
```
[info] KafkaDataConsumerSuite:
[info] org.apache.spark.streaming.kafka010.KafkaDataConsumerSuite *** ABORTED *** (1 second, 580 milliseconds)
[info] java.lang.NoClassDefFoundError: scala/math/Ordering$$anon$7
[info] at kafka.api.ApiVersion$.orderingByVersion(ApiVersion.scala:45)
```
### Why are the changes needed?
Apache Spark was stuck at 2.12.10 due to regressions in Scala 2.12.11 and 2.12.12. This will bring all the bug fixes.
- https://github.com/scala/scala/releases/tag/v2.12.13
- https://github.com/scala/scala/releases/tag/v2.12.12
- https://github.com/scala/scala/releases/tag/v2.12.11
### Does this PR introduce _any_ user-facing change?
Yes, but this is a bug-fixed version.
### How was this patch tested?
Pass the CIs.
Closes #31223 from dongjoon-hyun/SPARK-31168.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to fix `MiMaExcludes` rule by moving SPARK-23429 from 2.4 to 3.0.
### Why are the changes needed?
SPARK-23429 was added at Apache Spark 3.0.0.
This should land on `master`, `branch-3.1`, and `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the MiMa rule.
Closes #31174 from dongjoon-hyun/SPARK-34103.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>