### What changes were proposed in this pull request?
Remove `com.github.rdblue:brotli-codec:0.1.1` dependency.
### Why are the changes needed?
As Stephen Coy pointed out in the dev list, we should not have `com.github.rdblue:brotli-codec:0.1.1` dependency which is not available on Maven Central. This is to avoid possible artifact changes on `Jitpack.io`.
Also, the dependency is for tests only. I suggest that we remove it now to unblock the 3.2.0 release ASAP.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA tests.
Closes #34059 from gengliangwang/removeDeps.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.
### Why are the changes needed?
Scala 2.12.15 improves compatibility with JDK 17 and 18:
https://github.com/scala/scala/releases/tag/v2.12.15
- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer
### Does this PR introduce _any_ user-facing change?
Yes, this is a Scala version change.
### How was this patch tested?
Pass the CIs
Closes #33999 from dongjoon-hyun/SPARK-36759.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This patch mainly proposes to add some e2e test cases in Spark for codec used by main datasources.
### Why are the changes needed?
We found there are no e2e test cases available for main datasources like Parquet and Orc. This makes it harder for developers to identify possible bugs early. We should add such tests in Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes #33912 from viirya/SPARK-36670.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #33824 from gengliangwang/auditDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
As a followup, this raises memory settings in two places that were previously missed.
### Why are the changes needed?
Raise memory for GHA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #33658 from viirya/increasing-mem-ga-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
In stage-level resource scheduling, the allocated 3rd party resources can be obtained in TaskContext using the resources() interface; however, there is no API to get how many CPUs are allocated for the task. This adds a cpus() interface to TaskContext to complement resources(). Although the task CPU requests can be obtained from the resource profile, it's more convenient to get them inside the task code without the need to pass the profile from the driver side to the executor side.
### What changes were proposed in this pull request?
Add cpus() interface in TaskContext and modify relevant code.
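As a hedged illustration of the new API (the job body below is illustrative, not from this PR), task code can now read its CPU allocation directly:
```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cpus-demo").getOrCreate()
spark.sparkContext.parallelize(1 to 10).foreachPartition { _ =>
  val ctx = TaskContext.get()
  // cpus() complements resources(): there is no need to ship the
  // ResourceProfile from the driver just to learn the task's CPU allocation.
  println(s"task cpus = ${ctx.cpus()}, resources = ${ctx.resources().keys}")
}
```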
### Why are the changes needed?
TaskContext has resources() to get the allocated 3rd party resources, but there is no API to get the CPUs allocated for the task.
### Does this PR introduce _any_ user-facing change?
Add cpus() interface for TaskContext
### How was this patch tested?
Unit tests
Closes #33385 from xwu99/taskcontext-cpus.
Lead-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Trying to adjust build memory settings and serial execution to re-enable GA.
### Why are the changes needed?
GA tests have been failing recently due to return code 137. We need to adjust build settings to make GA work.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
GA
Closes #33447 from viirya/test-ga.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to set `MaxMetaspaceSize` to `2g` because, by default, Metaspace grows without bound and keeps increasing native memory consumption. That unbounded growth causes GitHub Action flakiness. The value I observed during the `hive` module test was over 1.8G and growing.
- https://docs.oracle.com/javase/10/gctuning/other-considerations.htm#JSGCT-GUID-BFB89453-60C0-42AC-81CA-87D59B0ACE2E
> Starting with JDK 8, the permanent generation was removed and the class metadata is allocated in native memory. The amount of native memory that can be used for class metadata is by default unlimited. Use the option -XX:MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.
In addition, I increased the following memory limit to 4g consistently in two places.
```xml
- <jvmArg>-Xms2048m</jvmArg>
- <jvmArg>-Xmx2048m</jvmArg>
+ <jvmArg>-Xms4g</jvmArg>
+ <jvmArg>-Xmx4g</jvmArg>
```
```scala
- javaOptions += "-Xmx3g",
+ javaOptions ++= "-Xmx4g -XX:MaxMetaspaceSize=2g".split(" ").toSeq,
```
### Why are the changes needed?
This will reduce the flakiness in CI environment by limiting the memory usage explicitly.
When we limit it to `1g`, the Hive module fails with an `OOM` like the following.
```
java.lang.OutOfMemoryError: Metaspace
Error: Exception in thread "dispatcher-event-loop-110" java.lang.OutOfMemoryError: Metaspace
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33405 from dongjoon-hyun/SPARK-36195.
Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Kyle Bendickson <kbendickson@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is the initial work to add checksum support for shuffle. It is a piece of https://github.com/apache/spark/pull/32385, and this PR only adds checksum functionality at the shuffle writer side.
Basically, the idea is to wrap a `MutableCheckedOutputStream`* around the `FileOutputStream` while the shuffle writer generates the shuffle data. But the specific wrapping places differ among the shuffle writers due to their different implementations:
* `BypassMergeSortShuffleWriter` - wraps each partition file
* `UnsafeShuffleWriter` - wraps each spill file directly, since spills don't require aggregation or sorting
* `SortShuffleWriter` - wraps the `ShufflePartitionPairsWriter` after spill files are merged, since merging might require aggregation or sorting
\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime.
And we use `Adler32`, which is similar to CRC-32 but much faster, to calculate the checksum, the same as `Broadcast`'s checksum.
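A minimal sketch of the idea (class and method names are illustrative, not Spark's exact implementation): a `CheckedOutputStream`-style wrapper whose checksum calculator can be swapped at runtime.
```scala
import java.io.OutputStream
import java.util.zip.{Adler32, Checksum}

// Updates a (replaceable) Checksum with every byte before passing it through.
class MutableCheckedOutputStreamSketch(out: OutputStream) extends OutputStream {
  private var checksum: Checksum = new Adler32 // default calculator

  def setChecksum(c: Checksum): Unit = { checksum = c } // swappable at runtime
  def getValue: Long = checksum.getValue

  override def write(b: Int): Unit = {
    checksum.update(b)
    out.write(b)
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    checksum.update(b, off, len)
    out.write(b, off, len)
  }
}
```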
### Why are the changes needed?
Shuffle data corruption is otherwise hard to detect and diagnose; writer-side checksums lay the groundwork for the broader corruption-diagnosis work in https://github.com/apache/spark/pull/32385.
### Does this PR introduce _any_ user-facing change?
Yes, added a new conf: `spark.shuffle.checksum`.
### How was this patch tested?
Added unit tests.
Closes #32401 from Ngone51/add-checksum-files.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR upgrades `GenJavadoc` plugin from `0.17` to `0.18`.
### Why are the changes needed?
`0.18` includes a bug fix for `Scala 2.13`.
```
This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.
```
https://github.com/lightbend/genjavadoc/releases/tag/v0.18
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Built the doc for Scala 2.13.
```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```
Closes #33383 from sarutak/upgrade-genjavadoc-0.18.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR upgrades SBT to `1.5.5`.
### Why are the changes needed?
SBT `1.5.5` was released, which includes 16 improvements/bug fixes.
https://github.com/sbt/sbt/releases/tag/v1.5.5
* Fixes remote caching not managing resource files
* Fixes launcher causing NoClassDefFoundError when launching sbt 1.4.0 - 1.4.2
* Fixes cross-Scala suffix conflict warning involving _3
* Fixes binaryScalaVersion of 3.0.1-SNAPSHOT
* Fixes carriage return in supershell progress state
* Fixes IntegrationTest configuration not tagged as test in BSP
* Fixes BSP task error handling
* Fixes handling of invalid range positions returned by Javac
* Fixes local class analysis
* Adds buildTarget/resources support for BSP
* Adds build.sbt support for BSP import
* Tracks source dependencies using OriginalTreeAttachments in Scala 2.13
* Reduces overhead in Analysis protobuf deserialization
* Minimizes unnecessary information in signature analysis
* Enables compile-to-jar for local Javac
* Enables Zinc cycle reporting when Scalac is not invoked
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #33312 from sarutak/upgrade-sbt-1.5.5.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update MiMa based on Apache Spark 3.1.1 (the first release on 3.1 line) for Apache Spark 3.2.0 release.
### Why are the changes needed?
Old MiMa rules hide the breaking changes in Apache Spark 3.2.0. We need to audit and document them correctly in the MiMa exclusion file. This issue was originally discussed here.
- https://github.com/apache/spark/pull/33196#issuecomment-873249068
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the CIs
Closes #33199 from dongjoon-hyun/SPARK-36004.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/31019, which forgot to update the SBT side to match.
### Why are the changes needed?
To use the same version in both Maven and SBT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should test them.
Closes #33207 from HyukjinKwon/SPARK-33996.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to upgrade sbt-antlr4 from 0.8.2 to 0.8.3 per the guides at https://github.com/ihji/sbt-antlr4
I can't find proper official docs for this.
### Why are the changes needed?
To stick to the guides in https://github.com/ihji/sbt-antlr4, and leverage the fixes included.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI in this PR should test it out.
Closes #33208 from HyukjinKwon/SPARK-36010.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.3.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.3.0 and the published snapshot version should not conflict with `branch-3.2`.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes #33196 from dongjoon-hyun/SPARK-35996.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR removes sbt-avro plugin dependency.
In the current master, the SBT build depends on the plugin, but it seems to be unused.
Originally, the plugin was introduced for `flume-sink` in SPARK-1729 (#807) but `flume-sink` is no longer in Spark repository.
After SBT was upgraded to 1.x in SPARK-21708 (#29286), `avroGenerate` part was introduced in `object SQL` in `SparkBuild.scala`.
It's confusing, but I understand `Test / avroGenerate := (Compile / avroGenerate).value` is for suppressing sbt-avro for the `sql` sub-module.
In fact, Test/compile will fail if `Test / avroGenerate := (Compile / avroGenerate).value` is commented out.
`sql` sub-module contains `parquet-compat.avpr` and `parquet-compat.avdl` but according to `sql/core/src/test/README.md`, they are intended to be handled by `gen-avro.sh`.
Also, in terms of Maven build, there seems to be no definition to handle `*.avpr` or `*.avdl`.
Based on the above, I think we can remove `sbt-avro`.
### Why are the changes needed?
If `sbt-avro` is really no longer used, it's confusing that `sbt-avro` related configurations remain in `SparkBuild.scala`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #33190 from sarutak/remove-avro-from-sbt.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR changes `SparkBuild.scala` to edit `config.properties` in the `yarn` sub-module in the SBT build, as the Maven build does.
### Why are the changes needed?
The `yarn` sub-module contains `config.properties`.
```
spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
```
The `${spark.yarn.isHadoopProvided}` part is replaced with `true` or `false` at build time depending on whether Hadoop is provided or not (specified by `-Phadoop-provided`).
The edited config.properties will be loaded at runtime to control how to populate Hadoop-related classpath.
This process works when building with Maven, but not with SBT.
If we build with SBT and deploy apps on YARN, the following warning appears and classpath is not populated correctly.
```
21/06/29 10:51:20 WARN config.package: Can not load the default value of `spark.yarn.isHadoopProvided` from `org/apache/spark/deploy/yarn/config.properties` with error, java.lang.IllegalArgumentException: For input string: "${spark.yarn.isHadoopProvided}". Using `false` as a default value.
```
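A hedged sketch of the kind of SBT change involved (the task wiring and property lookup are illustrative, not the exact `SparkBuild.scala` code): a resource generator that substitutes the Maven-style placeholder while copying `config.properties`.
```scala
// Sketch: replace ${spark.yarn.isHadoopProvided} the way Maven's
// resource filtering would, then emit the edited file as a managed resource.
Compile / resourceGenerators += Def.task {
  val in  = (Compile / resourceDirectory).value / "org" / "apache" / "spark" /
    "deploy" / "yarn" / "config.properties"
  val out = (Compile / resourceManaged).value / "org" / "apache" / "spark" /
    "deploy" / "yarn" / "config.properties"
  // "true" when the build enables the hadoop-provided profile (illustrative lookup).
  val isHadoopProvided = sys.props.getOrElse("spark.yarn.isHadoopProvided", "false")
  IO.write(out, IO.read(in).replace("${spark.yarn.isHadoopProvided}", isHadoopProvided))
  Seq(out)
}.taskValue
```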
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Built with SBT, extracted `config.properties` from the build artifact, and confirmed `${spark.yarn.isHadoopProvided}` was correctly replaced with `true` or `false`.
```
cat org/apache/spark/deploy/yarn/config.properties
spark.yarn.isHadoopProvided = false # In case build with -Pyarn and without -Phadoop-provided
spark.yarn.isHadoopProvided = true # In case build with -Pyarn and -Phadoop-provided
```
I also confirmed the warning message shown above no longer appears.
Closes #33121 from sarutak/sbt-yarn-config-properties.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 9.1
### Why are the changes needed?
The latest `xbean-asm9-shaded` is built with ASM 9.1.
- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33130 from dongjoon-hyun/SPARK-35928.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `sbt-mima-plugin` to 0.9.2 for Apache Spark 3.2.0.
### Why are the changes needed?
`sbt-mima-plugin` 0.9.2 has the following updates including `Scala 3 initial support`.
- https://github.com/lightbend/mima/releases/tag/0.9.2
- https://github.com/lightbend/mima/releases/tag/0.9.1
- https://github.com/lightbend/mima/releases/tag/0.9.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs. Also, I manually deleted some lines from the MiMa exclusion file and verified that it's detected correctly.
Closes #32981 from dongjoon-hyun/SPARK-35830.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to run `KubernetesLocalDiskShuffleDataIOSuite` on a dedicated JVM.
### Why are the changes needed?
In Jenkins environment, `KubernetesLocalDiskShuffleDataIOSuite` and `ExternalShuffleServiceSuite` currently hit issues.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140019/
![Screen Shot 2021-06-19 at 10 33 20 AM](https://user-images.githubusercontent.com/9700541/122650832-d9810200-d0e9-11eb-9f2a-4fb44bb874f3.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins.
Closes #32976 from dongjoon-hyun/SPARK-35593-3.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.4.
### Why are the changes needed?
SBT 1.5.4 was released 5 days ago.
- https://github.com/sbt/sbt/releases/tag/v1.5.4
This will bring the latest bug fixes like the following.
- Fixes BSP on ARM Macs by keeping JNI server socket to keep using JNI
- Fixes compiler ClassLoader list to use compilerJars.toList (For Scala 3, this drops support for 3.0.0-M2)
- Fixes undercompilation of package object causing "Symbol 'type X' is missing from the classpath"
- Fixes overcompilation with scalac -release flag
- Fixes build/exit notification not closing BSP channel
- Fixes POM file's Maven repository ID character restriction to match that of Maven
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32966 from dongjoon-hyun/SPARK-35818.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This change is for [SPARK-35757](https://issues.apache.org/jira/browse/SPARK-35757) and does the following:
1. adds a bitwise AND operation to `BitArray` (similar to the existing `putAll` method)
2. adds an intersect operation for combining bloom filters using the bitwise AND operation (similar to the existing `mergeInPlace` method).
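A minimal sketch of the idea under simplified types (not Spark's exact `BitArray`/`BloomFilterImpl` code): intersecting two filters ANDs their bit words, so an item still tests positive only if it was (probably) inserted into both.
```scala
class SimpleBitArray(val words: Array[Long]) {
  // Bitwise AND counterpart of putAll's bitwise OR.
  def and(other: SimpleBitArray): Unit = {
    require(words.length == other.words.length, "bit arrays must be the same size")
    var i = 0
    while (i < words.length) {
      words(i) &= other.words(i)
      i += 1
    }
  }
}
```
Like `mergeInPlace`, an intersect built on this must first check that both filters share the same bit size and number of hash functions; otherwise the bit positions are not comparable.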
### Why are the changes needed?
The current bloom filter library only allows combining two bloom filters using an OR operation. It is useful to have an AND operation as well.
### Does this PR introduce _any_ user-facing change?
No, just adds new methods.
### How was this patch tested?
Just the existing tests.
Closes #32907 from kudhru/master.
Lead-authored-by: kudhru <gargdhruv36@gmail.com>
Co-authored-by: Dhruv Kumar <kudhru@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to upgrade SBT to 1.5.3.
### Why are the changes needed?
This release seems to include a bug fix for Scala 2.13.6+ and Scala 3.
https://github.com/sbt/sbt/releases/tag/v1.5.3
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #32792 from sarutak/upgrade-sbt-1.5.3.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There have been several PRs to fix compilation warnings related to `procedure syntax`, like SPARK-29291, SPARK-33352 and SPARK-35526. In order to prevent the recurrence of similar problems, this PR adds a compile arg that converts `procedure syntax` related compilation warnings into compilation errors in Scala 2.13.
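A hedged sketch of the kind of compile arg involved (the exact filter string in `SparkBuild.scala` may differ): Scala 2.13's `-Wconf` can escalate a specific warning message to an error.
```scala
// Added to the Scala 2.13 scalacOptions (sketch): any "procedure syntax is
// deprecated" warning now fails the compilation instead of just warning.
Compile / scalacOptions ++= Seq(
  "-Wconf:cat=deprecation&msg=procedure syntax is deprecated:e"
)
```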
### Why are the changes needed?
Prevent the recurrence of compilation warnings related to `procedure syntax is deprecated`
### Does this PR introduce _any_ user-facing change?
`procedure syntax` is no longer allowed in Spark code with Scala 2.13: constructors must be defined as `this(...) = { }`, not `this(...) { }`, and methods without an explicit return type must be defined as `def methodName(...): Unit = {}`, not `def methodName(...) {}`.
### How was this patch tested?
- Pass the GitHub Action Scala 2.13 job
- Manual test:
Make a code change like:
```
Index: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
===================================================================
@@ -67,7 +67,7 @@
private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
extends SparkListener with ThreadSafeRpcEndpoint with Logging {
- def this(sc: SparkContext) = {
+ def this(sc: SparkContext) {
this(sc, new SystemClock)
}
Index: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
===================================================================
@@ -720,7 +720,7 @@
}
}
- def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus): Unit = {
+ def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) {
shuffleStatuses(shuffleId).addMergeResult(reduceId, status)
}
```
**sbt with Scala 2.13 profile compile failed as follows:**
```
[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70:29: procedure syntax is deprecated for constructors: add `=`, as in method definition
[error] def this(sc: SparkContext) {
[error] ^
[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723:79: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[error] def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) {
[error] ^
[error] two errors found
[error] (core / Compile / compileIncremental) Compilation failed
[error] Total time: 136 s (02:16), completed May 31, 2021 10:06:50 AM
Error: Process completed with exit code 1.
```
**maven with Scala 2.13 profile compile failed as follows:**
```
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[ERROR] two errors found
```
Closes #32710 from LuciferYang/SPARK-35574.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade ASM to 7.3.1.
- https://issues.apache.org/jira/browse/XBEAN-323
- https://asm.ow2.io/versions.html
### Why are the changes needed?
ASM 7.3.1 brings the following changes:
- new V15 constant
- experimental support for PermittedSubtypes and RecordComponent
- bug fixes
  - 317885: SKIP_DEBUG now skips MethodParameters attributes
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran with the existing UTs
Closes #32634 from vinodkc/br_build_upgrade_asm.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.2 for better Scala 2.13.x support.
### Why are the changes needed?
SBT 1.5.2 Release Note: https://github.com/sbt/sbt/releases/tag/v1.5.2
- Fixes ConcurrentModificationException while compiling Scala 2.13.4 and Java sources zinc
- Uses -Duser.home instead of $HOME to download launcher JAR
- Fixes -client by making it the same as --client
- Fixes metabuild ClassLoader missing util-interface
- Fixes sbt new leaving behind target directory
- Fixes "zip END header not found" error during pushRemoteCache
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #32565 from dongjoon-hyun/SPARK-35417.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades `GenJavadoc` to `0.17`.
### Why are the changes needed?
This version seems to include a fix for an issue which can happen with Scala 2.13.5.
https://github.com/lightbend/genjavadoc/releases/tag/v0.17
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed build succeed with the following commands.
```
# For Scala 2.12
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc
# For Scala 2.13
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```
Closes #32392 from sarutak/upgrade-genjavadoc-0.17.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.1.
### Why are the changes needed?
https://github.com/sbt/sbt/releases/tag/v1.5.1
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the SBT CIs (Build/Test/Docs/Plugins).
Closes #32382 from lipzhu/SPARK-35254.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`WritablePartitionedIterator` is defined in `WritablePartitionedPairCollection.scala`, and there are two implementations of this trait, but the code of the two implementations is duplicated.
The main change of this PR is to turn `WritablePartitionedIterator` from a trait into a concrete class, because there is effectively only one implementation now.
### Why are the changes needed?
Cleanup duplicate code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #32232 from LuciferYang/writable-partitioned-iterator.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
### What changes were proposed in this pull request?
This PR proposes a change that allows us to build SparkR with SBT.
### Why are the changes needed?
In the current master, SparkR can be built only with Maven.
It's helpful if we can also build it with SBT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
```
build/sbt -Psparkr package
```
Closes #32285 from sarutak/sbt-sparkr.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The following logs are printed when Jenkins executes [PySpark pip packaging tests](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137500/console):
```
copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```
Two different versions of netty4 jars are copied to the jars directory, even though the `netty-xxx-4.1.50.Final.jar` artifacts do not appear in Maven's `dependency:tree` and Spark only needs to rely on `netty-all-xxx.jar`.
So this PR adds new `ExclusionRule`s to `SparkBuild.scala` to exclude the unnecessary netty 4 dependencies.
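A hedged sketch of the kind of rule involved (the module list and wiring are illustrative, not the exact `SparkBuild.scala` change): sbt `ExclusionRule`s drop the per-module netty 4 artifacts so that only `netty-all` remains.
```scala
// Sketch: exclude the split netty modules pulled in transitively,
// keeping only netty-all on the classpath.
libraryDependencies ~= (_.map(_.excludeAll(
  ExclusionRule("io.netty", "netty-buffer"),
  ExclusionRule("io.netty", "netty-codec"),
  ExclusionRule("io.netty", "netty-common"),
  ExclusionRule("io.netty", "netty-handler"),
  ExclusionRule("io.netty", "netty-resolver"),
  ExclusionRule("io.netty", "netty-transport")
)))
```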
### Why are the changes needed?
Make sure that only `netty-all-xxx.jar` is used in the test to avoid possible jar conflicts.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Check Jenkins log manually, there should be only
`copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars`
and there should be no such logs as
```
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```
Closes #32230 from LuciferYang/SPARK-35134.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch introduces a VectorizedBLAS class which implements hardware-accelerated BLAS operations using the Vector API. This feature is hidden behind the "vectorized" profile that you can enable by passing "-Pvectorized" to sbt or maven.
The Vector API was introduced (as an incubator module) in JDK 16. Following discussion on the mailing list, this API is introduced transparently and needs to be enabled explicitly.
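A hedged sketch of the technique (method shapes are illustrative, not the actual `VectorizedBLAS` code; requires JDK 16+ with `--add-modules=jdk.incubator.vector`): `daxpy` computes `y := alpha * x + y` over SIMD lanes with a scalar tail loop.
```scala
import jdk.incubator.vector.{DoubleVector, VectorSpecies}

object DaxpySketch {
  private val SPECIES: VectorSpecies[java.lang.Double] = DoubleVector.SPECIES_PREFERRED

  def daxpy(n: Int, alpha: Double, x: Array[Double], y: Array[Double]): Unit = {
    var i = 0
    val bound = SPECIES.loopBound(n)
    while (i < bound) {                     // vectorized main loop
      val vx = DoubleVector.fromArray(SPECIES, x, i)
      val vy = DoubleVector.fromArray(SPECIES, y, i)
      vx.mul(alpha).add(vy).intoArray(y, i) // y = alpha * x + y
      i += SPECIES.length()
    }
    while (i < n) {                         // scalar tail for leftover lanes
      y(i) += alpha * x(i)
      i += 1
    }
  }
}
```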
### Why are the changes needed?
Whenever a native BLAS implementation isn't available on the system, Spark automatically falls back onto a Java implementation. With the recent release of the Vector API in the OpenJDK [1], we can use hardware acceleration for such operations.
This change was also discussed on the mailing list. [2]
### Does this PR introduce _any_ user-facing change?
It introduces a build-time profile called `vectorized`. You can pass it to sbt and mvn with `-Pvectorized`. There is no change for end-users of Spark; it should only impact Spark developers. It is also disabled by default.
### How was this patch tested?
It passes `build/sbt mllib-local/test` with and without `-Pvectorized` with JDK 16. This patch also introduces benchmarks for BLAS.
The benchmark results are as follows:
```
[info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 37 37 0 271.5 3.7 1.0X
[info] vector 24 25 4 416.1 2.4 1.5X
[info]
[info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 70 70 0 143.2 7.0 1.0X
[info] vector 35 35 2 288.7 3.5 2.0X
[info]
[info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 50 51 1 199.8 5.0 1.0X
[info] vector 15 15 0 648.7 1.5 3.2X
[info]
[info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 34 34 0 295.6 3.4 1.0X
[info] vector 19 19 0 531.2 1.9 1.8X
[info]
[info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 25 25 1 399.0 2.5 1.0X
[info] vector 8 9 1 1177.3 0.8 3.0X
[info]
[info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 27 27 0 0.0 26651.5 1.0X
[info] vector 21 21 0 0.0 20646.3 1.3X
[info]
[info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 36 36 0 0.0 35501.4 1.0X
[info] vector 22 22 0 0.0 21930.3 1.6X
[info]
[info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 20 20 0 0.0 20283.3 1.0X
[info] vector 9 9 0 0.1 8657.7 2.3X
[info]
[info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 30 30 0 0.0 29845.8 1.0X
[info] vector 10 10 1 0.1 9695.4 3.1X
[info]
[info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 182 182 0 0.5 1820.0 1.0X
[info] vector 160 160 1 0.6 1597.6 1.1X
[info]
[info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 211 211 1 0.5 2106.2 1.0X
[info] vector 156 157 0 0.6 1564.4 1.3X
[info]
[info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j 276 276 0 0.4 2757.8 1.0X
[info] vector 137 137 0 0.7 1365.1 2.0X
```
/cc srowen xkrogen
[1] https://openjdk.java.net/jeps/338
[2] https://mail-archives.apache.org/mod_mbox/spark-dev/202012.mbox/%3cDM5PR2101MB11106162BB3AF32AD29C6C79B0C69DM5PR2101MB1110.namprd21.prod.outlook.com%3e
Closes #30810 from luhenry/master.
Lead-authored-by: Ludovic Henry <luhenry@microsoft.com>
Co-authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
SBT 1.5.0 deprecates the `in` syntax from 0.13.x, so adjusting the build files is recommended.
See https://www.scala-sbt.org/1.x/docs/Migrating-from-sbt-013x.html#Migrating+to+slash+syntax
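For illustration (the keys are generic, not Spark's actual settings), the migration looks like this:
```scala
// sbt 0.13-style `in` scoping, deprecated since sbt 1.5.0:
//   parallelExecution in Test := false
//   scalacOptions in (Compile, console) += "-deprecation"
// Equivalent slash syntax:
Test / parallelExecution := false
Compile / console / scalacOptions += "-deprecation"
```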
### Why are the changes needed?
Removes a significant amount of deprecation warnings and prepares for syntax removal in future versions of SBT.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build should pass on GH Actions.
Closes #32115 from gemelen/feature/sbt-1.5-fixes.
Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.0.
### Why are the changes needed?
SBT 1.5.0 was released yesterday with built-in Scala 3 support.
- https://github.com/sbt/sbt/releases/tag/v1.5.0
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the SBT CIs (Build/Test/Docs/Plugins).
Closes #32055 from dongjoon-hyun/SPARK-34959.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is to support nested column type in Spark ORC vectorized reader. Currently ORC vectorized reader [does not support nested column type (struct, array and map)](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L138). We implemented nested column vectorized reader for FB-ORC in our internal fork of Spark. We are seeing performance improvement compared to non-vectorized reader when reading nested columns. In addition, this can also help improve the non-nested column performance when reading non-nested and nested columns together in one query.
Before this PR:
* `OrcColumnVector` is the implementation class for Spark's `ColumnVector` to wrap Hive's/ORC's `ColumnVector` to read `AtomicType` data.
After this PR:
* `OrcColumnVector` is an abstract class to keep interface being shared between multiple implementation class of orc column vectors, namely `OrcAtomicColumnVector` (for `AtomicType`), `OrcArrayColumnVector` (for `ArrayType`), `OrcMapColumnVector` (for `MapType`), `OrcStructColumnVector` (for `StructType`). So the original logic to read `AtomicType` data is moved from `OrcColumnVector` to `OrcAtomicColumnVector`. The abstract class of `OrcColumnVector` is needed here because of supporting nested column (i.e. nested column vectors).
* A utility method `OrcColumnVectorUtils.toOrcColumnVector` is added to create Spark's `OrcColumnVector` from Hive's/ORC's `ColumnVector`.
* A new user-facing config `spark.sql.orc.enableNestedColumnVectorizedReader` is added to control enabling/disabling the vectorized reader for nested columns. The default value is false (i.e. disabled by default). For certain tables having deeply nested columns, the vectorized reader might take too much memory for the per-sub-column vectors compared to the non-vectorized reader, so the config provides a way to work around OOM for queries reading wide and deeply nested columns. We plan to enable it by default in 3.3 and leave it disabled in 3.2 in case of any unknown bugs. A usage sketch follows this list.
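A hedged usage sketch (the path is illustrative; the config name comes from this PR):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-nested-demo").getOrCreate()
// Off by default in 3.2; opt in to vectorized reads of struct/array/map columns.
spark.conf.set("spark.sql.orc.enableNestedColumnVectorizedReader", "true")
spark.read.orc("/path/to/nested.orc").show()
```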
### Why are the changes needed?
Improve query performance when reading nested columns from ORC file format.
Tested by locally adding a small benchmark in `OrcReadBenchmark.scala`. Seeing a more than 2x run time improvement.
```
Running benchmark: SQL Nested Column Scan
Running case: Native ORC MR
Stopped after 2 iterations, 37850 ms
Running case: Native ORC Vectorized (Enabled Nested Column)
Stopped after 2 iterations, 15892 ms
Running case: Native ORC Vectorized (Disabled Nested Column)
Stopped after 2 iterations, 37954 ms
Running case: Hive built-in ORC
Stopped after 2 iterations, 35118 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
SQL Nested Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------
Native ORC MR 18706 18925 310 0.1 17839.6 1.0X
Native ORC Vectorized (Enabled Nested Column) 7625 7946 455 0.1 7271.6 2.5X
Native ORC Vectorized (Disabled Nested Column) 18415 18977 796 0.1 17561.5 1.0X
Hive built-in ORC 17469 17559 127 0.1 16660.1 1.1X
```
Benchmark:
```
nestedColumnScanBenchmark(1024 * 1024)
def nestedColumnScanBenchmark(values: Int): Unit = {
val benchmark = new Benchmark(s"SQL Nested Column Scan", values, output = output)
withTempPath { dir =>
withTempTable("t1", "nativeOrcTable", "hiveOrcTable") {
import spark.implicits._
spark.range(values).map(_ => Random.nextLong).map { x =>
val arrayOfStructColumn = (0 until 5).map(i => (x + i, s"$x" * 5))
val mapOfStructColumn = Map(
s"$x" -> (x * 0.1, (x, s"$x" * 100)),
(s"$x" * 2) -> (x * 0.2, (x, s"$x" * 200)),
(s"$x" * 3) -> (x * 0.3, (x, s"$x" * 300)))
(arrayOfStructColumn, mapOfStructColumn)
}.toDF("col1", "col2")
.createOrReplaceTempView("t1")
prepareTable(dir, spark.sql(s"SELECT * FROM t1"))
benchmark.addCase("Native ORC MR") { _ =>
withSQLConf(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false") {
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
}
benchmark.addCase("Native ORC Vectorized (Enabled Nested Column)") { _ =>
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
benchmark.addCase("Native ORC Vectorized (Disabled Nested Column)") { _ =>
withSQLConf(SQLConf.ORC_VECTORIZED_READER_NESTED_COLUMN_ENABLED.key -> "false") {
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM nativeOrcTable").noop()
}
}
benchmark.addCase("Hive built-in ORC") { _ =>
spark.sql("SELECT SUM(SIZE(col1)), SUM(SIZE(col2)) FROM hiveOrcTable").noop()
}
benchmark.run()
}
}
}
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added one simple test in `OrcSourceSuite.scala` to verify correctness.
We definitely need more unit tests and to add the benchmark here, but I want to collect feedback first before crafting more tests.
Closes #31958 from c21/orc-vector.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Update the Avro version to 1.10.2
### Why are the changes needed?
To stay up to date with upstream and catch compatibility issues with zstd
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes #31866 from iemejia/SPARK-27733-upgrade-avro-1.10.2.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.7 to 1.4.9.
### Why are the changes needed?
This will bring the following bug fixes and improvements.
- https://github.com/sbt/sbt/releases/tag/v1.4.9
- https://github.com/sbt/sbt/releases/tag/v1.4.8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #31828 from williamhyun/sbt149.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes a new feature that allows developers to debug test code using JDWP with sbt and Maven.
More specifically, this PR introduces the following profile and options.
* `jdwp-test-debug`: A profile which enables/disables JDWP debugging
* `test.jdwp.address`: An option which corresponds to `address` option in JDWP
* `test.jdwp.suspend`: An option which corresponds to `suspend` option in JDWP
* `test.jdwp.server`: An option which corresponds to `server` option in JDWP
* `test.debug.suite`: An option which controls whether to debug ScalaTest suites (Maven only)
For `sbt`, this feature can be used like `build/sbt -Pjdwp-test-debug -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y` and can be used for both JUnit tests and ScalaTest tests.
For `Maven`, this feature can be used like as follows:
(For JUnit tests) `build/mvn -Pjdwp-test-debug -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y`
(For ScalaTest suites) `build/mvn -Pjdwp-test-debug -Dtest.debug.suite=true -Dtest.jdwp.address=localhost:9876 -Dtest.jdwp.suspend=y -Dtest.jdwp.server=y` (It might be useful to specify specific sub-modules like `-pl sql/core,sql/catalyst`).
### Why are the changes needed?
It's useful to debug test code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed the following things.
* `jdwp-test-debug` can switch JDWP enabled/disabled
* `test.jdwp.address` can change address and port.
* `test.jdwp.suspend` can change the behavior that the target debugee suspends or not.
* `test.jdwp.server` can change the behavior that the JDWP debugger run as a server or client.
* ScalaTest suites can be debugged with Maven with setting `test.debug.suite` to `true`.
Closes #31706 from sarutak/sbt-jdwp.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. This PR aims to ignore ORC encryption tests when the ORC shim has been loaded with old Hadoop libraries by some other tests. The test coverage is preserved by Jenkins SBT runs and GitHub Action jobs. This PR only aims to recover the Maven Jenkins jobs.
2. In addition, this PR simplifies SBT testing by refactoring the test config into `SparkBuild.scala`/`pom.xml` and removing `DedicatedJVMTest`. This will remove one GitHub Action job which was recently added for the `DedicatedJVMTest` tag.
### Why are the changes needed?
Currently, Maven test fails when it runs in a batch mode because `HadoopShimsPre2_3$NullKeyProvider` is loaded.
**MVN COMMAND**
```
$ mvn test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite,org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite
```
**BEFORE**
```
- Write and read an encrypted table *** FAILED ***
...
Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.IllegalArgumentException: Unknown key pii
at org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider.getCurrentKeyVersion(HadoopShimsPre2_3.java:71)
at org.apache.orc.impl.WriterImpl.getKey(WriterImpl.java:871)
```
**AFTER**
```
OrcV1QuerySuite
...
OrcEncryptionSuite:
- Write and read an encrypted file !!! CANCELED !!!
[] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider1b705f65 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:39)
- Write and read an encrypted table !!! CANCELED !!!
[] was empty org.apache.orc.impl.HadoopShimsPre2_3$NullKeyProvider22adeee1 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:67)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins Maven tests.
For SBT command,
- the test suite required a dedicated JVM (Before)
- the test suite doesn't require a dedicated JVM (After)
```
$ build/sbt "sql/testOnly *.OrcV1QuerySuite *.OrcEncryptionSuite"
...
[info] OrcV1QuerySuite
...
[info] - SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core (26 milliseconds)
[info] OrcEncryptionSuite:
[info] - Write and read an encrypted file (431 milliseconds)
[info] - Write and read an encrypted table (359 milliseconds)
[info] All tests passed.
[info] Passed: Total 35, Failed 0, Errors 0, Passed 35
```
Closes #31697 from dongjoon-hyun/SPARK-34578-TEST.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a retry of #31065. Last time, the newly added test cases passed in Jenkins and individually, but the PR was reverted because they failed when `GitHub Action` ran with `SERIAL_SBT_TESTS=1`.
In this PR, `SecurityTest` tag is used to isolate `KeyProvider`.
This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`.
Please note that we will improve this further in both Apache Spark and Apache ORC in the Apache Spark 3.2.0 timeframe.
### Why are the changes needed?
Apache ORC 1.6 supports columnar encryption.
### Does this PR introduce _any_ user-facing change?
No. This is for a test case.
### How was this patch tested?
Pass the newly added test suite.
Closes #31603 from dongjoon-hyun/SPARK-34486-RETRY.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to update the sbt version to 1.4.7.
### Why are the changes needed?
This will bring the latest bug fixes and improvements.
- https://github.com/sbt/sbt/releases/tag/v1.4.7
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #31555 from williamhyun/sbt147.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983:
- Remove the API tag `Unstable` for `HiveSessionStateBuilder`
- Add documentation for the `spark.sql.hive` package to emphasize that it's a private package
### Why are the changes needed?
Follow the rule for a private package.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #31321 from xuanyuanking/SPARK-34185-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update Avro dependency to version 1.10.1
### Why are the changes needed?
To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Since no API changes were required, we just ran the existing tests.
Closes #31232 from iemejia/SPARK-27733-avro-upgrade.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is the 3rd attempt to upgrade Scala 2.12.x, in order to assess feasibility.
- https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum )
- https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya )
`silencer` library is updated accordingly. And, Kafka version upgrade is required because it fails like the following.
```
[info] KafkaDataConsumerSuite:
[info] org.apache.spark.streaming.kafka010.KafkaDataConsumerSuite *** ABORTED *** (1 second, 580 milliseconds)
[info] java.lang.NoClassDefFoundError: scala/math/Ordering$$anon$7
[info] at kafka.api.ApiVersion$.orderingByVersion(ApiVersion.scala:45)
```
### Why are the changes needed?
Apache Spark was stuck at 2.12.10 due to regressions in Scala 2.12.11 and 2.12.12. This will bring all the bug fixes.
- https://github.com/scala/scala/releases/tag/v2.12.13
- https://github.com/scala/scala/releases/tag/v2.12.12
- https://github.com/scala/scala/releases/tag/v2.12.11
### Does this PR introduce _any_ user-facing change?
Yes, but this is a bug-fixed version.
### How was this patch tested?
Pass the CIs.
Closes #31223 from dongjoon-hyun/SPARK-31168.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to fix `MiMaExcludes` rule by moving SPARK-23429 from 2.4 to 3.0.
### Why are the changes needed?
SPARK-23429 was added at Apache Spark 3.0.0.
This should land on `master`, `branch-3.1`, and `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the MiMa rule.
Closes #31174 from dongjoon-hyun/SPARK-34103.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>