ae82768c13
### What changes were proposed in this pull request?

Adding codegen for shuffled hash join. Shuffled hash join codegen is very similar to broadcast hash join codegen, so most of the change is refactoring the existing codegen from `BroadcastHashJoinExec` into `HashJoin`.

Example codegen for the query in [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153):

```
def shuffleHashJoin(): Unit = {
  val N: Long = 4 << 20
  withSQLConf(
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "10000000",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
    codegenBenchmark("shuffle hash join", N) {
      val df1 = spark.range(N).selectExpr(s"id as k1")
      val df2 = spark.range(N / 3).selectExpr(s"id * 3 as k2")
      val df = df1.join(df2, col("k1") === col("k2"))
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[ShuffledHashJoinExec]).isDefined)
      df.noop()
    }
  }
}
```

Shuffled hash join codegen:

```
== Subtree 3 / 3 (maxMethodCodeSize:113; maxConstantPoolSize:126(0.19% used); numInnerClasses:0) ==
*(3) ShuffledHashJoin [k1#2L], [k2#6L], Inner, BuildRight
:- *(1) Project [id#0L AS k1#2L]
:  +- *(1) Range (0, 4194304, step=1, splits=1)
+- *(2) Project [(id#4L * 3) AS k2#6L]
   +- *(2) Range (0, 1398101, step=1, splits=1)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage3(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=3
/* 006 */ final class GeneratedIteratorForCodegenStage3 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator inputadapter_input_0;
/* 010 */   private org.apache.spark.sql.execution.joins.HashedRelation shj_relation_0;
/* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] shj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 012 */
/* 013 */   public GeneratedIteratorForCodegenStage3(Object[] references) {
/* 014 */     this.references = references;
/* 015 */   }
/* 016 */
/* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 018 */     partitionIndex = index;
/* 019 */     this.inputs = inputs;
/* 020 */     inputadapter_input_0 = inputs[0];
/* 021 */     shj_relation_0 = ((org.apache.spark.sql.execution.joins.ShuffledHashJoinExec) references[0] /* plan */).buildHashedRelation(inputs[1]);
/* 022 */     shj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 023 */
/* 024 */   }
/* 025 */
/* 026 */   private void shj_doConsume_0(InternalRow inputadapter_row_0, long shj_expr_0_0) throws java.io.IOException {
/* 027 */     // generate join key for stream side
/* 028 */
/* 029 */     // find matches from HashRelation
/* 030 */     scala.collection.Iterator shj_matches_0 = false ?
/* 031 */     null : (scala.collection.Iterator)shj_relation_0.get(shj_expr_0_0);
/* 032 */     if (shj_matches_0 != null) {
/* 033 */       while (shj_matches_0.hasNext()) {
/* 034 */         UnsafeRow shj_matched_0 = (UnsafeRow) shj_matches_0.next();
/* 035 */         {
/* 036 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
/* 037 */
/* 038 */           long shj_value_1 = shj_matched_0.getLong(0);
/* 039 */           shj_mutableStateArray_0[0].reset();
/* 040 */
/* 041 */           shj_mutableStateArray_0[0].write(0, shj_expr_0_0);
/* 042 */
/* 043 */           shj_mutableStateArray_0[0].write(1, shj_value_1);
/* 044 */           append((shj_mutableStateArray_0[0].getRow()).copy());
/* 045 */
/* 046 */         }
/* 047 */       }
/* 048 */     }
/* 049 */
/* 050 */   }
/* 051 */
/* 052 */   protected void processNext() throws java.io.IOException {
/* 053 */     while ( inputadapter_input_0.hasNext()) {
/* 054 */       InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();
/* 055 */
/* 056 */       long inputadapter_value_0 = inputadapter_row_0.getLong(0);
/* 057 */
/* 058 */       shj_doConsume_0(inputadapter_row_0, inputadapter_value_0);
/* 059 */       if (shouldStop()) return;
/* 060 */     }
/* 061 */   }
/* 062 */
/* 063 */ }
```

Broadcast hash join codegen for the same query (for reference here):

```
== Subtree 2 / 2 (maxMethodCodeSize:280; maxConstantPoolSize:218(0.33% used); numInnerClasses:0) ==
*(2) BroadcastHashJoin [k1#2L], [k2#6L], Inner, BuildRight, false
:- *(2) Project [id#0L AS k1#2L]
:  +- *(2) Range (0, 4194304, step=1, splits=1)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#22]
   +- *(1) Project [(id#4L * 3) AS k2#6L]
      +- *(1) Range (0, 1398101, step=1, splits=1)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private org.apache.spark.sql.execution.joins.LongHashedRelation bhj_relation_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */
/* 032 */     bhj_relation_0 = ((org.apache.spark.sql.execution.joins.LongHashedRelation) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcast */).value()).asReadOnlyCopy();
/* 033 */     incPeakExecutionMemory(bhj_relation_0.estimatedSize());
/* 034 */
/* 035 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 036 */
/* 037 */   }
/* 038 */
/* 039 */   private void initRange(int idx) {
/* 040 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 041 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(1L);
/* 042 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4194304L);
/* 043 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 044 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 045 */     long partitionEnd;
/* 046 */
/* 047 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 048 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 049 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 050 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 051 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 052 */     } else {
/* 053 */       range_nextIndex_0 = st.longValue();
/* 054 */     }
/* 055 */     range_batchEnd_0 = range_nextIndex_0;
/* 056 */
/* 057 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 058 */     .multiply(step).add(start);
/* 059 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 060 */       partitionEnd = Long.MAX_VALUE;
/* 061 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 062 */       partitionEnd = Long.MIN_VALUE;
/* 063 */     } else {
/* 064 */       partitionEnd = end.longValue();
/* 065 */     }
/* 066 */
/* 067 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 068 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 069 */     range_numElementsTodo_0 = startToEnd.divide(step).longValue();
/* 070 */     if (range_numElementsTodo_0 < 0) {
/* 071 */       range_numElementsTodo_0 = 0;
/* 072 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 073 */       range_numElementsTodo_0++;
/* 074 */     }
/* 075 */   }
/* 076 */
/* 077 */   private void bhj_doConsume_0(long bhj_expr_0_0) throws java.io.IOException {
/* 078 */     // generate join key for stream side
/* 079 */
/* 080 */     // find matches from HashedRelation
/* 081 */     UnsafeRow bhj_matched_0 = false ? null: (UnsafeRow)bhj_relation_0.getValue(bhj_expr_0_0);
/* 082 */     if (bhj_matched_0 != null) {
/* 083 */       {
/* 084 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 085 */
/* 086 */         long bhj_value_2 = bhj_matched_0.getLong(0);
/* 087 */         range_mutableStateArray_0[3].reset();
/* 088 */
/* 089 */         range_mutableStateArray_0[3].write(0, bhj_expr_0_0);
/* 090 */
/* 091 */         range_mutableStateArray_0[3].write(1, bhj_value_2);
/* 092 */         append((range_mutableStateArray_0[3].getRow()));
/* 093 */
/* 094 */       }
/* 095 */     }
/* 096 */
/* 097 */   }
/* 098 */
/* 099 */   protected void processNext() throws java.io.IOException {
/* 100 */     // initialize Range
/* 101 */     if (!range_initRange_0) {
/* 102 */       range_initRange_0 = true;
/* 103 */       initRange(partitionIndex);
/* 104 */     }
/* 105 */
/* 106 */     while (true) {
/* 107 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 108 */         long range_nextBatchTodo_0;
/* 109 */         if (range_numElementsTodo_0 > 1000L) {
/* 110 */           range_nextBatchTodo_0 = 1000L;
/* 111 */           range_numElementsTodo_0 -= 1000L;
/* 112 */         } else {
/* 113 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 114 */           range_numElementsTodo_0 = 0;
/* 115 */           if (range_nextBatchTodo_0 == 0) break;
/* 116 */         }
/* 117 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 118 */       }
/* 119 */
/* 120 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 121 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 122 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 123 */
/* 124 */         bhj_doConsume_0(range_value_0);
/* 125 */
/* 126 */         if (shouldStop()) {
/* 127 */           range_nextIndex_0 = range_value_0 + 1L;
/* 128 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 129 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 130 */           return;
/* 131 */         }
/* 132 */
/* 133 */       }
/* 134 */       range_nextIndex_0 = range_batchEnd_0;
/* 135 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 136 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 137 */       range_taskContext_0.killTaskIfInterrupted();
/* 138 */     }
/* 139 */   }
/* 140 */
/* 141 */ }
```

### Why are the changes needed?

Code generation for shuffled hash join helps save CPU cost. We added shuffled hash join codegen internally in our fork and saw a clear improvement in benchmarks compared to the current non-codegen code path.

Testing the example query in [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153), we see about a 30% wall-clock time improvement compared to the existing non-codegen code path.

With shuffled hash join codegen enabled:

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join wholestage off
  Stopped after 2 iterations, 1358 ms
  Running case: shuffle hash join wholestage on
  Stopped after 5 iterations, 2323 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join wholestage off                    649            679          43          6.5         154.7       1.0X
shuffle hash join wholestage on                     436            465          45          9.6         103.9       1.5X
```

With shuffled hash join codegen disabled:

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join wholestage off
  Stopped after 2 iterations, 1345 ms
  Running case: shuffle hash join wholestage on
  Stopped after 5 iterations, 2967 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join wholestage off                    646            673          37          6.5         154.1       1.0X
shuffle hash join wholestage on                     549            594          47          7.6         130.9       1.2X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `WholeStageCodegenSuite`.

Closes #29277 from c21/codegen.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
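For readers who want to reproduce the plan and generated code above outside of the benchmark harness, here is a minimal standalone sketch (not part of this patch; the object name, master setting, and data sizes are only illustrative). It mirrors the `JoinBenchmark` configuration so that the planner picks `ShuffledHashJoinExec`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.col

object ShuffledHashJoinCodegenExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("shuffled-hash-join-codegen")
      // Mirror JoinBenchmark: keep the build side above the broadcast threshold but
      // small enough for a per-partition hash map, and de-prefer sort-merge join.
      .config("spark.sql.shuffle.partitions", "2")
      .config("spark.sql.autoBroadcastJoinThreshold", "10000000")
      .config("spark.sql.join.preferSortMergeJoin", "false")
      .getOrCreate()

    val n = 4L << 20
    val df1 = spark.range(n).selectExpr("id as k1")
    val df2 = spark.range(n / 3).selectExpr("id * 3 as k2")
    val joined = df1.join(df2, col("k1") === col("k2"))

    // Prints the whole-stage codegen subtrees; with this patch applied, the join
    // stage should contain the shj_* code shown in the dump above.
    joined.debugCodegen()

    spark.stop()
  }
}
```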
# Apache Spark
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
## Online Documentation
You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.
## Building Spark
Spark is built using Apache Maven. To build Spark and its example programs, run:
```
./build/mvn -DskipTests clean package
```
(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at "Building Spark".
For general development tips, including info on developing Spark using an IDE, see "Useful Developer Tools".
## Interactive Scala Shell
The easiest way to start using Spark is through the Scala shell:
```
./bin/spark-shell
```
Try the following command, which should return 1,000,000,000:
```
scala> spark.range(1000 * 1000 * 1000).count()
```
## Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:
```
./bin/pyspark
```
And run the following command, which should also return 1,000,000,000:
```
>>> spark.range(1000 * 1000 * 1000).count()
```
## Example Programs
Spark also comes with several sample programs in the `examples` directory. To run one of them, use `./bin/run-example <class> [params]`. For example:

```
./bin/run-example SparkPi
```
will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the `examples` package. For instance:

```
MASTER=spark://host:7077 ./bin/run-example SparkPi
```
Many of the example programs print usage help if no params are given.
## Running Tests
Testing first requires building Spark. Once Spark is built, tests can be run using:
```
./dev/run-tests
```
Please see the guidance on how to run tests for a module, or individual tests.
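For example, a single suite can be run through the SBT build (the module and suite names below are only an illustration; see the developer docs for the full options):

```
./build/sbt "sql/testOnly *WholeStageCodegenSuite"
```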
There is also a Kubernetes integration test; see `resource-managers/kubernetes/integration-tests/README.md`.
## A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
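For instance, building against a specific Hadoop version typically looks like the following (the version below is only an example; check the build documentation for the profiles and versions supported by your Spark release):

```
./build/mvn -Pyarn -Dhadoop.version=3.2.0 -DskipTests clean package
```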
## Configuration
Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
## Contributing
Please review the Contribution to Spark guide for information on how to get started contributing to the project.