Apache Spark - A unified analytics engine for large-scale data processing
Cheng Lian 27daf6bcde [SPARK-17949][SQL] A JVM object based aggregate operator
## What changes were proposed in this pull request?

This PR adds a new hash-based aggregate operator named `ObjectHashAggregateExec` that supports `TypedImperativeAggregate`, which may use arbitrary Java objects as aggregation states. Please refer to the [design doc](https://issues.apache.org/jira/secure/attachment/12834260/%5BDesign%20Doc%5D%20Support%20for%20Arbitrary%20Aggregation%20States.pdf) attached in [SPARK-17949](https://issues.apache.org/jira/browse/SPARK-17949) for more details about it.

The major benefit of this operator is better performance when evaluating `TypedImperativeAggregate` functions, especially when there are relatively few distinct groups. Functions like Hive UDAFs, `collect_list`, and `collect_set` may also benefit from this after being migrated to `TypedImperativeAggregate`.
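
To make "arbitrary objects as aggregation states" more concrete, here is a minimal Python sketch of a `collect_set`-like aggregate whose state is an ordinary object (a Python `set`). It is only an analogy and not Spark's API: the class `CollectSetSketch`, its method names, and the `aggregate` helper are hypothetical, and the create / update / merge / eval life cycle is an assumption made for illustration.

```python
class CollectSetSketch:
    """Hypothetical aggregate whose state is an ordinary object (a Python set)."""

    def create_buffer(self):
        # A fresh, empty aggregation state for one group.
        return set()

    def update(self, buffer, value):
        # Fold one input value into the state; None values are skipped.
        if value is not None:
            buffer.add(value)
        return buffer

    def merge(self, buffer, other):
        # Combine two partial states, e.g. built from different partitions.
        buffer |= other
        return buffer

    def eval(self, buffer):
        # Produce the final result; sorted only to make the output deterministic.
        return sorted(buffer)


def aggregate(rows, agg):
    """Hash-based grouping: a plain dict maps each grouping key to its buffer."""
    buffers = {}
    for key, value in rows:
        if key not in buffers:
            buffers[key] = agg.create_buffer()
        agg.update(buffers[key], value)
    return {key: agg.eval(buf) for key, buf in buffers.items()}


rows = [("a", 1), ("b", 2), ("a", 1), ("a", 3), ("b", None)]
result = aggregate(rows, CollectSetSketch())
print(result)  # {'a': [1, 3], 'b': [2]}
```

The point of the sketch is that the buffer never has to be flattened into fixed-width fields while rows are being folded in; it stays a regular object for the whole aggregation.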

The following feature flag is introduced to enable or disable the new aggregate operator:
- Name: `spark.sql.execution.useObjectHashAggregateExec`
- Default value: `true`

We can also configure the fallback threshold using the following SQL option:
- Name: `spark.sql.objectHashAggregate.sortBased.fallbackThreshold`
- Default value: 128

  Fall back to sort-based aggregation when more than 128 distinct groups are accumulated in the aggregation hash map. This number is intentionally made small to avoid GC problems since aggregation buffers of this operator may contain arbitrary Java objects.

  This may be improved by implementing size tracking for this operator, but that can be done in a separate PR.

Code generation and size tracking are planned to be implemented in follow-up PRs.
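
The fallback rule described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the operator's implementation: the function `hash_aggregate_with_fallback` is hypothetical, it uses a simple count as the aggregate, and it does not reproduce how Spark hands the hash map's contents over to the sort-based path. It only shows the trigger: once a new group would push the map past the threshold, the remaining rows are sorted by key and aggregated group by group.

```python
from itertools import groupby
from operator import itemgetter

FALLBACK_THRESHOLD = 128  # mirrors the default value quoted above


def hash_aggregate_with_fallback(rows, threshold=FALLBACK_THRESHOLD):
    """Count rows per key; returns (counts, fell_back).

    Hash-based until more than `threshold` distinct groups would be held,
    then sort-based for the rows that have not been consumed yet.
    """
    counts = {}
    rows = iter(rows)
    for key, value in rows:
        if key not in counts and len(counts) >= threshold:
            # Too many distinct groups: stop growing the hash map.
            # Sort the current row plus all unread rows by grouping key ...
            remaining = sorted([(key, value), *rows], key=itemgetter(0))
            # ... and aggregate each run of equal keys sequentially.
            for k, group in groupby(remaining, key=itemgetter(0)):
                counts[k] = counts.get(k, 0) + sum(1 for _ in group)
            return counts, True
        counts[key] = counts.get(key, 0) + 1
    return counts, False


rows = [(i % 5, i) for i in range(20)]            # 5 distinct groups, 4 rows each
small = hash_aggregate_with_fallback(rows, threshold=3)
default = hash_aggregate_with_fallback(rows)
print(small)    # ({0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, True)
print(default)  # ({0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, False)
```

Both paths produce the same counts; only the way they are computed differs, which is why the threshold is a performance knob rather than a correctness one.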
## Benchmark results
### `ObjectHashAggregateExec` vs `SortAggregateExec`

The first benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating `typed_count`, a testing `TypedImperativeAggregate` version of the SQL `count` function.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
```
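
As a reading aid for the table above, the following sketch recomputes the `Relative` column. It assumes that `Relative` is the best time of the first row (`sort agg w/ group by`) divided by the best time of each row; that definition is not stated in the output itself, but it reproduces the printed values.

```python
# Best times in milliseconds, copied from the benchmark table above.
best_ms = {
    "sort agg w/ group by": 31251,
    "object agg w/ group by w/o fallback": 6903,
    "object agg w/ group by w/ fallback": 20945,
    "sort agg w/o group by": 4734,
    "object agg w/o group by w/o fallback": 4310,
}

# Assumption: the first row is the baseline for the Relative column.
baseline = best_ms["sort agg w/ group by"]
relative = {name: round(baseline / ms, 1) for name, ms in best_ms.items()}

for name, speedup in relative.items():
    print(f"{name:<40}{speedup:>5.1f}X")
```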

The next benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating the Spark native version of `percentile_approx`.

Note that `percentile_approx` is such a heavy aggregate function that, because I couldn't run a large-scale benchmark on my laptop, the bottleneck of this benchmark is evaluating the aggregate function itself rather than the aggregate operator. That's why the results are so close and look counter-intuitive (aggregation with grouping is even faster than aggregation without grouping).

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
sort agg w/ group by                          3418 / 3530          0.6        1630.0       1.0X
object agg w/ group by w/o fallback           3210 / 3314          0.7        1530.7       1.1X
object agg w/ group by w/ fallback            3419 / 3511          0.6        1630.1       1.0X
sort agg w/o group by                         4336 / 4499          0.5        2067.3       0.8X
object agg w/o group by w/o fallback          4271 / 4372          0.5        2036.7       0.8X
```
### Hive UDAF vs Spark AF

This benchmark compares the following two kinds of aggregate functions:
- "hive udaf": Hive implementation of `percentile_approx`, without partial aggregation support, evaluated using `SortAggregateExec`.
- "spark af": Spark native implementation of `percentile_approx`, with partial aggregation support, evaluated using `ObjectHashAggregateExec`.

The performance differences are mostly due to the faster implementation and the partial aggregation support in the Spark native version of `percentile_approx`.

This benchmark basically shows the performance differences between the worst case, where an aggregate function without partial aggregation support is evaluated using `SortAggregateExec`, and the best case, where a `TypedImperativeAggregate` with partial aggregation support is evaluated using `ObjectHashAggregateExec`.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
```
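
To illustrate why partial aggregation support matters, here is a small Python sketch. It is a simplification under stated assumptions: a mean with a `(sum, count)` state stands in for `percentile_approx` (whose real state is a quantile summary), the function names are hypothetical, and "shuffled" simply means the number of items that have to be brought together for the final step.

```python
def final_only(partitions):
    """No partial aggregation: every row is moved, then aggregated once."""
    shuffled = [value for part in partitions for value in part]
    return sum(shuffled) / len(shuffled), len(shuffled)


def with_partial(partitions):
    """Partial aggregation: each partition ships one small (sum, count) state."""
    partials = [(sum(part), len(part)) for part in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, len(partials)


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(final_only(partitions))    # (3.5, 6): six rows moved
print(with_partial(partitions))  # (3.5, 3): three partial states moved
```

Both variants return the same mean, but the second one moves one state per partition instead of one item per row.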
### Real world benchmark

We also did a relatively large benchmark using a real world query involving `percentile_approx`:
- Hive UDAF implementation, sort-based aggregation, w/o partial aggregation support

  24.77 minutes
- Native implementation, sort-based aggregation, w/ partial aggregation support

  4.64 minutes
- Native implementation, object hash aggregator, w/ partial aggregation support

  1.80 minutes
## How was this patch tested?

New unit tests and randomized test cases are added in `ObjectAggregateFunctionSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #15590 from liancheng/obj-hash-agg.
2016-11-03 09:34:51 -07:00
.github [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE 2016-10-12 11:14:03 -07:00
assembly [SPARK-16967] move mesos to module 2016-08-26 12:25:22 -07:00
bin [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] 2016-10-21 09:48:24 +01:00
build [SPARK-14279][BUILD] Pick the spark version from pom 2016-06-06 09:42:50 -07:00
common [SPARK-17800] Introduce InterfaceStability annotation 2016-10-07 10:24:42 -07:00
conf [SPARK-11653][DEPLOY] Allow spark-daemon.sh to run in the foreground 2016-10-20 09:49:58 +01:00
core [SPARK-18219] Move commit protocol API (internal) from sql/core to core module 2016-11-03 02:42:48 -07:00
data [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs 2016-08-05 20:57:46 +01:00
dev [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] 2016-10-21 09:48:24 +01:00
docs [SPARK-18198][DOC][STREAMING] Highlight code snippets 2016-11-02 09:10:34 +00:00
examples [MINOR] Use <= for clarity in Pi examples' Monte Carlo process 2016-11-02 09:09:16 +00:00
external [SPARK-17813][SQL][KAFKA] Maximum data per trigger 2016-10-27 10:30:59 -07:00
graphx [SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank 2016-09-10 00:15:59 -07:00
launcher [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf 2016-08-31 00:20:41 -07:00
licenses [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright 2016-06-04 21:41:27 +01:00
mesos [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US 2016-11-02 09:39:15 +00:00
mllib [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US 2016-11-02 09:39:15 +00:00
mllib-local [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet 2016-10-24 23:47:59 -07:00
project [SPARK-18104][DOC] Don't build KafkaSource doc 2016-10-26 11:16:20 -07:00
python [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier 2016-11-03 07:45:20 -07:00
R [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table 2016-11-02 18:05:14 -07:00
repl [SPARK-18189][SQL] Fix serialization issue in KeyValueGroupedDataset 2016-11-01 11:18:42 -07:00
sbin [SPARK-17944][DEPLOY] sbin/start-* scripts use of hostname -f fail with Solaris 2016-10-22 09:37:53 +01:00
sql [SPARK-17949][SQL] A JVM object based aggregate operator 2016-11-03 09:34:51 -07:00
streaming [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US 2016-11-02 09:39:15 +00:00
tools [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent 2016-07-19 11:59:46 +01:00
yarn [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode 2016-11-02 11:47:45 -07:00
.gitattributes [SPARK-3870] EOL character enforcement 2014-10-31 12:39:52 -07:00
.gitignore [MINOR][SPARKR] Add sparkr-vignettes.html to gitignore. 2016-09-24 01:03:11 -07:00
.travis.yml [SPARK-16967] move mesos to module 2016-08-26 12:25:22 -07:00
appveyor.yml [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only) 2016-09-08 08:26:59 -07:00
CONTRIBUTING.md [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages 2016-09-14 10:10:16 +01:00
LICENSE [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] 2016-10-21 09:48:24 +01:00
NOTICE [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright 2016-06-04 21:41:27 +01:00
pom.xml [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test against staging artifacts 2016-11-02 11:52:29 -07:00
README.md [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE 2016-10-12 11:14:03 -07:00
scalastyle-config.xml [SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL 2016-10-26 10:36:36 -07:00

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the -T option with Maven; see "Parallel builds in Maven 3". More detailed documentation is available from the project site, at "Building Spark". For developing Spark using an IDE, see Eclipse and IntelliJ.

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()

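Both shell commands count 1000 elements: Scala's `1 to 1000` is inclusive (1 through 1000), while Python's `range(1000)` runs from 0 through 999. The plain-Python sketch below mimics the idea of counting a collection that has been split into partitions. It does not use Spark, and the helper names and the slicing scheme are assumptions made for illustration only.

```python
def partition(data, num_partitions):
    """Split `data` into `num_partitions` contiguous slices of near-equal size."""
    items = list(data)
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


def distributed_count(data, num_partitions=4):
    # Count each partition independently, then add up the per-partition counts.
    return sum(len(part) for part in partition(data, num_partitions))


print(distributed_count(range(1000)))     # 1000 (elements 0..999)
print(distributed_count(range(1, 1001)))  # 1000 (elements 1..1000, like `1 to 1000`)
```
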
Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi
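
The forms of MASTER listed above can be summarized with a small Python helper. This is a hypothetical illustration, not part of Spark: `describe_master` is an invented name, and it recognizes only the forms mentioned in this README (Spark itself may accept others).

```python
import re


def describe_master(master):
    """Return (kind, threads) for a MASTER value; threads is None for clusters."""
    if master.startswith("spark://") or master.startswith("mesos://"):
        # Cluster URL: the scheme names the kind of cluster.
        return (master.split("://", 1)[0], None)
    if master == "yarn":
        return ("yarn", None)
    if master == "local":
        return ("local", 1)  # run locally with one thread
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return ("local", int(match.group(1)))  # run locally with N threads
    raise ValueError(f"unrecognized MASTER value: {master!r}")


for value in ["spark://host:7077", "mesos://host:5050", "yarn", "local", "local[4]"]:
    print(value, "->", describe_master(value))
```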

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the Contribution to Spark wiki for information on how to get started contributing to the project.