Apache Spark - A unified analytics engine for large-scale data processing
Nicholas Hwang a803ac3e06 [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.

Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, unexpected behavior can occur.

This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
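
Concretely, the pre-fix implementation follows roughly this pattern (a simplified sketch for illustration, not the exact PySpark source):

```
# Simplified sketch of the problematic pattern: the mapPartitions() closure
# and the fold() call reference the same zeroValue object, so an in-place
# mutation by seqOp also changes the zero element fold() uses for that
# partition.
def aggregate(self, zeroValue, seqOp, combOp):
    def func(iterator):
        acc = zeroValue
        for obj in iterator:
            acc = seqOp(acc, obj)   # may mutate zeroValue in place
        yield acc
    return self.mapPartitions(func).fold(zeroValue, combOp)
```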

As an illustrative example, submit the following to `spark-submit`:
```
from pyspark import SparkConf, SparkContext
import collections

def updateCounter(acc, val):
    # seqOp: mutates the accumulator in place and returns it
    print 'update acc:', acc
    print 'update val:', val
    acc[val] += 1
    return acc

def comboCounter(acc1, acc2):
    # combOp: merges the second accumulator into the first
    print 'combo acc1:', acc1
    print 'combo acc2:', acc2
    acc1.update(acc2)
    return acc1

def main():
    conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
    sc = SparkContext(conf=conf)

    print '======= AGGREGATING with ONE PARTITION ======='
    print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)

    print '======= AGGREGATING with TWO PARTITIONS ======='
    print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)

if __name__ == "__main__":
    main()
```

One probably expects this to output the following:
```
Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
```

But it instead outputs this (regardless of the number of partitions):
```
Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
```

This is because (I believe) `zeroValue` is passed correctly to each partition, but after `RDD.mapPartitions()` completes, the now-mutated `zeroValue` object is handed to `RDD.fold()`. As a result, all items are double-counted within each partition before the partition results are finally reduced at the calling node.

I realize that this type of calculation is typically done with `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods (namely `RDD.aggregateByKey()` and `RDD.foldByKey()`) already use this `deepcopy` approach to create unique copies of `zeroValue`, and that the Scala implementations do serialize the `zeroValue` object appropriately to prevent this type of behavior.
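
The fix boils down to giving each partition its own deep copy of `zeroValue`, collecting one accumulator per partition, and combining them with a plain `reduce()` on the driver instead of `fold()`. Below is a minimal sketch of that approach, written as a hypothetical standalone helper (`aggregate_safely`) rather than the actual patched `RDD.aggregate()`:

```
import copy
from functools import reduce

# Sketch only: a standalone helper illustrating the reduce(mapPartitions())
# approach; the real fix lives inside pyspark's RDD.aggregate().
def aggregate_safely(rdd, zeroValue, seqOp, combOp):
    def aggregate_partition(iterator):
        acc = copy.deepcopy(zeroValue)   # fresh copy for this partition
        for value in iterator:
            acc = seqOp(acc, value)
        yield acc

    # one accumulator per partition, combined once on the driver
    partials = rdd.mapPartitions(aggregate_partition).collect()
    return reduce(combOp, partials, copy.deepcopy(zeroValue))
```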

Author: Nicholas Hwang <moogling@gmail.com>

Closes #7378 from njhwang/master and squashes the following commits:

659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
391de4a [Nicholas Hwang] Removed use of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1.
90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
2015-07-19 10:30:28 -07:00
| Path | Latest commit | Committed |
| --- | --- | --- |
| assembly | [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 | 2015-06-03 10:11:27 -07:00 |
| bagel | [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 | 2015-06-03 10:11:27 -07:00 |
| bin | [SPARK-7733] [CORE] [BUILD] Update build, code to use Java 7 for 1.5.0+ | 2015-06-07 20:18:13 +01:00 |
| build | [SPARK-8933] [BUILD] Provide a --force flag to build/mvn that always uses downloaded maven | 2015-07-14 11:43:26 -07:00 |
| conf | [SPARK-3071] Increase default driver memory | 2015-07-01 23:11:02 -07:00 |
| core | [SPARK-9171][SQL] add and improve tests for nondeterministic expressions | 2015-07-18 11:58:53 -07:00 |
| data/mllib | [MLLIB] [DOC] Seed fix in mllib naive bayes example | 2015-07-18 10:12:48 -07:00 |
| dev | [SPARK-9179] [BUILD] Allows committers to specify primary author of the PR to be merged | 2015-07-19 17:37:25 +08:00 |
| docker | [SPARK-8954] [BUILD] Remove unneeded deb repository from Dockerfile to fix build error in docker. | 2015-07-13 12:01:23 -07:00 |
| docs | [SPARK-6284] [MESOS] Add mesos role, principal and secret | 2015-07-16 19:37:15 -07:00 |
| ec2 | [SPARK-8596] Add module for rstudio link to spark | 2015-07-13 08:15:54 -07:00 |
| examples | [SPARK-7977] [BUILD] Disallowing println | 2015-07-10 11:34:01 +01:00 |
| external | [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses | 2015-07-14 16:08:17 -07:00 |
| extras | [SPARK-9030] [STREAMING] Add Kinesis.createStream unit tests that actual sends data | 2015-07-17 16:43:18 -07:00 |
| graphx | [SPARK-9109] [GRAPHX] Keep the cached edge in the graph | 2015-07-17 12:11:32 -07:00 |
| launcher | [SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc | 2015-07-14 00:32:29 -07:00 |
| mllib | [SPARK-9118] [ML] Implement IntArrayParam in mllib | 2015-07-17 20:02:05 -07:00 |
| network | [SPARK-3071] Increase default driver memory | 2015-07-01 23:11:02 -07:00 |
| project | [SPARK-8278] Remove non-streaming JSON reader. | 2015-07-18 20:27:55 -07:00 |
| python | [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold() | 2015-07-19 10:30:28 -07:00 |
| R | [SPARK-9093] [SPARKR] Fix single-quotes strings in SparkR | 2015-07-17 17:00:50 +09:00 |
| repl | [SPARK-9015] [BUILD] Clean project import in scala ide | 2015-07-16 18:42:41 +01:00 |
| sbin | [SPARK-5412] [DEPLOY] Cannot bind Master to a specific hostname as per the documentation | 2015-05-15 11:30:19 -07:00 |
| sbt | Adde LICENSE Header to build/mvn, build/sbt and sbt/sbt | 2014-12-29 10:48:53 -08:00 |
| sql | [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506 | 2015-07-19 18:58:19 +08:00 |
| streaming | [SPARK-5681] [STREAMING] Move 'stopReceivers' to the event loop to resolve the race condition | 2015-07-17 14:00:31 -07:00 |
| tools | [SPARK-9015] [BUILD] Clean project import in scala ide | 2015-07-16 18:42:41 +01:00 |
| unsafe | [SPARK-8240][SQL] string function: concat | 2015-07-18 14:07:56 -07:00 |
| yarn | [SPARK-8851] [YARN] In Client mode, make sure the client logs in and updates tokens | 2015-07-17 09:38:08 -05:00 |
| .gitattributes | [SPARK-3870] EOL character enforcement | 2014-10-31 12:39:52 -07:00 |
| .gitignore | [SPARK-8495] [SPARKR] Add a .lintr file to validate the SparkR files and the lint-r script | 2015-06-20 16:10:14 -07:00 |
| .rat-excludes | [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility | 2015-07-08 15:51:01 -07:00 |
| CONTRIBUTING.md | [SPARK-6889] [DOCS] CONTRIBUTING.md updates to accompany contribution doc updates | 2015-04-21 22:34:31 -07:00 |
| LICENSE | [SPARK-8709] Exclude hadoop-client's mockito-all dependency | 2015-06-29 14:07:55 -07:00 |
| make-distribution.sh | [SPARK-6797] [SPARKR] Add support for YARN cluster mode. | 2015-07-13 08:21:47 -07:00 |
| NOTICE | SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info | 2014-05-14 09:38:33 -07:00 |
| pom.xml | [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 | 2015-07-19 09:14:55 +01:00 |
| pylintrc | [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark | 2015-07-15 08:25:53 -07:00 |
| README.md | Update README to include DataFrames and zinc. | 2015-05-31 23:55:45 -07:00 |
| scalastyle-config.xml | [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses | 2015-07-14 16:08:17 -07:00 |
| tox.ini | [SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python | 2015-05-10 19:18:32 -07:00 |

# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

## Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

## Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

```
build/mvn -DskipTests clean package
```

(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at "Building Spark".

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

```
./bin/spark-shell
```

Try the following command, which should return 1000:

```
scala> sc.parallelize(1 to 1000).count()
```

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

```
./bin/pyspark
```

And run the following command, which should also return 1000:

```
>>> sc.parallelize(range(1000)).count()
```

## Example Programs

Spark also comes with several sample programs in the `examples` directory. To run one of them, use `./bin/run-example <class> [params]`. For example:

```
./bin/run-example SparkPi
```

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a `mesos://` or `spark://` URL, `yarn-cluster` or `yarn-client` to run on YARN, `local` to run locally with one thread, or `local[N]` to run locally with N threads. You can also use an abbreviated class name if the class is in the `examples` package. For instance:

```
MASTER=spark://host:7077 ./bin/run-example SparkPi
```

Many of the example programs print usage help if no params are given.

## Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

```
./dev/run-tests
```

Please see the guidance on how to run tests for a module, or individual tests.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions. See also "Third Party Hadoop Distributions" for guidance on building a Spark application that works with a particular distribution.
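
For instance, a build against a specific Hadoop version typically combines a Hadoop profile with `-Dhadoop.version`; the profile and version below are illustrative only, so use the values listed in the build documentation for your distribution:

```
# Example only: match the profile and version to your Hadoop distribution
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
```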

## Configuration

Please refer to the Configuration guide in the online documentation for an overview on how to configure Spark.