Commit graph

4402 commits

Author SHA1 Message Date
Patrick Wendell 4ba32678e0 Adding improved error message when multiple assembly jars are present.
This can happen easily when building for different Hadoop versions.
2013-10-25 19:01:15 -07:00
Patrick Wendell af4a529f6e Exclude jopt from kafka dependency.
Kafka uses an older version of jopt that causes bad conflicts with the version
used by spark-perf. It's not easy to remove this downstream because of the way
that spark-perf uses Spark (by including a spark assembly as an unmanaged jar).
This fixes the problem at its source by just never including it.
2013-10-25 09:20:30 -07:00
Patrick Wendell ad5f579cbf Style fixes 2013-10-24 22:18:53 -07:00
Patrick Wendell e5f6d5697b Spacing fix 2013-10-24 22:08:06 -07:00
Patrick Wendell a351fd4aed Small spacing fix 2013-10-24 21:16:30 -07:00
Patrick Wendell 31e92b72e3 Adding Java versions and associated tests 2013-10-24 21:14:56 -07:00
Patrick Wendell 39f6f75588 Some clean-up of tests 2013-10-24 16:43:33 -07:00
Patrick Wendell 9423532fab Removing Java for now 2013-10-24 14:31:34 -07:00
Patrick Wendell 05ac9940ee Adding tests 2013-10-24 14:31:34 -07:00
Patrick Wendell 2fda84fe3f Always use a shuffle 2013-10-24 14:31:34 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. A user wants to increase the number of partitions in the RDD. This
is more common now with streaming, e.g. a user is ingesting data on one
node but wants to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle = true), which is
super confusing.

2. A user has input data where the number of partitions is not known, e.g.

> sc.textFile("some file").coalesce(50)....

This is semantically vague (am I growing or shrinking this RDD?) and it also
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces a shuffle every time, so it will always produce exactly
the requested number of partitions. It also throws an exception rather than
silently misbehaving if bad input is passed.
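
A rough usage sketch of the operator described above (illustrative only):

    // Grow a 4-partition RDD to 16 partitions. Because repartition always
    // shuffles, the result has exactly 16 partitions regardless of the input.
    val small = sc.parallelize(1 to 1000, 4)
    val grown = small.repartition(16)
    assert(grown.partitions.length == 16)

    // Equivalent to the older, confusing spelling:
    val sameThing = small.coalesce(16, shuffle = true)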

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
Matei Zaharia 1dc776b863 Merge pull request #93 from kayousterhout/ui_new_state
Show "GETTING_RESULTS" state in UI.

This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
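
For illustration, a listener consuming the new callback might look like the
sketch below (the method, event, and registration names are assumptions based
on this description, not verified against the patch):

    import org.apache.spark.scheduler._

    // Hypothetical sketch: log whenever a task enters the result-fetching phase.
    class ResultFetchLogger extends SparkListener {
      override def onTaskGettingResult(event: SparkListenerTaskGettingResult) {
        println("Task " + event.taskInfo.taskId + " is remotely fetching its result")
      }
    }

    sc.addSparkListener(new ResultFetchLogger)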
2013-10-23 22:05:52 -07:00
Reynold Xin c4b187d1db Merge pull request #105 from pwendell/doc-fix
Fixing broken links in programming guide

Unfortunately these are broken in 0.8.0.
2013-10-23 21:56:18 -07:00
Patrick Wendell 4e093b88f8 Fixing broken links in programming guide 2013-10-23 21:28:23 -07:00
Kay Ousterhout b45352e373 Clear akka frame size property in tests 2013-10-23 18:23:28 -07:00
Reynold Xin a098438c48 Merge pull request #103 from JoshRosen/unpersist-fix
Add unpersist() to JavaDoubleRDD and JavaPairRDD.

This fixes a minor inconsistency where [unpersist() was only available on JavaRDD](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3CCE8D8748.68C0%25YannLuppo%40livenation.com%3E) and not JavaPairRDD / JavaDoubleRDD.   I also added support for the new optional `blocking` argument added in 0.8.
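
A brief usage sketch in Scala (the core RDD API exposes the same optional
`blocking` flag, so the Java wrappers should behave analogously):

    val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x)).cache()
    pairs.count()                      // materialize the cached blocks
    pairs.unpersist(blocking = false)  // new optional flag: return immediately
                                       // rather than waiting for block removal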

Please merge this into branch-0.8, too.
2013-10-23 18:03:08 -07:00
Kay Ousterhout c42f5d1787 Fixed broken tests 2013-10-23 17:35:01 -07:00
Josh Rosen 210858ac02 Add unpersist() to JavaDoubleRDD and JavaPairRDD.
Also add support for new optional `blocking` argument.
2013-10-23 17:27:01 -07:00
Kay Ousterhout a5f8f54ecd Merge remote-tracking branch 'upstream/master' into ui_new_state
Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
2013-10-23 16:06:28 -07:00
Matei Zaharia dadfc63b03 Fix Maven build to use MQTT repository 2013-10-23 15:29:22 -07:00
Matei Zaharia dd659642e7 Merge pull request #64 from prabeesh/master
MQTT Adapter for Spark Streaming

MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol.
It was designed as an extremely lightweight publish/subscribe messaging transport. You can read more about it at http://mqtt.org/.

Message Queue Telemetry Transport (MQTT) is an open message protocol for M2M communications. It enables the transfer of telemetry-style data, in the form of messages, from devices like sensors and actuators to mobile phones, embedded systems on vehicles, or laptops and full-scale computers.

The protocol was invented by Andy Stanford-Clark of IBM and Arlen Nipper of Cirrus Link Solutions.

This protocol enables a publish/subscribe messaging model in an extremely lightweight way. It is useful for connections with remote locations where code footprint and network bandwidth are constrained.

MQTT is one of the most widely used protocols for the 'Internet of Things'. The protocol is attracting attention as more and more devices get connected to the internet, all producing data. Researchers and companies predict some 25 billion devices will be connected to the internet by 2015.

Plugins/support for MQTT are available in popular message queues like RabbitMQ, ActiveMQ, etc.

Support for MQTT in Spark will help people working on Internet of Things (IoT) projects use Spark Streaming for their real-time data processing needs (from sensors and other embedded devices).
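
As a rough sketch of what consuming an MQTT feed in Spark Streaming might look
like (the MQTTUtils helper and its package path are assumptions borrowed from
later packaging, and the broker URL/topic are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.mqtt.MQTTUtils  // package path assumed

    val ssc = new StreamingContext("local[2]", "MQTTWordCount", Seconds(2))
    // Subscribe to a topic on a local broker (placeholder URL and topic).
    val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "sensor/temp")
    val counts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()
    ssc.start()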
2013-10-23 15:07:59 -07:00
Matei Zaharia 452aa36d67 Merge pull request #97 from ewencp/pyspark-system-properties
Add classmethod to SparkContext to set system properties.

Add a new classmethod to SparkContext to set system properties, as is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
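
For reference, a sketch of the Scala pattern this mirrors (the property value
and app name here are just examples):

    import org.apache.spark.SparkContext

    // In Scala/Java, Spark reads configuration from system properties, which
    // must be set before the SparkContext is constructed.
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext("local[4]", "SystemPropsExample")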
2013-10-22 23:15:33 -07:00
Reynold Xin 9dfcf53a08 Merge pull request #100 from JoshRosen/spark-902
Remove redundant Java Function call() definitions

This should fix [SPARK-902](https://spark-project.atlassian.net/browse/SPARK-902), an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler.

Thanks to @MartinWeindel for diagnosing this problem.

(This PR subsumes #30).
2013-10-22 16:01:42 -07:00
Josh Rosen 768eb9c962 Remove redundant Java Function call() definitions
This should fix SPARK-902, an issue where some
Java API Function classes could cause
AbstractMethodErrors when user code is compiled
using the Eclipse compiler.

Thanks to @MartinWeindel for diagnosing this
problem.

(This PR subsumes / closes #30)
2013-10-22 14:26:52 -07:00
Patrick Wendell 97184de1db Merge pull request #99 from pwendell/master
Use correct formatting for comments in StoragePerfTester
2013-10-22 13:10:14 -07:00
Patrick Wendell ab5ece19a3 Formatting cleanup 2013-10-22 13:03:08 -07:00
Ewen Cheslack-Postava c8748c25eb Add notes to python documentation about using SparkContext.setSystemProperty. 2013-10-22 11:49:52 -07:00
Patrick Wendell c404adb9d2 Merge pull request #90 from pwendell/master
SPARK-940: Do not directly pass Stage objects to SparkListener.

This patch updates the SparkListener interface to pass StageInfo objects rather than Stage objects directly. The reason for this patch is explained in detail in SPARK-940.
2013-10-22 11:30:19 -07:00
Ewen Cheslack-Postava 317a9eb1ce Pass self to SparkContext._ensure_initialized.
The constructor for SparkContext should pass in self so that we track
the current context and produce errors if another one is created. Add
a doctest to make sure creating multiple contexts triggers the
exception.
2013-10-22 11:26:49 -07:00
Patrick Wendell c22046b3cc Minor clean-up in review 2013-10-22 11:00:50 -07:00
Patrick Wendell 7de0ea4d42 Response to code review and adding some more tests 2013-10-22 11:00:50 -07:00
Patrick Wendell 2fa3c4c49c Fix for Spark-870.
This patch fixes a bug where the Spark UI didn't display the correct total number of
tasks when the number of tasks in a Stage didn't equal the number of RDD partitions.

It also cleans up the listener API a bit by embedding this information in the
StageInfo class rather than passing it separately.
2013-10-22 11:00:25 -07:00
Patrick Wendell a854f5bfcf SPARK-940: Do not directly pass Stage objects to SparkListener. 2013-10-22 11:00:06 -07:00
Matei Zaharia aa9019fc82 Merge pull request #98 from aarondav/docs
Docs: Fix links to RDD API documentation
2013-10-22 10:30:02 -07:00
Matei Zaharia a0e08f0fb9 Merge pull request #82 from JoshRosen/map-output-tracker-refactoring
Split MapOutputTracker into Master/Worker classes

Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances.  This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers.

I also renamed a few methods and made others protected/private.
2013-10-22 10:20:43 -07:00
Kay Ousterhout 37b9b4cc11 Shorten GETTING_RESULT to GET_RESULT 2013-10-22 10:05:33 -07:00
Aaron Davidson 962bec97ee Docs: Fix links to RDD API documentation 2013-10-22 09:39:36 -07:00
Ewen Cheslack-Postava 56d230e614 Add classmethod to SparkContext to set system properties.
Add a new classmethod to SparkContext to set system properties, as is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
2013-10-22 00:22:37 -07:00
Matei Zaharia b84193c5b8 Merge pull request #92 from tgravescs/sparkYarnFixClasspath
Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath ...

...to be explicit about inclusion of spark.jar and app.jar. Be explicit so that if there are any packaging conflicts between spark.jar and app.jar we don't get random results due to the classpath having /*, which can include things in a different order.
2013-10-21 23:35:13 -07:00
Matei Zaharia 731c94e91d Merge pull request #56 from jerryshao/kafka-0.8-dev
Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming

Conflicts:
	streaming/pom.xml
2013-10-21 23:31:38 -07:00
Reynold Xin 48952d67e6 Merge pull request #87 from aarondav/shuffle-base
Basic shuffle file consolidation

The Spark shuffle phase can produce a large number of files, as one file is created
per mapper per reducer. For large or repeated jobs, this often produces millions of
shuffle files, which leads to extremely degraded performance from the OS file system.
This patch seeks to reduce that burden by combining multiple shuffle files into one.

This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669.
However, it simplifies the design in order to get the majority of the gain with less
overall intellectual and code burden. The vast majority of code in this pull request
is a refactor to allow the insertion of a clean layer of indirection between logical
block ids and physical files. This, I feel, provides some design clarity in addition
to enabling shuffle file consolidation.

The main goal is to produce one shuffle file per reducer per active mapper thread.
This allows us to isolate the mappers (simplifying the failure modes), while still
reducing the number of shuffle files tremendously for large tasks. In order
to accomplish this, we simply create a new set of shuffle files for every parallel
task, and return the files to a pool which will be given out to the next task that runs.

I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark:

    scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt))
    scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
    scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _, 2000).count) / 1000.0)

For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two.

Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.
2013-10-21 22:45:00 -07:00
Aaron Davidson 053ef949ac Merge ShufflePerfTester patch into shuffle block consolidation 2013-10-21 22:17:53 -07:00
Prabeesh K 9ca1bd9530 Update MQTTWordCount.scala 2013-10-22 09:05:57 +05:30
Reynold Xin a51359c917 Merge pull request #95 from aarondav/perftest
Minor: Put StoragePerfTester in org/apache/
2013-10-21 20:33:29 -07:00
Aaron Davidson 97053c4a91 Put StoragePerfTester in org/apache/ 2013-10-21 20:25:40 -07:00
Prabeesh K dbafa11396 Update MQTTWordCount.scala 2013-10-22 08:50:34 +05:30
Matei Zaharia 39d2e9b293 Merge pull request #94 from aarondav/mesos-fix
Fix mesos urls

This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71.
Previously, we explicitly removed the mesos:// part; with #71, this no longer occurs.
2013-10-21 18:58:48 -07:00
Aaron Davidson 0071f0899c Fix mesos urls
This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71
Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occurred.
2013-10-21 15:56:14 -07:00
Kay Ousterhout 916270f5f3 Show "GETTING_RESULTS" state in UI.
This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
2013-10-21 12:46:57 -07:00
Aaron Davidson 4aa0ba1df7 Remove executorId from Task.run() 2013-10-21 12:19:15 -07:00