ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Tathagata Das	bacfe5ebca	Added JavaStreamingContext.transform	2013-10-24 10:56:24 -07:00
Matei Zaharia	1dc776b863	Merge pull request #93 from kayousterhout/ui_new_state Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-23 22:05:52 -07:00
Reynold Xin	c4b187d1db	Merge pull request #105 from pwendell/doc-fix Fixing broken links in programming guide Unfortunately these are broken in 0.8.0.	2013-10-23 21:56:18 -07:00
Patrick Wendell	4e093b88f8	Fixing broken links in programming guide	2013-10-23 21:28:23 -07:00
Kay Ousterhout	b45352e373	Clear akka frame size property in tests	2013-10-23 18:23:28 -07:00
Reynold Xin	a098438c48	Merge pull request #103 from JoshRosen/unpersist-fix Add unpersist() to JavaDoubleRDD and JavaPairRDD. This fixes a minor inconsistency where [unpersist() was only available on JavaRDD](https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3CCE8D8748.68C0%25YannLuppo%40livenation.com%3E) and not JavaPairRDD / JavaDoubleRDD. I also added support for the new optional `blocking` argument added in 0.8. Please merge this into branch-0.8, too.	2013-10-23 18:03:08 -07:00
Kay Ousterhout	c42f5d1787	Fixed broken tests	2013-10-23 17:35:01 -07:00
Josh Rosen	210858ac02	Add unpersist() to JavaDoubleRDD and JavaPairRDD. Also add support for new optional `blocking` argument.	2013-10-23 17:27:01 -07:00
Kay Ousterhout	a5f8f54ecd	Merge remote-tracking branch 'upstream/master' into ui_new_state Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala	2013-10-23 16:06:28 -07:00
Matei Zaharia	dadfc63b03	Fix Maven build to use MQTT repository	2013-10-23 15:29:22 -07:00
Matei Zaharia	dd659642e7	Merge pull request #64 from prabeesh/master MQTT Adapter for Spark Streaming MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. You may read more about it here http://mqtt.org/ Message Queue Telemetry Transport (MQTT) is an open message protocol for M2M communications. It enables the transfer of telemetry-style data in the form of messages from devices like sensors and actuators, to mobile phones, embedded systems on vehicles, or laptops and full scale computers. The protocol was invented by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions This protocol enables a publish/subscribe messaging model in an extremely lightweight way. It is useful for connections with remote locations where line of code and network bandwidth is a constraint. MQTT is one of the widely used protocol for 'Internet of Things'. This protocol is getting much attraction as anything and everything is getting connected to internet and they all produce data. Researchers and companies predict some 25 billion devices will be connected to the internet by 2015. Plugin/Support for MQTT is available in popular MQs like RabbitMQ, ActiveMQ etc. Support for MQTT in Spark will help people with Internet of Things (IoT) projects to use Spark Streaming for their real time data processing needs (from sensors and other embedded devices etc).	2013-10-23 15:07:59 -07:00
Tathagata Das	9fccb17a5f	Removed Function3.call() based on Josh's comment.	2013-10-23 12:07:07 -07:00
Tathagata Das	fe8626efd1	Merge branch 'apache-master' into transform	2013-10-22 23:40:40 -07:00
Tathagata Das	72d2e1dd77	Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs.	2013-10-22 23:35:51 -07:00
Matei Zaharia	452aa36d67	Merge pull request #97 from ewencp/pyspark-system-properties Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 23:15:33 -07:00
Reynold Xin	9dfcf53a08	Merge pull request #100 from JoshRosen/spark-902 Remove redundant Java Function call() definitions This should fix [SPARK-902](https://spark-project.atlassian.net/browse/SPARK-902), an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes #30).	2013-10-22 16:01:42 -07:00
Josh Rosen	768eb9c962	Remove redundant Java Function call() definitions This should fix SPARK-902, an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler. Thanks to @MartinWeindel for diagnosing this problem. (This PR subsumes / closes #30)	2013-10-22 14:26:52 -07:00
Patrick Wendell	97184de1db	Merge pull request #99 from pwendell/master Use correct formatting for comments in StoragePerfTester	2013-10-22 13:10:14 -07:00
Patrick Wendell	ab5ece19a3	Formatting cleanup	2013-10-22 13:03:08 -07:00
Ewen Cheslack-Postava	c8748c25eb	Add notes to python documentation about using SparkContext.setSystemProperty.	2013-10-22 11:49:52 -07:00
Patrick Wendell	c404adb9d2	Merge pull request #90 from pwendell/master SPARK-940: Do not directly pass Stage objects to SparkListener. This patch updates the SparkListener interface to pass StageInfo objects rather than directly pass spark Stages. The reason for this patch is explained in detail in SPARK-940.	2013-10-22 11:30:19 -07:00
Ewen Cheslack-Postava	317a9eb1ce	Pass self to SparkContext._ensure_initialized. The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.	2013-10-22 11:26:49 -07:00
Patrick Wendell	c22046b3cc	Minor clean-up in review	2013-10-22 11:00:50 -07:00
Patrick Wendell	7de0ea4d42	Response to code review and adding some more tests	2013-10-22 11:00:50 -07:00
Patrick Wendell	2fa3c4c49c	Fix for Spark-870. This patch fixes a bug where the Spark UI didn't display the correct number of total tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions. It also cleans up the listener API a bit by embedding this information in the StageInfo class rather than passing it seperately.	2013-10-22 11:00:25 -07:00
Patrick Wendell	a854f5bfcf	SPARK-940: Do not directly pass Stage objects to SparkListener.	2013-10-22 11:00:06 -07:00
Matei Zaharia	aa9019fc82	Merge pull request #98 from aarondav/docs Docs: Fix links to RDD API documentation	2013-10-22 10:30:02 -07:00
Matei Zaharia	a0e08f0fb9	Merge pull request #82 from JoshRosen/map-output-tracker-refactoring Split MapOutputTracker into Master/Worker classes Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers. I also renamed a few methods and made others protected/private.	2013-10-22 10:20:43 -07:00
Kay Ousterhout	37b9b4cc11	Shorten GETTING_RESULT to GET_RESULT	2013-10-22 10:05:33 -07:00
Aaron Davidson	962bec97ee	Docs: Fix links to RDD API documentation	2013-10-22 09:39:36 -07:00
Ewen Cheslack-Postava	56d230e614	Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 00:22:37 -07:00
Matei Zaharia	b84193c5b8	Merge pull request #92 from tgravescs/sparkYarnFixClasspath Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath ... ...to be explicit about inclusion of spark.jar and app.jar. Be explicit so if there are any conflicts in packaging between spark.jar and app.jar we don't get random results due to the classpath having /*, which can including things in different order.	2013-10-21 23:35:13 -07:00
Matei Zaharia	731c94e91d	Merge pull request #56 from jerryshao/kafka-0.8-dev Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming Conflicts: streaming/pom.xml	2013-10-21 23:31:38 -07:00
Reynold Xin	48952d67e6	Merge pull request #87 from aarondav/shuffle-base Basic shuffle file consolidation The Spark shuffle phase can produce a large number of files, as one file is created per mapper per reducer. For large or repeated jobs, this often produces millions of shuffle files, which sees extremely degredaded performance from the OS file system. This patch seeks to reduce that burden by combining multipe shuffle files into one. This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669. However, it simplifies the design in order to get the majority of the gain with less overall intellectual and code burden. The vast majority of code in this pull request is a refactor to allow the insertion of a clean layer of indirection between logical block ids and physical files. This, I feel, provides some design clarity in addition to enabling shuffle file consolidation. The main goal is to produce one shuffle file per reducer per active mapper thread. This allows us to isolate the mappers (simplifying the failure modes), while still allowing us to reduce the number of mappers tremendously for large tasks. In order to accomplish this, we simply create a new set of shuffle files for every parallel task, and return the files to a pool which will be given out to the next run task. I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark: scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt)) scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now } scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, 2000, x)).reduceByKey(_ + _).count) / 1000.0) For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two. Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.	2013-10-21 22:45:00 -07:00
Aaron Davidson	053ef949ac	Merge ShufflePerfTester patch into shuffle block consolidation	2013-10-21 22:17:53 -07:00
Prabeesh K	9ca1bd9530	Update MQTTWordCount.scala	2013-10-22 09:05:57 +05:30
Reynold Xin	a51359c917	Merge pull request #95 from aarondav/perftest Minor: Put StoragePerfTester in org/apache/	2013-10-21 20:33:29 -07:00
Aaron Davidson	97053c4a91	Put StoragePerfTester in org/apache/	2013-10-21 20:25:40 -07:00
Prabeesh K	dbafa11396	Update MQTTWordCount.scala	2013-10-22 08:50:34 +05:30
Matei Zaharia	39d2e9b293	Merge pull request #94 from aarondav/mesos-fix Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71. Previously, we explicitly removed the mesos:// part; with #71, this no longer occurs.	2013-10-21 18:58:48 -07:00
Aaron Davidson	0071f0899c	Fix mesos urls This was a bug I introduced in https://github.com/apache/incubator-spark/pull/71 Previously, we explicitly removed the mesos:// part; with PR 71, this no longer occured.	2013-10-21 15:56:14 -07:00
Kay Ousterhout	916270f5f3	Show "GETTING_RESULTS" state in UI. This commit adds a set of calls using the SparkListener interface that indicate when a task is remotely fetching results, so that we can display this (potentially time-consuming) phase of execution to users through the UI.	2013-10-21 12:46:57 -07:00
Aaron Davidson	4aa0ba1df7	Remove executorId from Task.run()	2013-10-21 12:19:15 -07:00
tgravescs	b6571541a6	Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath to be explicit about inclusion of spark.jar and app.jar	2013-10-21 14:05:15 -05:00
Patrick Wendell	aa61bfd399	Merge pull request #88 from rxin/clean Made the following traits/interfaces/classes non-public: Made the following traits/interfaces/classes non-public: SparkHadoopWriter SparkHadoopMapRedUtil SparkHadoopMapReduceUtil SparkHadoopUtil PythonAccumulatorParam BlockManagerSlaveActor	2013-10-21 11:57:05 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Aaron Davidson	444162afe7	Documentation update	2013-10-20 22:59:45 -07:00
Aaron Davidson	947fceaa73	Close shuffle writers during failure & remove executorId from TaskContext	2013-10-20 22:47:10 -07:00
Patrick Wendell	35886f3474	Merge pull request #41 from pwendell/shuffle-benchmark Provide Instrumentation for Shuffle Write Performance Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are: 1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes. 2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache. 3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs. I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.	2013-10-20 22:20:32 -07:00
Reynold Xin	5b9380e017	Merge pull request #89 from rxin/executor Don't setup the uncaught exception handler in local mode. This avoids unit test failures for Spark streaming. java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108) at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41) at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66) at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)	2013-10-20 21:03:51 -07:00

1 2 3 4 5 ...

4446 commits