ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Tathagata Das	c5921e5c61	Fixed bugs.	2014-01-12 01:12:08 -08:00
Tathagata Das	18f4889d96	Merge remote-tracking branch 'apache/master' into error-handling	2014-01-11 23:40:57 -08:00
Tathagata Das	4d9b0ab420	Added waitForStop and stop to JavaStreamingContext.	2014-01-11 23:35:51 -08:00
Tathagata Das	f5108ffc24	Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.	2014-01-11 23:15:09 -08:00
Patrick Wendell	22d4d62420	Revert "Fix one unit test that was not setting spark.cleaner.ttl" This reverts commit `942c80b34c`.	2014-01-11 16:07:03 -08:00
Matei Zaharia	1d7bef0c91	Merge pull request #381 from mateiz/default-ttl Fix default TTL for metadata cleaner It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.	2014-01-10 18:53:03 -08:00
Matei Zaharia	942c80b34c	Fix one unit test that was not setting spark.cleaner.ttl	2014-01-10 16:32:36 -08:00
Patrick Wendell	f26553102c	Merge pull request #383 from tdas/driver-test API for automatic driver recovery for streaming programs and other bug fixes 1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext See the RecoverableNetworkWordCount below as an example of how to use it. 2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to checkpoint. 3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery. 4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records that were last accessed before a threshold timestamp). 5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared. This PR is not entirely final as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to unit test. Edit: Java example to be added later, unit test added.	2014-01-10 16:25:44 -08:00
Patrick Wendell	d37408f39c	Merge pull request #377 from andrewor14/master External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.	2014-01-10 16:25:01 -08:00
Tathagata Das	4f39e79c23	Merge remote-tracking branch 'apache/master' into driver-test Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala	2014-01-10 15:47:01 -08:00
Tathagata Das	82f07deeda	Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate.	2014-01-10 15:37:05 -08:00
Tathagata Das	e4bb845238	Updated docs based on Patrick's comments in PR 383.	2014-01-10 12:17:09 -08:00
Tathagata Das	2213a5a47f	Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test	2014-01-10 05:06:22 -08:00
Tathagata Das	740730a179	Fixed conf/slaves and updated docs.	2014-01-10 05:06:15 -08:00
Tathagata Das	4f609f7901	Removed spark.hostPort and other setting from SparkConf before saving to checkpoint.	2014-01-10 12:58:07 +00:00
Tathagata Das	d7ec73ac76	Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test	2014-01-10 11:44:17 +00:00
Tathagata Das	9d3d9c8251	Refactored graph checkpoint file reading and writing code to make it cleaner and easily debuggable.	2014-01-10 11:44:02 +00:00
Patrick Wendell	997c830e0b	Merge pull request #363 from pwendell/streaming-logs Set default logging to WARN for Spark streaming examples. This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.	2014-01-09 22:22:20 -08:00
Andrew Or	372a533a6c	Fix wonky imports from merge	2014-01-09 21:47:49 -08:00
Andrew Or	d76e1f90a8	Merge github.com:apache/incubator-spark Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java	2014-01-09 21:38:48 -08:00
Tathagata Das	38d75e18fa	Merge remote-tracking branch 'apache/master' into driver-test	2014-01-09 19:31:36 -08:00
Tathagata Das	4a5558ca99	Fixed bugs in reading of checkpoints.	2014-01-10 03:28:39 +00:00
Tathagata Das	f1d206c6b4	Merge branch 'standalone-driver' into driver-test Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2014-01-09 15:06:24 -08:00
Tathagata Das	6f713e2a3e	Changed the way StreamingContext finds and reads checkpoint files, and added JavaStreamingContext.getOrCreate.	2014-01-09 13:42:04 -08:00
Patrick Wendell	35f80da21a	Set default logging to WARN for Spark streaming examples. This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.	2014-01-09 10:42:58 -08:00
Matei Zaharia	a01f3401e3	Use typed getters for configuration settings	2014-01-09 00:07:29 -08:00
Tathagata Das	a17cc602ac	More bug fixes.	2014-01-08 04:12:05 -08:00
Tathagata Das	0b7a132d03	Modified checkpoint file clearing policy.	2014-01-08 03:22:06 -08:00
Tathagata Das	3b4c4c7f4d	Merge remote-tracking branch 'apache/master' into project-refactor Conflicts: examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala	2014-01-06 03:05:52 -08:00
Tathagata Das	ac1f4b06c1	Added a hashmap to cache file mod times.	2014-01-05 23:42:53 -08:00
Tathagata Das	2394794591	Merge branch 'filestream-fix' into driver-test Conflicts: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala	2014-01-06 02:23:53 +00:00
Tathagata Das	8e88db3ca5	Bug fixes to the DriverRunner and minor changes here and there.	2014-01-06 02:21:56 +00:00
Patrick Wendell	79f52809c8	Removing SPARK_EXAMPLES_JAR in the code	2014-01-05 11:49:42 -08:00
Andrew Or	df413e996f	Merge remote-tracking branch 'spark/master' Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala	2014-01-02 20:51:23 -08:00
Tathagata Das	a1b8dd53e3	Added StreamingContext.getOrCreate for automatic recovery, and added RecoverableNetworkWordCount example to use it.	2014-01-02 19:07:22 -08:00
Patrick Wendell	588a1695f4	Merge pull request #297 from tdas/window-improvement Improvements to DStream window ops and refactoring of Spark's CheckpointSuite - Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located. - Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads. - Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary. - Added mapSideCombine option to combineByKeyAndWindow.	2014-01-02 13:20:54 -08:00
Matei Zaharia	e2c68642c6	Miscellaneous fixes from code review. Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc to make code that uses this simpler later on.	2014-01-01 22:03:39 -05:00
Matei Zaharia	45ff8f413d	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala	2014-01-01 21:25:00 -05:00
Patrick Wendell	f8d245bdfc	Merge remote-tracking branch 'apache-github/master' into log4j-fix-2 Conflicts: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2014-01-01 16:10:51 -08:00
Matei Zaharia	42bcfb2bb2	Fix two compile errors introduced in merge	2013-12-31 18:26:23 -05:00
Matei Zaharia	ba9338f104	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2013-12-31 18:23:14 -05:00
Tathagata Das	fcd17a1e8e	Fixed comments and long lines based on comments on PR 289.	2013-12-31 02:01:45 -08:00
Tathagata Das	87b915f221	Removed extra empty lines.	2013-12-31 00:42:10 -08:00
Tathagata Das	3ab297adaa	Removed unnecessary comments.	2013-12-31 00:38:19 -08:00
Tathagata Das	97630849ff	Added pom.xml for external projects and removed unnecessary dependencies and repositories from other poms and sbt.	2013-12-31 00:28:57 -08:00
Patrick Wendell	18181e6c41	Removing initLogging entirely	2013-12-30 23:39:47 -08:00
Tathagata Das	f4e4066191	Refactored kafka, flume, zeromq, mqtt as separate external projects, with their own self-contained scala API, java API, scala unit tests and java unit tests. Updated examples to use the external projects.	2013-12-30 11:13:24 -08:00
Andrew Or	8fbff9f5d0	Address Aaron's comments	2013-12-29 16:22:44 -08:00
Matei Zaharia	0bd1900cbc	Fix a few settings that were being read as system properties after merge	2013-12-29 15:38:46 -05:00
Matei Zaharia	b4ceed40d6	Merge remote-tracking branch 'origin/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala	2013-12-29 15:08:08 -05:00
Matei Zaharia	20631348d1	Fix other failing tests	2013-12-28 23:17:58 -05:00
Matei Zaharia	0900d5c72a	Add a StreamingContext constructor that takes a conf object	2013-12-28 21:38:07 -05:00
Matei Zaharia	a8f316386a	Fix CheckpointSuite test failures	2013-12-28 21:26:43 -05:00
Matei Zaharia	578bd1fc28	Fix test failures due to setting / clearing clock type in Streaming	2013-12-28 21:21:06 -05:00
Matei Zaharia	642029e7f4	Various fixes to configuration code - Got rid of global SparkContext.globalConf - Pass SparkConf to serializers and compression codecs - Made SparkConf public instead of private[spark] - Improved API of SparkContext and SparkConf - Switched executor environment vars to be passed through SparkConf - Fixed some places that were still using system properties - Fixed some tests, though others are still failing This still fails several tests in core, repl and streaming, likely due to properties not being set or cleared correctly (some of the tests run fine in isolation).	2013-12-28 17:13:15 -05:00
Tathagata Das	271e3237f3	Minor changes in comments and strings to address comments in PR 289.	2013-12-27 12:26:57 -08:00
Andrew Or	a515706d9c	Fix streaming JavaAPISuite again	2013-12-26 23:40:07 -08:00
Aaron Davidson	1ffe26c7c0	Fix streaming JavaAPISuite that depended on order	2013-12-26 23:40:07 -08:00
Tathagata Das	6e43039614	Refactored streaming project to separate out the twitter functionality.	2013-12-26 18:02:49 -08:00
Tathagata Das	577c8cc834	Removed unnecessary options from WindowedDStream.	2013-12-26 14:17:16 -08:00
Tathagata Das	3618d70b2a	Added warning if filestream adds files with no data in them (file RDDs have 0 partitions).	2013-12-26 12:45:40 -08:00
Tathagata Das	be64719138	Changed file stream to not catch any exceptions related to finding new files (FileNotFound exception is still caught and ignored).	2013-12-26 12:33:12 -08:00
Tathagata Das	069cb14bdc	Updated groupByKeyAndWindow to be computed incrementally, and added mapSideCombine to combineByKeyAndWindow.	2013-12-26 02:58:29 -08:00
Tathagata Das	bacc65cf28	Removed slack time in file stream and added better handling of failures due to FileNotFound exceptions.	2013-12-26 10:18:46 +00:00
Tathagata Das	d4dfab503a	Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.	2013-12-24 14:01:13 -08:00
Prashant Sharma	2573add94c	spark-544, introducing SparkConf and related configuration overhaul.	2013-12-25 00:09:36 +05:30
Tathagata Das	e9165d2a39	Merge branch 'scheduler-update' into window-improvement	2013-12-23 17:49:41 -08:00
Tathagata Das	0af7f84c8e	Minor formatting fixes.	2013-12-23 17:47:16 -08:00
Tathagata Das	8ca14a1e51	Updated testsuites to work with the slack time of file stream.	2013-12-23 16:27:00 -08:00
Tathagata Das	b31e91f927	Merge branch 'scheduler-update' into filestream-fix	2013-12-23 15:59:15 -08:00
Tathagata Das	6eaa050549	Minor change for PR 277.	2013-12-23 15:55:45 -08:00
Tathagata Das	19d1d58b67	Fixed bug in file stream that prevented some files from being read correctly.	2013-12-23 23:48:43 +00:00
Tathagata Das	f9771690a6	Minor formatting fixes.	2013-12-23 11:32:26 -08:00
Tathagata Das	dc3ee6b612	Added comments to BatchInfo and JobSet, based on Patrick's comment on PR 277.	2013-12-23 11:30:42 -08:00
Tathagata Das	e7b62cbfbf	Updated CheckpointWriter and FileInputDStream to be robust against failed FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.	2013-12-22 18:49:36 -08:00
Tathagata Das	d91ec6f8ea	Merge branch 'scheduler-update' into filestream-fix	2013-12-22 15:23:35 -08:00
Tathagata Das	3ddbdbfbc7	Minor updates based on comments on PR 277.	2013-12-20 19:51:37 -08:00
Tathagata Das	de41c436a0	Merge branch 'scheduler-update' into window-improvement Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala	2013-12-19 12:05:08 -08:00
Tathagata Das	984c582487	Merge branch 'scheduler-update' into filestream-fix Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala	2013-12-19 11:20:48 -08:00
Tathagata Das	ec71b445ad	Minor changes.	2013-12-18 23:39:28 -08:00
Tathagata Das	e93b391d75	Merge branch 'apache-master' into scheduler-update Conflicts: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala	2013-12-18 17:51:14 -08:00
Tathagata Das	b80ec05635	Added StatsReportListener to generate processing time statistics across multiple batches.	2013-12-18 15:35:24 -08:00
Mark Hamstra	09ed7ddfa0	Use scala.binary.version in POMs	2013-12-15 12:39:58 -08:00
Tathagata Das	097e120c0c	Refactored streaming scheduler and added listener interface. - Refactored Scheduler + JobManager to JobGenerator + JobScheduler and added JobSet for cleaner code. Moved scheduler related code to streaming.scheduler package. - Added StreamingListener trait (similar to SparkListener) to enable gathering of streaming stats like processing times and delays. Added StreamingContext.addListener() to add listeners. - Deduped some code in streaming tests by modifying TestSuiteBase, and added StreamingListenerSuite.	2013-12-12 20:48:02 -08:00
Tathagata Das	5e9ce83d68	Fixed multiple file stream and checkpointing bugs. - Made file stream more robust to transient failures. - Changed Spark.setCheckpointDir API to not have the second 'useExisting' parameter. Spark will always create a unique directory for checkpointing underneath the directory provided to the function. - Fixed bug wrt local relative paths as checkpoint directory. - Made DStream and RDD checkpointing use SparkContext.hadoopConfiguration, so that more HDFS compatible filesystems are supported for checkpointing.	2013-12-11 14:01:36 -08:00
Prashant Sharma	603af51bb5	Merge branch 'master' into akka-bug-fix Conflicts: core/pom.xml core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala pom.xml project/SparkBuild.scala streaming/pom.xml yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala	2013-12-11 10:21:53 +05:30
Prashant Sharma	17db6a9041	Style fixes and addressed review comments at #221	2013-12-10 11:47:16 +05:30
Prashant Sharma	7ad6921ae0	Incorporated Patrick's feedback comment on #211 and made maven build/dep-resolution at least a bit faster.	2013-12-07 12:45:57 +05:30
Raymond Liu	4738818dd6	Fix pom.xml for maven build	2013-12-03 16:36:05 +08:00
Tathagata Das	03ef6e8899	Added flag in window operation to use partition aware union.	2013-11-21 11:38:56 -08:00
Tathagata Das	fd031679df	Added partitioner aware union, modified DStream.window.	2013-11-21 11:28:37 -08:00
Tathagata Das	2ec4b2e38d	Added partition aware union to improve reduceByKeyAndWindow	2013-11-20 23:49:30 -08:00
Prashant Sharma	95d8dbce91	Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10-temp Conflicts: core/src/main/scala/org/apache/spark/util/collection/PrimitiveVector.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala	2013-11-21 12:34:46 +05:30
Prashant Sharma	199e9cf02d	Merge branch 'scala210-master' of github.com:colorant/incubator-spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/deploy/client/Client.scala core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala	2013-11-21 11:55:48 +05:30
Henry Saputra	10be58f251	Another set of changes to remove unnecessary semicolon (;) from Scala code. Passed the sbt/sbt compile and test	2013-11-19 16:56:23 -08:00
Henry Saputra	9c934b640f	Remove the semicolons at the end of Scala code to make it more pure Scala code. Also remove unused imports as I found them along the way. Remove return statements when returning value in the Scala code. Passing compile and tests.	2013-11-19 10:19:03 -08:00
Aaron Davidson	f629ba95b6	Various merge corrections I've diff'd this patch against my own -- since they were both created independently, this means that two sets of eyes have gone over all the merge conflicts that were created, so I'm feeling significantly more confident in the resulting PR. @rxin has looked at the changes to the repl and is resoundingly confident that they are correct.	2013-11-14 22:13:09 -08:00
Raymond Liu	a60620b76a	Merge branch 'master' into scala-2.10	2013-11-14 12:44:19 +08:00
Raymond Liu	0f2e3c6e31	Merge branch 'master' into scala-2.10	2013-11-13 16:55:11 +08:00
Tathagata Das	7ccbbdacb9	Made block generator thread safe to fix Kafka bug.	2013-11-12 00:10:45 -08:00
Prashant Sharma	6860b79f6e	Remove deprecated actorFor and use actorSelection everywhere.	2013-11-12 12:43:53 +05:30
Tathagata Das	dc9570782a	Merge branch 'apache-master' into transform	2013-10-25 14:22:23 -07:00
Patrick Wendell	af4a529f6e	Exclude jopt from kafka dependency. Kafka uses an older version of jopt that causes bad conflicts with the version used by spark-perf. It's not easy to remove this downstream because of the way that spark-perf uses Spark (by including a spark assembly as an unmanaged jar). This fixes the problem at its source by just never including it.	2013-10-25 09:20:30 -07:00
Patrick Wendell	ad5f579cbf	Style fixes	2013-10-24 22:18:53 -07:00
Patrick Wendell	e5f6d5697b	Spacing fix	2013-10-24 22:08:06 -07:00
Patrick Wendell	a351fd4aed	Small spacing fix	2013-10-24 21:16:30 -07:00
Patrick Wendell	31e92b72e3	Adding Java versions and associated tests	2013-10-24 21:14:56 -07:00
Patrick Wendell	39f6f75588	Some clean-up of tests	2013-10-24 16:43:33 -07:00
Tathagata Das	e962a6e6ee	Fixed accidental bug.	2013-10-24 15:17:26 -07:00
Patrick Wendell	9423532fab	Removing Java for now	2013-10-24 14:31:34 -07:00
Patrick Wendell	05ac9940ee	Adding tests	2013-10-24 14:31:34 -07:00
Patrick Wendell	08c1a42d7d	Add a `repartition` operator. This patch adds an operator called repartition with more straightforward semantics than the current `coalesce` operator. There are a few use cases where this operator is useful: 1. If a user wants to increase the number of partitions in the RDD. This is more common now with streaming. E.g. a user is ingesting data on one node but they want to add more partitions to ensure parallelism of subsequent operations across threads or the cluster. Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's super confusing. 2. If a user has input data where the number of partitions is not known. E.g. > sc.textFile("some file").coalesce(50).... This is both vague semantically (am I growing or shrinking this RDD) but also, may not work correctly if the base RDD has fewer than 50 partitions. The new operator forces shuffles every time, so it will always produce exactly the number of new partitions. It also throws an exception rather than silently not-working if a bad input is passed. I am currently adding streaming tests (requires refactoring some of the test suite to allow testing at partition granularity), so this is not ready for merge yet. But feedback is welcome.	2013-10-24 14:31:33 -07:00
Tathagata Das	0400aba1c0	Merge branch 'apache-master' into transform	2013-10-24 11:05:00 -07:00
Tathagata Das	bacfe5ebca	Added JavaStreamingContext.transform	2013-10-24 10:56:24 -07:00
Matei Zaharia	dd659642e7	Merge pull request #64 from prabeesh/master MQTT Adapter for Spark Streaming MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. You may read more about it here http://mqtt.org/ Message Queue Telemetry Transport (MQTT) is an open message protocol for M2M communications. It enables the transfer of telemetry-style data in the form of messages from devices like sensors and actuators, to mobile phones, embedded systems on vehicles, or laptops and full scale computers. The protocol was invented by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions This protocol enables a publish/subscribe messaging model in an extremely lightweight way. It is useful for connections with remote locations where line of code and network bandwidth is a constraint. MQTT is one of the widely used protocol for 'Internet of Things'. This protocol is getting much attraction as anything and everything is getting connected to internet and they all produce data. Researchers and companies predict some 25 billion devices will be connected to the internet by 2015. Plugin/Support for MQTT is available in popular MQs like RabbitMQ, ActiveMQ etc. Support for MQTT in Spark will help people with Internet of Things (IoT) projects to use Spark Streaming for their real time data processing needs (from sensors and other embedded devices etc).	2013-10-23 15:07:59 -07:00
Tathagata Das	fe8626efd1	Merge branch 'apache-master' into transform	2013-10-22 23:40:40 -07:00
Tathagata Das	72d2e1dd77	Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs.	2013-10-22 23:35:51 -07:00
Matei Zaharia	731c94e91d	Merge pull request #56 from jerryshao/kafka-0.8-dev Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming Conflicts: streaming/pom.xml	2013-10-21 23:31:38 -07:00
Tathagata Das	0666498799	Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs.	2013-10-21 05:34:09 -07:00
Reynold Xin	4e44d65b5e	Exclusion rules for Maven build files.	2013-10-19 12:35:55 -07:00
Prabeesh K	d223d38933	Update MQTTInputDStream.scala	2013-10-18 09:09:49 +05:30
prabeesh	890f8fe439	modify code, use Spark Logging Class	2013-10-17 10:00:40 +05:30
prabeesh	9a7575728d	add maven dependencies for mqtt	2013-10-16 13:41:49 +05:30
prabeesh	2e48b23eae	added mqtt adapter	2013-10-16 13:36:25 +05:30
prabeesh	742ada91e0	mqttinputdstream for mqttstreaming adapter	2013-10-16 13:35:29 +05:30
Matei Zaharia	b5346064d6	Merge pull request #8 from vchekan/checkpoint-ttl-restore Serialize and restore spark.cleaner.ttl to savepoint In accordance with the conversation on the spark-dev mailing list, preserve the spark.cleaner.ttl parameter when serializing checkpoint.	2013-10-15 21:25:03 -07:00
Aaron Davidson	a395911138	Refactor BlockId into an actual type This is an unfortunately invasive change which converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now: + Type safety + Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types. + Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.) + It will only get harder to make this change as time goes on. Since this touches a lot of files, it'd be best to either get this patch in quickly or throw it on the ground to avoid too many secondary merge conflicts.	2013-10-12 22:44:57 -07:00
jerryshao	c23cd72b4b	Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming	2013-10-12 20:00:42 +08:00
Prashant Sharma	26860639c5	Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala project/SparkBuild.scala	2013-10-10 09:42:23 +05:30
Prashant Sharma	7be75682b9	Merge branch 'master' into wip-merge-master Conflicts: bagel/pom.xml core/pom.xml core/src/test/scala/org/apache/spark/ui/UISuite.scala examples/pom.xml mllib/pom.xml pom.xml project/SparkBuild.scala repl/pom.xml streaming/pom.xml tools/pom.xml In scala 2.10, a shorter representation is used for naming artifacts so changed to shorter scala version for artifacts and made it a property in pom.	2013-10-08 11:29:40 +05:30
Patrick Wendell	aa9fb84994	Merging build changes in from 0.8	2013-10-05 22:07:00 -07:00
Martin Weindel	e09f4a9601	fixed some warnings	2013-10-05 23:08:23 +02:00
Prashant Sharma	5829692885	Merge branch 'master' into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala docs/_config.yml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2013-10-01 11:57:24 +05:30
Prashant Sharma	7ff4c2d399	fixed maven build for scala 2.10	2013-09-26 10:48:24 +05:30
Patrick Wendell	6079721fa1	Update build version in master	2013-09-24 11:41:51 -07:00
Prashant Sharma	276c37a51c	Akka 2.2 migration	2013-09-22 08:20:12 +05:30
Vadim Chekan	fbe40c5806	Serialize and restore spark.cleaner.ttl to savepoint	2013-09-20 12:13:48 -07:00
Prashant Sharma	6fcfefcb27	Few more fixes to tests broken during merge	2013-09-10 10:57:47 +05:30
Prashant Sharma	4106ae9fbf	Merged with master	2013-09-06 17:53:01 +05:30
Matei Zaharia	0a8cc30921	Move some classes to more appropriate packages: * RDD, RDDFunctions -> org.apache.spark.rdd Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util * JavaSerializer, KryoSerializer -> org.apache.spark.serializer	2013-09-01 14:13:16 -07:00
Matei Zaharia	5701eb92c7	Fix some URLs	2013-09-01 14:13:16 -07:00
Matei Zaharia	46eecd110a	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
Matei Zaharia	5a6ac12840	Merge pull request #701 from ScrapCodes/documentation-suggestions Documentation suggestions for spark streaming.	2013-08-22 22:08:03 -07:00
Prashant Sharma	2bc348e92c	Linking custom receiver guide	2013-08-23 09:44:02 +05:30
Prashant Sharma	3049415e24	Corrections in documentation comment	2013-08-23 09:40:28 +05:30
Jey Kottalam	23f4622aff	Remove redundant dependencies from POMs	2013-08-18 18:53:57 -07:00
Jey Kottalam	ad580b94d5	Maven build now also works with YARN	2013-08-16 13:50:12 -07:00
Jey Kottalam	11b42a84db	Maven build now works with CDH hadoop-2.0.0-mr1	2013-08-16 13:50:12 -07:00
Jey Kottalam	353fab2440	Initial changes to make Maven build agnostic of hadoop version	2013-08-16 13:50:12 -07:00
Josh Rosen	d7f78b443b	Change scala.Option to Guava Optional in Java APIs.	2013-08-11 12:05:09 -07:00
Reynold Xin	c61843a69f	Changed other LZF uses to use the compression codec interface.	2013-07-31 10:32:13 -07:00
Matei Zaharia	af3c9d5042	Add Apache license headers and LICENSE and NOTICE files	2013-07-16 17:21:33 -07:00
Prashant Sharma	119c98c1be	code formatting, The warning related to scope exit and enter is not worth fixing as it only affects debugging scopes and nothing else.	2013-07-16 15:01:33 +05:30
Prashant Sharma	55da6e9504	Fixed warning erasure -> runtimeClass	2013-07-16 14:37:08 +05:30
Prashant Sharma	ff14f38f3d	Fixed warning Throwables	2013-07-16 14:34:56 +05:30
Prashant Sharma	63addd93a8	Fixed warning ClassManifest -> ClassTag	2013-07-16 14:09:52 +05:30
Prashant Sharma	e86d5dbaad	Merge branch 'master' into master-merge Conflicts: README.md core/pom.xml core/src/main/scala/spark/deploy/JsonProtocol.scala core/src/main/scala/spark/deploy/LocalSparkCluster.scala core/src/main/scala/spark/deploy/master/Master.scala core/src/main/scala/spark/deploy/master/MasterWebUI.scala core/src/main/scala/spark/deploy/worker/Worker.scala core/src/main/scala/spark/deploy/worker/WorkerWebUI.scala core/src/main/scala/spark/storage/BlockManagerUI.scala core/src/main/scala/spark/util/AkkaUtils.scala pom.xml project/SparkBuild.scala streaming/src/main/scala/spark/streaming/receivers/ActorReceiver.scala	2013-07-12 14:49:16 +05:30
Matei Zaharia	7dcda9ae74	Merge pull request #688 from markhamstra/scalaDependencies Fixed SPARK-795 with explicit dependencies	2013-07-08 23:24:23 -07:00
Mark Hamstra	0b39d66f3f	pom cleanup	2013-07-08 16:07:09 -07:00
Mark Hamstra	afdaf430bd	Explicit dependencies for scala-library and scalap to prevent 2.9.2 vs. 2.9.3 problems	2013-07-08 15:40:50 -07:00
Shivaram Venkataraman	3350ad0d7f	Catch RejectedExecution exception in Checkpoint handler.	2013-07-07 04:09:37 -07:00
Matei Zaharia	1ffadb2d9e	Merge remote-tracking branch 'pwendell/ui-updates' Conflicts: core/src/main/scala/spark/scheduler/DAGScheduler.scala core/src/main/scala/spark/util/AkkaUtils.scala pom.xml	2013-07-06 15:51:41 -07:00
Matei Zaharia	94871e4703	Merge pull request #655 from tgravescs/master Add support for running Spark on Yarn on a secure Hadoop Cluster	2013-07-06 15:26:19 -07:00
Tathagata Das	280418ac45	Reduced the number of Iterator to ArrayBuffer copies in NetworkReceiver.	2013-07-05 21:38:21 -07:00
Prashant Sharma	a5f1f6a907	Merge branch 'master' into master-merge Conflicts: core/pom.xml core/src/main/scala/spark/MapOutputTracker.scala core/src/main/scala/spark/RDD.scala core/src/main/scala/spark/RDDCheckpointData.scala core/src/main/scala/spark/SparkContext.scala core/src/main/scala/spark/Utils.scala core/src/main/scala/spark/api/python/PythonRDD.scala core/src/main/scala/spark/deploy/client/Client.scala core/src/main/scala/spark/deploy/master/MasterWebUI.scala core/src/main/scala/spark/deploy/worker/Worker.scala core/src/main/scala/spark/deploy/worker/WorkerWebUI.scala core/src/main/scala/spark/rdd/BlockRDD.scala core/src/main/scala/spark/rdd/ZippedRDD.scala core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala core/src/main/scala/spark/storage/BlockManager.scala core/src/main/scala/spark/storage/BlockManagerMaster.scala core/src/main/scala/spark/storage/BlockManagerMasterActor.scala core/src/main/scala/spark/storage/BlockManagerUI.scala core/src/main/scala/spark/util/AkkaUtils.scala core/src/test/scala/spark/SizeEstimatorSuite.scala pom.xml project/SparkBuild.scala repl/src/main/scala/spark/repl/SparkILoop.scala repl/src/test/scala/spark/repl/ReplSuite.scala streaming/src/main/scala/spark/streaming/StreamingContext.scala streaming/src/main/scala/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/spark/streaming/dstream/KafkaInputDStream.scala streaming/src/main/scala/spark/streaming/util/MasterFailureTest.scala	2013-07-03 11:43:26 +05:30
Y.CORP.YAHOO.COM\tgraves	923cf92900	Rework from pull request. Removed --user option from Spark on Yarn Client, made the user of JAVA_HOME environment variable conditional on if its set, and created addCredentials in each of the SparkHadoopUtil classes to only add the credentials when the profile is hadoop2-yarn.	2013-07-02 21:18:59 -05:00
Matei Zaharia	4358acfe07	Initialize Twitter4J OAuth from system properties instead of prompting	2013-06-29 15:25:06 -07:00
Matei Zaharia	1667158544	Merge remote-tracking branch 'mrpotes/master'	2013-06-29 14:36:09 -07:00
Patrick Wendell	362d996c81	Handful of changes based on matei's review - Avoid exception when no tasks have finished for a stage - Adding DOCTYPE so css renders properly - Adding progress slider	2013-06-27 19:14:28 -07:00
James Phillpotts	366572edca	Include a default OAuth implementation, and update examples and JavaStreamingContext	2013-06-25 22:59:34 +01:00
Tathagata Das	c89af0a7f9	Merge branch 'master' into streaming Conflicts: .gitignore	2013-06-24 23:57:47 -07:00
Tathagata Das	48c7e373c6	Minor formatting fixes	2013-06-24 23:11:04 -07:00
Tathagata Das	1249e9153b	Merge pull request #572 from Reinvigorate/sm-block-interval Adding spark.streaming.blockInterval property	2013-06-24 21:46:33 -07:00
Tathagata Das	cfcda95f86	Merge pull request #571 from Reinvigorate/sm-kafka-serializers Surfacing decoders on KafkaInputDStream	2013-06-24 21:44:50 -07:00
James Phillpotts	8955787a59	Twitter API v1 is retired - username/password auth no longer possible	2013-06-24 09:15:17 +01:00
James Phillpotts	93a1643405	Allow other twitter authorizations than username/password	2013-06-21 14:21:52 +01:00
Thomas Graves	75d78c7ac9	Add support for Spark on Yarn on a secure Hadoop cluster	2013-06-19 11:18:42 -05:00
Jey Kottalam	e7982c798e	Exclude old versions of Netty from Maven-based build	2013-05-18 21:24:58 -07:00
seanm	f25282def5	fixing kafkaStream Java API and adding test	2013-05-10 17:34:28 -06:00
seanm	3632980b1b	fixing indentation	2013-05-10 15:54:26 -06:00
seanm	b95c1bdbba	count() now uses a transform instead of ConstantInputDStream	2013-05-10 12:47:24 -06:00
seanm	d761e7359d	adding kafkaStream API tests	2013-05-10 12:05:10 -06:00
Reynold Xin	90577ada69	Merge branch 'shuffle-performance-fix-0.7' of github.com:shane-huang/spark into shufflemerge Conflicts: core/src/main/scala/spark/storage/BlockManager.scala core/src/main/scala/spark/storage/DiskStore.scala project/SparkBuild.scala	2013-05-07 15:56:19 -07:00
Prashant Sharma	4041a2689e	Updated to latest stable scala 2.10.1 and akka 2.1.2	2013-05-01 11:35:35 +05:30
Prashant Sharma	24bbf318b3	Fixed other warnings	2013-04-29 19:56:28 +05:30
Prashant Sharma	d3518f57cd	Fixed warning: erasure -> runtimeClass	2013-04-29 18:14:25 +05:30
Prashant Sharma	8f3ac240cb	Fixed Warning: ClassManifest -> ClassTag	2013-04-29 16:39:13 +05:30
Prashant Sharma	4b4a36ea7d	Fixed pom.xml with updated dependencies.	2013-04-29 12:55:43 +05:30
Mridul Muralidharan	430c531464	Remove debug statements	2013-04-29 00:24:30 +05:30
Mridul Muralidharan	3a89a76b87	Make log message more descriptive to aid in debugging	2013-04-29 00:04:12 +05:30
Mridul Muralidharan	7fa6978a1e	Allow CheckpointWriter pending tasks to finish	2013-04-28 23:08:10 +05:30
Mridul Muralidharan	afee902443	Attempt to fix streaming test failures after yarn branch merge	2013-04-28 22:26:45 +05:30
Prashant Sharma	bb4102b0ee	Fixed breaking tests in streaming checkpoint suite. Changed RichInt to Int as it is final and not serializable	2013-04-25 14:38:01 +05:30
Prashant Sharma	ad88f083a6	scala 2.10 and master merge	2013-04-24 18:08:26 +05:30
Mridul Muralidharan	dd515ca3ee	Attempt at fixing merge conflict	2013-04-24 09:24:17 +05:30
seanm	7e56e99573	Surfacing decoders on KafkaInputDStream	2013-04-16 17:17:16 -06:00
seanm	ab0f834dbb	adding spark.streaming.blockInterval property	2013-04-16 11:57:05 -06:00
seanm	b42d68c8ce	fixing Spark Streaming count() so that 0 will be emitted when there is nothing to count	2013-04-15 12:54:55 -06:00
Matei Zaharia	65caa8f711	Merge remote-tracking branch 'jey/bump-development-version-to-0.8.0' Conflicts: docs/_config.yml project/SparkBuild.scala	2013-04-08 12:43:17 -04:00
Mridul Muralidharan	6798a09df8	Add support for building against hadoop2-yarn : adding new maven profile for it	2013-04-07 17:47:38 +05:30

633 commits