Tathagata Das
4a5558ca99
Fixed bugs in reading of checkpoints.
2014-01-10 03:28:39 +00:00
Tathagata Das
f1d206c6b4
Merge branch 'standalone-driver' into driver-test
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-09 15:06:24 -08:00
Tathagata Das
6f713e2a3e
Changed the way StreamingContext finds and reads checkpoint files, and added JavaStreamingContext.getOrCreate.
2014-01-09 13:42:04 -08:00
Tathagata Das
a17cc602ac
More bug fixes.
2014-01-08 04:12:05 -08:00
Tathagata Das
0b7a132d03
Modified checkpoing file clearing policy.
2014-01-08 03:22:06 -08:00
Tathagata Das
3b4c4c7f4d
Merge remote-tracking branch 'apache/master' into project-refactor
...
Conflicts:
examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
2014-01-06 03:05:52 -08:00
Tathagata Das
ac1f4b06c1
Added a hashmap to cache file mod times.
2014-01-05 23:42:53 -08:00
Tathagata Das
2394794591
Merge branch 'filestream-fix' into driver-test
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
2014-01-06 02:23:53 +00:00
Tathagata Das
8e88db3ca5
Bug fixes to the DriverRunner and minor changes here and there.
2014-01-06 02:21:56 +00:00
Patrick Wendell
79f52809c8
Removing SPARK_EXAMPLES_JAR in the code
2014-01-05 11:49:42 -08:00
Tathagata Das
a1b8dd53e3
Added StreamingContext.getOrCreate to for automatic recovery, and added RecoverableNetworkWordCount example to use it.
2014-01-02 19:07:22 -08:00
Patrick Wendell
588a1695f4
Merge pull request #297 from tdas/window-improvement
...
Improvements to DStream window ops and refactoring of Spark's CheckpointSuite
- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
2014-01-02 13:20:54 -08:00
Matei Zaharia
e2c68642c6
Miscellaneous fixes from code review.
...
Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc to make code that uses this
simpler later on.
2014-01-01 22:03:39 -05:00
Matei Zaharia
45ff8f413d
Merge remote-tracking branch 'apache/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
2014-01-01 21:25:00 -05:00
Patrick Wendell
f8d245bdfc
Merge remote-tracking branch 'apache-github/master' into log4j-fix-2
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-01 16:10:51 -08:00
Matei Zaharia
42bcfb2bb2
Fix two compile errors introduced in merge
2013-12-31 18:26:23 -05:00
Matei Zaharia
ba9338f104
Merge remote-tracking branch 'apache/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2013-12-31 18:23:14 -05:00
Tathagata Das
fcd17a1e8e
Fixed comments and long lines based on comments on PR 289.
2013-12-31 02:01:45 -08:00
Tathagata Das
87b915f221
Removed extra empty lines.
2013-12-31 00:42:10 -08:00
Tathagata Das
3ab297adaa
Removed unnecessary comments.
2013-12-31 00:38:19 -08:00
Patrick Wendell
18181e6c41
Removing initLogging entirely
2013-12-30 23:39:47 -08:00
Tathagata Das
f4e4066191
Refactored kafka, flume, zeromq, mqtt as separate external projects, with their own self-contained scala API, java API, scala unit tests and java unit tests. Updated examples to use the external projects.
2013-12-30 11:13:24 -08:00
Matei Zaharia
0bd1900cbc
Fix a few settings that were being read as system properties after merge
2013-12-29 15:38:46 -05:00
Matei Zaharia
b4ceed40d6
Merge remote-tracking branch 'origin/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
2013-12-29 15:08:08 -05:00
Matei Zaharia
20631348d1
Fix other failing tests
2013-12-28 23:17:58 -05:00
Matei Zaharia
0900d5c72a
Add a StreamingContext constructor that takes a conf object
2013-12-28 21:38:07 -05:00
Matei Zaharia
a8f316386a
Fix CheckpointSuite test failures
2013-12-28 21:26:43 -05:00
Matei Zaharia
578bd1fc28
Fix test failures due to setting / clearing clock type in Streaming
2013-12-28 21:21:06 -05:00
Matei Zaharia
642029e7f4
Various fixes to configuration code
...
- Got rid of global SparkContext.globalConf
- Pass SparkConf to serializers and compression codecs
- Made SparkConf public instead of private[spark]
- Improved API of SparkContext and SparkConf
- Switched executor environment vars to be passed through SparkConf
- Fixed some places that were still using system properties
- Fixed some tests, though others are still failing
This still fails several tests in core, repl and streaming, likely due
to properties not being set or cleared correctly (some of the tests run
fine in isolation).
2013-12-28 17:13:15 -05:00
Tathagata Das
271e3237f3
Minor changes in comments and strings to address comments in PR 289.
2013-12-27 12:26:57 -08:00
Tathagata Das
6e43039614
Refactored streaming project to separate out the twitter functionality.
2013-12-26 18:02:49 -08:00
Tathagata Das
577c8cc834
Removed unncessary options from WindowedDStream.
2013-12-26 14:17:16 -08:00
Tathagata Das
3618d70b2a
Added warning if filestream adds files with no data in them (file RDDs have 0 partitions).
2013-12-26 12:45:40 -08:00
Tathagata Das
be64719138
Changed file stream to not catch any exceptions related to finding new files (FileNotFound exception is still caught and ignored).
2013-12-26 12:33:12 -08:00
Tathagata Das
069cb14bdc
Updated groupByKeyAndWindow to be computed incrementally, and added mapSideCombine to combineByKeyAndWindow.
2013-12-26 02:58:29 -08:00
Tathagata Das
bacc65cf28
Removed slack time in file stream and added better handling of exceptions due to failures due FileNotFound exceptions.
2013-12-26 10:18:46 +00:00
Tathagata Das
d4dfab503a
Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.
2013-12-24 14:01:13 -08:00
Prashant Sharma
2573add94c
spark-544, introducing SparkConf and related configuration overhaul.
2013-12-25 00:09:36 +05:30
Tathagata Das
e9165d2a39
Merge branch 'scheduler-update' into window-improvement
2013-12-23 17:49:41 -08:00
Tathagata Das
0af7f84c8e
Minor formatting fixes.
2013-12-23 17:47:16 -08:00
Tathagata Das
8ca14a1e51
Updated testsuites to work with the slack time of file stream.
2013-12-23 16:27:00 -08:00
Tathagata Das
b31e91f927
Merge branch 'scheduler-update' into filestream-fix
2013-12-23 15:59:15 -08:00
Tathagata Das
6eaa050549
Minor change for PR 277.
2013-12-23 15:55:45 -08:00
Tathagata Das
19d1d58b67
Fixed bug in file stream that prevented some files from being read
...
correctly.
2013-12-23 23:48:43 +00:00
Tathagata Das
f9771690a6
Minor formatting fixes.
2013-12-23 11:32:26 -08:00
Tathagata Das
dc3ee6b612
Added comments to BatchInfo and JobSet, based on Patrick's comment on PR 277.
2013-12-23 11:30:42 -08:00
Tathagata Das
e7b62cbfbf
Updated CheckpointWriter and FileInputDStream to be robust against failed FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.
2013-12-22 18:49:36 -08:00
Tathagata Das
d91ec6f8ea
Merge branch 'scheduler-update' into filestream-fix
2013-12-22 15:23:35 -08:00
Tathagata Das
3ddbdbfbc7
Minor updated based on comments on PR 277.
2013-12-20 19:51:37 -08:00
Tathagata Das
de41c436a0
Merge branch 'scheduler-update' into window-improvement
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
2013-12-19 12:05:08 -08:00