Patrick Wendell
d37408f39c
Merge pull request #377 from andrewor14/master
...
External Sorting for Aggregator and CoGroupedRDDs (Revisited)
(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303 , which was closed because Jenkins / github was misbehaving)
The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.
Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
2014-01-10 16:25:01 -08:00
Tathagata Das
4f39e79c23
Merge remote-tracking branch 'apache/master' into driver-test
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
2014-01-10 15:47:01 -08:00
Tathagata Das
82f07deeda
Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate.
2014-01-10 15:37:05 -08:00
Tathagata Das
e4bb845238
Updated docs based on Patrick's comments in PR 383.
2014-01-10 12:17:09 -08:00
Tathagata Das
2213a5a47f
Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test
2014-01-10 05:06:22 -08:00
Tathagata Das
740730a179
Fixed conf/slaves and updated docs.
2014-01-10 05:06:15 -08:00
Tathagata Das
4f609f7901
Removed spark.hostPort and other setting from SparkConf before saving to checkpoint.
2014-01-10 12:58:07 +00:00
Tathagata Das
d7ec73ac76
Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test
2014-01-10 11:44:17 +00:00
Tathagata Das
9d3d9c8251
Refactored graph checkpoint file reading and writing code to make it cleaner and easily debuggable.
2014-01-10 11:44:02 +00:00
Patrick Wendell
997c830e0b
Merge pull request #363 from pwendell/streaming-logs
...
Set default logging to WARN for Spark streaming examples.
This programatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
2014-01-09 22:22:20 -08:00
Andrew Or
372a533a6c
Fix wonky imports from merge
2014-01-09 21:47:49 -08:00
Andrew Or
d76e1f90a8
Merge github.com:apache/incubator-spark
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkEnv.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
2014-01-09 21:38:48 -08:00
Tathagata Das
38d75e18fa
Merge remote-tracking branch 'apache/master' into driver-test
2014-01-09 19:31:36 -08:00
Tathagata Das
4a5558ca99
Fixed bugs in reading of checkpoints.
2014-01-10 03:28:39 +00:00
Tathagata Das
f1d206c6b4
Merge branch 'standalone-driver' into driver-test
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-09 15:06:24 -08:00
Tathagata Das
6f713e2a3e
Changed the way StreamingContext finds and reads checkpoint files, and added JavaStreamingContext.getOrCreate.
2014-01-09 13:42:04 -08:00
Patrick Wendell
35f80da21a
Set default logging to WARN for Spark streaming examples.
...
This programatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
2014-01-09 10:42:58 -08:00
Matei Zaharia
a01f3401e3
Use typed getters for configuration settings
2014-01-09 00:07:29 -08:00
Tathagata Das
a17cc602ac
More bug fixes.
2014-01-08 04:12:05 -08:00
Tathagata Das
0b7a132d03
Modified checkpoing file clearing policy.
2014-01-08 03:22:06 -08:00
Tathagata Das
3b4c4c7f4d
Merge remote-tracking branch 'apache/master' into project-refactor
...
Conflicts:
examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
2014-01-06 03:05:52 -08:00
Tathagata Das
ac1f4b06c1
Added a hashmap to cache file mod times.
2014-01-05 23:42:53 -08:00
Tathagata Das
2394794591
Merge branch 'filestream-fix' into driver-test
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
2014-01-06 02:23:53 +00:00
Tathagata Das
8e88db3ca5
Bug fixes to the DriverRunner and minor changes here and there.
2014-01-06 02:21:56 +00:00
Patrick Wendell
79f52809c8
Removing SPARK_EXAMPLES_JAR in the code
2014-01-05 11:49:42 -08:00
Andrew Or
df413e996f
Merge remote-tracking branch 'spark/master'
...
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
2014-01-02 20:51:23 -08:00
Tathagata Das
a1b8dd53e3
Added StreamingContext.getOrCreate to for automatic recovery, and added RecoverableNetworkWordCount example to use it.
2014-01-02 19:07:22 -08:00
Patrick Wendell
588a1695f4
Merge pull request #297 from tdas/window-improvement
...
Improvements to DStream window ops and refactoring of Spark's CheckpointSuite
- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
2014-01-02 13:20:54 -08:00
Matei Zaharia
e2c68642c6
Miscellaneous fixes from code review.
...
Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc to make code that uses this
simpler later on.
2014-01-01 22:03:39 -05:00
Matei Zaharia
45ff8f413d
Merge remote-tracking branch 'apache/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
2014-01-01 21:25:00 -05:00
Patrick Wendell
f8d245bdfc
Merge remote-tracking branch 'apache-github/master' into log4j-fix-2
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2014-01-01 16:10:51 -08:00
Matei Zaharia
42bcfb2bb2
Fix two compile errors introduced in merge
2013-12-31 18:26:23 -05:00
Matei Zaharia
ba9338f104
Merge remote-tracking branch 'apache/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2013-12-31 18:23:14 -05:00
Tathagata Das
fcd17a1e8e
Fixed comments and long lines based on comments on PR 289.
2013-12-31 02:01:45 -08:00
Tathagata Das
87b915f221
Removed extra empty lines.
2013-12-31 00:42:10 -08:00
Tathagata Das
3ab297adaa
Removed unnecessary comments.
2013-12-31 00:38:19 -08:00
Tathagata Das
97630849ff
Added pom.xml for external projects and removed unnecessary dependencies and repositoris from other poms and sbt.
2013-12-31 00:28:57 -08:00
Patrick Wendell
18181e6c41
Removing initLogging entirely
2013-12-30 23:39:47 -08:00
Tathagata Das
f4e4066191
Refactored kafka, flume, zeromq, mqtt as separate external projects, with their own self-contained scala API, java API, scala unit tests and java unit tests. Updated examples to use the external projects.
2013-12-30 11:13:24 -08:00
Andrew Or
8fbff9f5d0
Address Aaron's comments
2013-12-29 16:22:44 -08:00
Matei Zaharia
0bd1900cbc
Fix a few settings that were being read as system properties after merge
2013-12-29 15:38:46 -05:00
Matei Zaharia
b4ceed40d6
Merge remote-tracking branch 'origin/master' into conf2
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
2013-12-29 15:08:08 -05:00
Matei Zaharia
20631348d1
Fix other failing tests
2013-12-28 23:17:58 -05:00
Matei Zaharia
0900d5c72a
Add a StreamingContext constructor that takes a conf object
2013-12-28 21:38:07 -05:00
Matei Zaharia
a8f316386a
Fix CheckpointSuite test failures
2013-12-28 21:26:43 -05:00
Matei Zaharia
578bd1fc28
Fix test failures due to setting / clearing clock type in Streaming
2013-12-28 21:21:06 -05:00
Matei Zaharia
642029e7f4
Various fixes to configuration code
...
- Got rid of global SparkContext.globalConf
- Pass SparkConf to serializers and compression codecs
- Made SparkConf public instead of private[spark]
- Improved API of SparkContext and SparkConf
- Switched executor environment vars to be passed through SparkConf
- Fixed some places that were still using system properties
- Fixed some tests, though others are still failing
This still fails several tests in core, repl and streaming, likely due
to properties not being set or cleared correctly (some of the tests run
fine in isolation).
2013-12-28 17:13:15 -05:00
Tathagata Das
271e3237f3
Minor changes in comments and strings to address comments in PR 289.
2013-12-27 12:26:57 -08:00
Andrew Or
a515706d9c
Fix streaming JavaAPISuite again
2013-12-26 23:40:07 -08:00
Aaron Davidson
1ffe26c7c0
Fix streaming JavaAPISuite that depended on order
2013-12-26 23:40:07 -08:00
Tathagata Das
6e43039614
Refactored streaming project to separate out the twitter functionality.
2013-12-26 18:02:49 -08:00
Tathagata Das
577c8cc834
Removed unncessary options from WindowedDStream.
2013-12-26 14:17:16 -08:00
Tathagata Das
3618d70b2a
Added warning if filestream adds files with no data in them (file RDDs have 0 partitions).
2013-12-26 12:45:40 -08:00
Tathagata Das
be64719138
Changed file stream to not catch any exceptions related to finding new files (FileNotFound exception is still caught and ignored).
2013-12-26 12:33:12 -08:00
Tathagata Das
069cb14bdc
Updated groupByKeyAndWindow to be computed incrementally, and added mapSideCombine to combineByKeyAndWindow.
2013-12-26 02:58:29 -08:00
Tathagata Das
bacc65cf28
Removed slack time in file stream and added better handling of exceptions due to failures due FileNotFound exceptions.
2013-12-26 10:18:46 +00:00
Tathagata Das
d4dfab503a
Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.
2013-12-24 14:01:13 -08:00
Prashant Sharma
2573add94c
spark-544, introducing SparkConf and related configuration overhaul.
2013-12-25 00:09:36 +05:30
Tathagata Das
e9165d2a39
Merge branch 'scheduler-update' into window-improvement
2013-12-23 17:49:41 -08:00
Tathagata Das
0af7f84c8e
Minor formatting fixes.
2013-12-23 17:47:16 -08:00
Tathagata Das
8ca14a1e51
Updated testsuites to work with the slack time of file stream.
2013-12-23 16:27:00 -08:00
Tathagata Das
b31e91f927
Merge branch 'scheduler-update' into filestream-fix
2013-12-23 15:59:15 -08:00
Tathagata Das
6eaa050549
Minor change for PR 277.
2013-12-23 15:55:45 -08:00
Tathagata Das
19d1d58b67
Fixed bug in file stream that prevented some files from being read
...
correctly.
2013-12-23 23:48:43 +00:00
Tathagata Das
f9771690a6
Minor formatting fixes.
2013-12-23 11:32:26 -08:00
Tathagata Das
dc3ee6b612
Added comments to BatchInfo and JobSet, based on Patrick's comment on PR 277.
2013-12-23 11:30:42 -08:00
Tathagata Das
e7b62cbfbf
Updated CheckpointWriter and FileInputDStream to be robust against failed FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.
2013-12-22 18:49:36 -08:00
Tathagata Das
d91ec6f8ea
Merge branch 'scheduler-update' into filestream-fix
2013-12-22 15:23:35 -08:00
Tathagata Das
3ddbdbfbc7
Minor updated based on comments on PR 277.
2013-12-20 19:51:37 -08:00
Tathagata Das
de41c436a0
Merge branch 'scheduler-update' into window-improvement
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
2013-12-19 12:05:08 -08:00
Tathagata Das
984c582487
Merge branch 'scheduler-update' into filestream-fix
...
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
2013-12-19 11:20:48 -08:00
Tathagata Das
ec71b445ad
Minor changes.
2013-12-18 23:39:28 -08:00
Tathagata Das
e93b391d75
Merge branch 'apache-master' into scheduler-update
...
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
2013-12-18 17:51:14 -08:00
Tathagata Das
b80ec05635
Added StatsReportListener to generate processing time statistics across multiple batches.
2013-12-18 15:35:24 -08:00
Mark Hamstra
09ed7ddfa0
Use scala.binary.version in POMs
2013-12-15 12:39:58 -08:00
Tathagata Das
097e120c0c
Refactored streaming scheduler and added listener interface.
...
- Refactored Scheduler + JobManager to JobGenerator + JobScheduler and
added JobSet for cleaner code. Moved scheduler related code to
streaming.scheduler package.
- Added StreamingListener trait (similar to SparkListener) to enable
gathering to streaming stats like processing times and delays.
StreamingContext.addListener() to added listeners.
- Deduped some code in streaming tests by modifying TestSuiteBase, and
added StreamingListenerSuite.
2013-12-12 20:48:02 -08:00
Tathagata Das
5e9ce83d68
Fixed multiple file stream and checkpointing bugs.
...
- Made file stream more robust to transient failures.
- Changed Spark.setCheckpointDir API to not have the second
'useExisting' parameter. Spark will always create a unique directory
for checkpointing underneath the directory provide to the funtion.
- Fixed bug wrt local relative paths as checkpoint directory.
- Made DStream and RDD checkpointing use
SparkContext.hadoopConfiguration, so that more HDFS compatible
filesystems are supported for checkpointing.
2013-12-11 14:01:36 -08:00
Prashant Sharma
603af51bb5
Merge branch 'master' into akka-bug-fix
...
Conflicts:
core/pom.xml
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
pom.xml
project/SparkBuild.scala
streaming/pom.xml
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Prashant Sharma
17db6a9041
Style fixes and addressed review comments at #221
2013-12-10 11:47:16 +05:30
Prashant Sharma
7ad6921ae0
Incorporated Patrick's feedback comment on #211 and made maven build/dep-resolution atleast a bit faster.
2013-12-07 12:45:57 +05:30
Raymond Liu
4738818dd6
Fix pom.xml for maven build
2013-12-03 16:36:05 +08:00
Tathagata Das
03ef6e8899
Added flag in window operation to use partition awaare union.
2013-11-21 11:38:56 -08:00
Tathagata Das
fd031679df
Added partitioner aware union, modified DStream.window.
2013-11-21 11:28:37 -08:00
Tathagata Das
2ec4b2e38d
Added partition aware union to improve reduceByKeyAndWindow
2013-11-20 23:49:30 -08:00
Prashant Sharma
95d8dbce91
Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10-temp
...
Conflicts:
core/src/main/scala/org/apache/spark/util/collection/PrimitiveVector.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
2013-11-21 12:34:46 +05:30
Prashant Sharma
199e9cf02d
Merge branch 'scala210-master' of github.com:colorant/incubator-spark into scala-2.10
...
Conflicts:
core/src/main/scala/org/apache/spark/deploy/client/Client.scala
core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
2013-11-21 11:55:48 +05:30
Henry Saputra
10be58f251
Another set of changes to remove unnecessary semicolon (;) from Scala code.
...
Passed the sbt/sbt compile and test
2013-11-19 16:56:23 -08:00
Henry Saputra
9c934b640f
Remove the semicolons at the end of Scala code to make it more pure Scala code.
...
Also remove unused imports as I found them along the way.
Remove return statements when returning value in the Scala code.
Passing compile and tests.
2013-11-19 10:19:03 -08:00
Aaron Davidson
f629ba95b6
Various merge corrections
...
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.
@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
Raymond Liu
a60620b76a
Merge branch 'master' into scala-2.10
2013-11-14 12:44:19 +08:00
Raymond Liu
0f2e3c6e31
Merge branch 'master' into scala-2.10
2013-11-13 16:55:11 +08:00
Tathagata Das
7ccbbdacb9
Made block generator thread safe to fix Kafka bug.
2013-11-12 00:10:45 -08:00
Prashant Sharma
6860b79f6e
Remove deprecated actorFor and use actorSelection everywhere.
2013-11-12 12:43:53 +05:30
Tathagata Das
dc9570782a
Merge branch 'apache-master' into transform
2013-10-25 14:22:23 -07:00
Patrick Wendell
af4a529f6e
Exclude jopt from kafka dependency.
...
Kafka uses an older version of jopt that causes bad conflicts with the version
used by spark-perf. It's not easy to remove this downstream because of the way
that spark-perf uses Spark (by including a spark assembly as an unmanaged jar).
This fixes the problem at its source by just never including it.
2013-10-25 09:20:30 -07:00
Patrick Wendell
ad5f579cbf
Style fixes
2013-10-24 22:18:53 -07:00
Patrick Wendell
e5f6d5697b
Spacing fix
2013-10-24 22:08:06 -07:00
Patrick Wendell
a351fd4aed
Small spacing fix
2013-10-24 21:16:30 -07:00
Patrick Wendell
31e92b72e3
Adding Java versions and associated tests
2013-10-24 21:14:56 -07:00
Patrick Wendell
39f6f75588
Some clean-up of tests
2013-10-24 16:43:33 -07:00