Commit graph

380 commits

Author SHA1 Message Date
Tathagata Das fcd17a1e8e Fixed comments and long lines based on comments on PR 289. 2013-12-31 02:01:45 -08:00
Tathagata Das 271e3237f3 Minor changes in comments and strings to address comments in PR 289. 2013-12-27 12:26:57 -08:00
Tathagata Das 3618d70b2a Added warning if filestream adds files with no data in them (file RDDs have 0 partitions). 2013-12-26 12:45:40 -08:00
Tathagata Das be64719138 Changed file stream to not catch any exceptions related to finding new files (FileNotFound exception is still caught and ignored). 2013-12-26 12:33:12 -08:00
Tathagata Das bacc65cf28 Removed slack time in file stream and added better handling of exceptions due to failures due FileNotFound exceptions. 2013-12-26 10:18:46 +00:00
Tathagata Das d4dfab503a Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. 2013-12-24 14:01:13 -08:00
Tathagata Das 0af7f84c8e Minor formatting fixes. 2013-12-23 17:47:16 -08:00
Tathagata Das 8ca14a1e51 Updated testsuites to work with the slack time of file stream. 2013-12-23 16:27:00 -08:00
Tathagata Das b31e91f927 Merge branch 'scheduler-update' into filestream-fix 2013-12-23 15:59:15 -08:00
Tathagata Das 6eaa050549 Minor change for PR 277. 2013-12-23 15:55:45 -08:00
Tathagata Das 19d1d58b67 Fixed bug in file stream that prevented some files from being read
correctly.
2013-12-23 23:48:43 +00:00
Tathagata Das f9771690a6 Minor formatting fixes. 2013-12-23 11:32:26 -08:00
Tathagata Das dc3ee6b612 Added comments to BatchInfo and JobSet, based on Patrick's comment on PR 277. 2013-12-23 11:30:42 -08:00
Tathagata Das e7b62cbfbf Updated CheckpointWriter and FileInputDStream to be robust against failed FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded. 2013-12-22 18:49:36 -08:00
Tathagata Das d91ec6f8ea Merge branch 'scheduler-update' into filestream-fix 2013-12-22 15:23:35 -08:00
Tathagata Das 3ddbdbfbc7 Minor updated based on comments on PR 277. 2013-12-20 19:51:37 -08:00
Tathagata Das 984c582487 Merge branch 'scheduler-update' into filestream-fix
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
2013-12-19 11:20:48 -08:00
Tathagata Das ec71b445ad Minor changes. 2013-12-18 23:39:28 -08:00
Tathagata Das e93b391d75 Merge branch 'apache-master' into scheduler-update
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
2013-12-18 17:51:14 -08:00
Tathagata Das b80ec05635 Added StatsReportListener to generate processing time statistics across multiple batches. 2013-12-18 15:35:24 -08:00
Tathagata Das 097e120c0c Refactored streaming scheduler and added listener interface.
- Refactored Scheduler + JobManager to JobGenerator + JobScheduler and
  added JobSet for cleaner code. Moved scheduler related code to
  streaming.scheduler package.
- Added StreamingListener trait (similar to SparkListener) to enable
  gathering to streaming stats like processing times and delays.
  StreamingContext.addListener() to added listeners.
- Deduped some code in streaming tests by modifying TestSuiteBase, and
  added StreamingListenerSuite.
2013-12-12 20:48:02 -08:00
Tathagata Das 5e9ce83d68 Fixed multiple file stream and checkpointing bugs.
- Made file stream more robust to transient failures.
- Changed Spark.setCheckpointDir API to not have the second
  'useExisting' parameter. Spark will always create a unique directory
  for checkpointing underneath the directory provide to the funtion.
- Fixed bug wrt local relative paths as checkpoint directory.
- Made DStream and RDD checkpointing use
  SparkContext.hadoopConfiguration, so that more HDFS compatible
  filesystems are supported for checkpointing.
2013-12-11 14:01:36 -08:00
Prashant Sharma 17db6a9041 Style fixes and addressed review comments at #221 2013-12-10 11:47:16 +05:30
Prashant Sharma 95d8dbce91 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10-temp
Conflicts:
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveVector.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
2013-11-21 12:34:46 +05:30
Prashant Sharma 199e9cf02d Merge branch 'scala210-master' of github.com:colorant/incubator-spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/client/Client.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
2013-11-21 11:55:48 +05:30
Henry Saputra 10be58f251 Another set of changes to remove unnecessary semicolon (;) from Scala code.
Passed the sbt/sbt compile and test
2013-11-19 16:56:23 -08:00
Henry Saputra 9c934b640f Remove the semicolons at the end of Scala code to make it more pure Scala code.
Also remove unused imports as I found them along the way.
Remove return statements when returning value in the Scala code.

Passing compile and tests.
2013-11-19 10:19:03 -08:00
Aaron Davidson f629ba95b6 Various merge corrections
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.

@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
Raymond Liu a60620b76a Merge branch 'master' into scala-2.10 2013-11-14 12:44:19 +08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
Tathagata Das 7ccbbdacb9 Made block generator thread safe to fix Kafka bug. 2013-11-12 00:10:45 -08:00
Prashant Sharma 6860b79f6e Remove deprecated actorFor and use actorSelection everywhere. 2013-11-12 12:43:53 +05:30
Tathagata Das dc9570782a Merge branch 'apache-master' into transform 2013-10-25 14:22:23 -07:00
Patrick Wendell ad5f579cbf Style fixes 2013-10-24 22:18:53 -07:00
Patrick Wendell e5f6d5697b Spacing fix 2013-10-24 22:08:06 -07:00
Patrick Wendell a351fd4aed Small spacing fix 2013-10-24 21:16:30 -07:00
Patrick Wendell 31e92b72e3 Adding Java versions and associated tests 2013-10-24 21:14:56 -07:00
Patrick Wendell 39f6f75588 Some clean-up of tests 2013-10-24 16:43:33 -07:00
Tathagata Das e962a6e6ee Fixed accidental bug. 2013-10-24 15:17:26 -07:00
Patrick Wendell 9423532fab Removing Java for now 2013-10-24 14:31:34 -07:00
Patrick Wendell 05ac9940ee Adding tests 2013-10-24 14:31:34 -07:00
Patrick Wendell 08c1a42d7d Add a repartition operator.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
2013-10-24 14:31:33 -07:00
Tathagata Das 0400aba1c0 Merge branch 'apache-master' into transform 2013-10-24 11:05:00 -07:00
Tathagata Das bacfe5ebca Added JavaStreamingContext.transform 2013-10-24 10:56:24 -07:00
Matei Zaharia dd659642e7 Merge pull request #64 from prabeesh/master
MQTT Adapter for Spark Streaming

MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol.
It was designed as an extremely lightweight publish/subscribe messaging transport. You may read more about it here http://mqtt.org/

Message Queue Telemetry Transport (MQTT) is an open message protocol for M2M communications. It enables the transfer of telemetry-style data in the form of messages from devices like sensors and actuators, to mobile phones, embedded systems on vehicles, or laptops and full scale computers.

The protocol was invented by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions

This protocol enables a publish/subscribe messaging model in an extremely lightweight way. It is useful for connections with remote locations where line of code and network bandwidth is a constraint.

MQTT is one of the widely used protocol for 'Internet of Things'. This protocol is getting much attraction as anything and everything is getting connected to internet and they all produce data. Researchers and companies predict some 25 billion devices will be connected to the internet by 2015.

Plugin/Support for MQTT is available in popular MQs like RabbitMQ, ActiveMQ etc.

Support for MQTT in Spark will help people with Internet of Things (IoT) projects to use Spark Streaming for their real time data processing needs (from sensors and other embedded devices etc).
2013-10-23 15:07:59 -07:00
Tathagata Das fe8626efd1 Merge branch 'apache-master' into transform 2013-10-22 23:40:40 -07:00
Tathagata Das 72d2e1dd77 Fixed bug in Java transformWith, added more Java testcases for transform and transformWith, added missing variations of Java join and cogroup, updated various Scala and Java API docs. 2013-10-22 23:35:51 -07:00
Matei Zaharia 731c94e91d Merge pull request #56 from jerryshao/kafka-0.8-dev
Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming

Conflicts:
	streaming/pom.xml
2013-10-21 23:31:38 -07:00
Tathagata Das 0666498799 Updated TransformDStream to allow n-ary DStream transform. Added transformWith, leftOuterJoin and rightOuterJoin operations to DStream for Scala and Java APIs. Also added n-ary union and n-ary transform operations to StreamingContext for Scala and Java APIs. 2013-10-21 05:34:09 -07:00
Prabeesh K d223d38933 Update MQTTInputDStream.scala 2013-10-18 09:09:49 +05:30