Commit graph

1951 commits

Author SHA1 Message Date
Tathagata Das 934ecc829a Removed streaming-env.sh.template 2013-01-06 14:15:07 -08:00
Stephen Haberman 8dc06069fe Rename RDD.tupleBy to keyBy. 2013-01-06 15:21:45 -06:00
Matei Zaharia 8fd3a70c18 Add PairRDD.keys() and values() to Java API 2013-01-05 22:46:45 -05:00
Matei Zaharia b1663752c6 Merge pull request #351 from stephenh/values
Add PairRDDFunctions.keys and values.
2013-01-05 19:15:54 -08:00
Matei Zaharia 0982572519 Add methods called just 'accumulator' for int/double in Java API 2013-01-05 22:11:28 -05:00
Matei Zaharia 86af64b0a6 Fix Accumulators in Java, and add a test for them 2013-01-05 20:55:17 -05:00
Matei Zaharia ecf9c08901 Fix Accumulators in Java, and add a test for them 2013-01-05 20:54:08 -05:00
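The three accumulator commits above fix the Java-side wrappers and add plainly-named `accumulator` methods for int and double. As a hedged sketch of the underlying API (Scala shown for brevity, even though the commits target the Java API; names assume the 2013-era `spark` package):

```scala
import spark.SparkContext
import spark.SparkContext._  // implicit AccumulatorParam instances

object AccumulatorExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "AccumulatorExample")
    // An accumulator is written from tasks and read back on the driver.
    val sum = sc.accumulator(0)
    sc.parallelize(1 to 100).foreach(x => sum += x)
    println(sum.value)  // 5050, read on the driver
  }
}
```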
Stephen Haberman 1fdb6946b5 Add RDD.tupleBy. 2013-01-05 13:07:59 -06:00
Stephen Haberman 6a0db3b449 Fix typo. 2013-01-05 12:56:17 -06:00
Matei Zaharia 7ab9f09140 Merge pull request #352 from stephenh/collect
Add RDD.collect(PartialFunction).
2013-01-05 10:17:20 -08:00
Stephen Haberman f4e6b9361f Add RDD.collect(PartialFunction). 2013-01-05 12:14:08 -06:00
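A hedged sketch of the overload this commit adds: `collect` with a `PartialFunction` filters and maps in one pass, keeping only elements the function is defined at. Assumes a live SparkContext named `sc`:

```scala
// Keep only the Ints, doubling them as we go; strings are dropped
// because the partial function is not defined at them.
val mixed = sc.parallelize(Seq[Any](1, "two", 3, "four", 5))
val doubledInts = mixed.collect { case n: Int => n * 2 }   // RDD[Int]
println(doubledInts.collect().mkString(", "))              // 2, 6, 10
```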
Stephen Haberman 8d57c78c83 Add PairRDDFunctions.keys and values. 2013-01-05 12:04:01 -06:00
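Taken together with the `tupleBy`-to-`keyBy` rename near the top of this page, these additions round out the pair-RDD basics. A hedged sketch, again assuming a SparkContext named `sc`:

```scala
import spark.SparkContext._  // implicit conversion to PairRDDFunctions

val words = sc.parallelize(Seq("apple", "fig", "cherry"))
val byLength = words.keyBy(_.length)  // RDD[(Int, String)], via the renamed keyBy
println(byLength.keys.collect().mkString(", "))    // 5, 3, 6
println(byLength.values.collect().mkString(", "))  // apple, fig, cherry
```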
Josh Rosen 33beba3965 Change PySpark RDD.take() to not call iterator(). 2013-01-03 14:52:21 -08:00
Patrick Wendell c438faeac4 Merge pull request #10 from radlab/datahandler-fix
Several code-quality improvements to DataHandler.
2013-01-02 17:07:12 -08:00
Patrick Wendell 2ef993d159 BufferingBlockCreator -> NetworkReceiver.BlockGenerator 2013-01-02 14:19:51 -08:00
Patrick Wendell 96a6ff0b09 Merge branch 'dev-merge' into datahandler-fix
Conflicts:
	streaming/src/main/scala/spark/streaming/dstream/DataHandler.scala
2013-01-02 14:08:15 -08:00
Patrick Wendell 493d65ce65 Several code-quality improvements to DataHandler.
- Changed to more accurate name: BufferingBlockCreator
- Docstring now correctly reflects the abstraction
  offered by the class
- Made internal methods private
- Fixed indentation problems
2013-01-02 13:39:18 -08:00
Josh Rosen ce9f1bbe20 Add pyspark script to replace the other scripts.
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Tathagata Das 3dc87dd923 Fixed compilation bug in RDDSuite created during merge for mesos/master. 2013-01-01 16:38:04 -08:00
Tathagata Das d34dba25c2 Merge branch 'mesos' into dev-merge 2013-01-01 15:48:39 -08:00
Josh Rosen b58340dbd9 Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
Josh Rosen 170e451fbd Minor documentation and style fixes for PySpark. 2013-01-01 13:52:14 -08:00
Tathagata Das 02497f0cd4 Updated Streaming Programming Guide. 2013-01-01 12:21:32 -08:00
Matei Zaharia 55809fbc6d Merge pull request #349 from woggling/cache-finally
Avoid stalls when computation of cached RDD throws exception
2013-01-01 08:21:33 -08:00
Matei Zaharia c593f6329e Merge pull request #348 from JoshRosen/spark-597
Raise exception when hashing Java arrays (SPARK-597)
2013-01-01 08:20:06 -08:00
Charles Reiss 58072a7340 Remove some dead comments 2013-01-01 08:07:44 -08:00
Charles Reiss 21636ee4fa Test with exception while computing cached RDD. 2013-01-01 08:07:40 -08:00
Charles Reiss feadaf72f4 Mark key as not loading in CacheTracker even when compute() fails 2013-01-01 07:57:20 -08:00
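The shape of the fix is suggested by the branch name, cache-finally: clear the per-key loading flag in a `finally` block so a failed `compute()` cannot leave other threads waiting on the key forever. A hypothetical sketch of the pattern, not the actual CacheTracker code:

```scala
import scala.collection.mutable.HashSet

// Keys some thread is currently computing; waiters block on this set.
val loading = new HashSet[String]

def getOrCompute[T](key: String)(compute: => T): T = {
  loading.synchronized { loading += key }
  try {
    compute  // may throw; the finally below still runs
  } finally {
    loading.synchronized {
      loading -= key
      loading.notifyAll()  // wake any threads waiting for this key
    }
  }
}
```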
Josh Rosen f803953998 Raise exception when hashing Java arrays (SPARK-597) 2012-12-31 20:20:11 -08:00
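The hazard behind SPARK-597: Java (and therefore Scala) arrays hash by reference identity, not by contents, so using them as keys silently sends equal values to different partitions. Raising an exception surfaces the mistake early:

```scala
// Two arrays with equal contents still hash by identity, which is why
// hashing them is now rejected with an exception.
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
println(a.sameElements(b))          // true: the contents are equal
println(a.hashCode == b.hashCode)   // almost certainly false: identity hash
```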
Josh Rosen 6f6a6b79c4 Launch with scala by default in run-pyspark 2012-12-31 14:57:18 -08:00
Tathagata Das 18b9b3b99f More classes made private[streaming] to hide from scala docs. 2012-12-30 20:00:42 -08:00
Tathagata Das 7e0271b438 Refactored a whole lot to push all DStreams into the spark.streaming.dstream package. 2012-12-30 15:19:55 -08:00
Tathagata Das 9e644402c1 Improved jekyll and scala docs. Made many classes and methods private to remove them from scala docs. 2012-12-29 18:31:51 -08:00
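The mechanism in these three commits is Scala's scoped access modifier: `private[streaming]` makes a class usable anywhere under the `spark.streaming` package while keeping it out of the public scaladoc. A minimal sketch, borrowing a class name from this log:

```scala
package spark.streaming.dstream

// Visible throughout spark.streaming, but hidden from external users
// and omitted from the generated scaladoc.
private[streaming] class BlockGenerator
```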
Josh Rosen 099898b439 Port LR example to PySpark using numpy.
This version of the example crashes after the first iteration with
"OverflowError: math range error" because Python's math.exp()
behaves differently than Scala's; see SPARK-646.
2012-12-29 18:00:28 -08:00
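The divergence behind SPARK-646, illustrated from the Scala side (the Python behavior is noted in comments):

```scala
// Scala (like Java) saturates on overflow rather than throwing.
println(math.exp(1000))  // prints Infinity
// Python's math.exp(1000) instead raises
// "OverflowError: math range error" -- the crash described above.
```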
Josh Rosen 39dd953fd8 Add test for pyspark.RDD.saveAsTextFile(). 2012-12-29 17:06:50 -08:00
Josh Rosen 59195c68ec Update PySpark for compatibility with TaskContext. 2012-12-29 16:01:03 -08:00
Josh Rosen c5cee53f20 Merge remote-tracking branch 'origin/master' into python-api
Conflicts:
	docs/quick-start.md
2012-12-29 16:00:51 -08:00
Josh Rosen 26186e2d25 Use batching in pyspark parallelize(); fix cartesian() 2012-12-29 15:34:57 -08:00
Matei Zaharia 3f74f729a1 Merge pull request #345 from JoshRosen/fix/add-file
Fix deletion of files in current working directory by clearFiles()
2012-12-29 15:01:33 -08:00
Josh Rosen 6ee1ff2663 Fix bug in pyspark.serializers.batch; add .gitignore. 2012-12-29 22:25:34 +00:00
Patrick Wendell 518111573f Merge pull request #8 from radlab/twitter-example
Adding a Twitter InputDStream with an example
2012-12-29 14:23:01 -08:00
Josh Rosen c2b105af34 Add documentation for Python API. 2012-12-28 22:51:28 -08:00
Josh Rosen 7ec3595de2 Fix bug (introduced by batching) in PySpark take() 2012-12-28 22:21:16 -08:00
Josh Rosen 397e67103c Change Utils.fetchFile() warning to SparkException. 2012-12-28 17:37:13 -08:00
Josh Rosen d64fa72d2e Add addFile() and addJar() to JavaSparkContext. 2012-12-28 17:00:57 -08:00
Josh Rosen bd237d4a9d Add synchronization to LocalScheduler.updateDependencies(). 2012-12-28 17:00:57 -08:00
Josh Rosen f1bf4f0385 Skip deletion of files in clearFiles().
This fixes an issue where Spark could delete
original files in the current working directory
that were added to the job using addFile().

There was also the potential for addFile() to
overwrite local files, which is addressed by
changing Utils.fetchFile() to log a warning
instead of overwriting a file with new contents.

This is a short-term fix; a better long-term
solution would be to remove the dependence on
storing files in the current working directory,
since we can't change the cwd from Java.
2012-12-28 17:00:57 -08:00
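A hypothetical sketch of the guard this commit message describes, not the actual Utils.fetchFile() code: if the target file already exists, log a warning and leave it alone, which also protects originals sitting in the current working directory:

```scala
import java.io.File

// Hypothetical: refuse to clobber an existing local file, since it may
// be the user's original copy in the current working directory.
def fetchFile(fileName: String, targetDir: File)(download: File => Unit): Unit = {
  val target = new File(targetDir, fileName)
  if (target.exists()) {
    System.err.println("WARNING: " + fileName + " already exists in " +
      targetDir + "; not overwriting")
  } else {
    download(target)  // fetch the remote contents into the fresh target
  }
}
```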
Josh Rosen fbadb1cda5 Mark api.python classes as private; echo Java output to stderr. 2012-12-28 09:06:11 -08:00
Josh Rosen 665466dfff Simplify PySpark installation.
- Bundle Py4J binaries, since it's hard to install
- Uses Spark's `run` script to launch the Py4J
  gateway, inheriting the settings in spark-env.sh

With these changes, (hopefully) nothing more than
running `sbt/sbt package` will be necessary to run
PySpark.
2012-12-27 22:47:37 -08:00
Josh Rosen ac32447cd3 Use addFile() to ship code to cluster in PySpark.
Add options to pyspark.SparkContext constructor.
2012-12-27 19:59:04 -08:00