Patrick Wendell
96a6ff0b09
Merge branch 'dev-merge' into datahandler-fix
Conflicts:
streaming/src/main/scala/spark/streaming/dstream/DataHandler.scala
2013-01-02 14:08:15 -08:00
Patrick Wendell
493d65ce65
Several code-quality improvements to DataHandler.
- Changed to more accurate name: BufferingBlockCreator
- Docstring now correctly reflects the abstraction
offered by the class
- Made internal methods private
- Fixed indentation problems
2013-01-02 13:39:18 -08:00
Josh Rosen
ce9f1bbe20
Add pyspark script to replace the other scripts.
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Tathagata Das
3dc87dd923
Fixed compilation bug in RDDSuite created during merge for mesos/master.
2013-01-01 16:38:04 -08:00
Tathagata Das
d34dba25c2
Merge branch 'mesos' into dev-merge
2013-01-01 15:48:39 -08:00
Josh Rosen
b58340dbd9
Rename top-level 'pyspark' directory to 'python'
2013-01-01 15:05:00 -08:00
Josh Rosen
170e451fbd
Minor documentation and style fixes for PySpark.
2013-01-01 13:52:14 -08:00
Tathagata Das
02497f0cd4
Updated Streaming Programming Guide.
2013-01-01 12:21:32 -08:00
Matei Zaharia
55809fbc6d
Merge pull request #349 from woggling/cache-finally
Avoid stalls when computation of cached RDD throws exception
2013-01-01 08:21:33 -08:00
Matei Zaharia
c593f6329e
Merge pull request #348 from JoshRosen/spark-597
Raise exception when hashing Java arrays (SPARK-597)
2013-01-01 08:20:06 -08:00
Charles Reiss
58072a7340
Remove some dead comments
2013-01-01 08:07:44 -08:00
Charles Reiss
21636ee4fa
Test with exception while computing cached RDD.
2013-01-01 08:07:40 -08:00
Charles Reiss
feadaf72f4
Mark key as not loading in CacheTracker even when compute() fails
2013-01-01 07:57:20 -08:00
Josh Rosen
f803953998
Raise exception when hashing Java arrays (SPARK-597)
2012-12-31 20:20:11 -08:00
Josh Rosen
6f6a6b79c4
Launch with scala by default in run-pyspark
2012-12-31 14:57:18 -08:00
Tathagata Das
18b9b3b99f
More classes made private[streaming] to hide from scala docs.
2012-12-30 20:00:42 -08:00
Tathagata Das
7e0271b438
Refactored a whole lot to push all DStreams into the spark.streaming.dstream package.
2012-12-30 15:19:55 -08:00
Tathagata Das
9e644402c1
Improved jekyll and scala docs. Made many classes and methods private to remove them from scala docs.
2012-12-29 18:31:51 -08:00
Josh Rosen
099898b439
Port LR example to PySpark using numpy.
This version of the example crashes after the first iteration with
"OverflowError: math range error" because Python's math.exp()
behaves differently than Scala's; see SPARK-646.
2012-12-29 18:00:28 -08:00
Josh Rosen
39dd953fd8
Add test for pyspark.RDD.saveAsTextFile().
2012-12-29 17:06:50 -08:00
Josh Rosen
59195c68ec
Update PySpark for compatibility with TaskContext.
2012-12-29 16:01:03 -08:00
Josh Rosen
c5cee53f20
Merge remote-tracking branch 'origin/master' into python-api
Conflicts:
docs/quick-start.md
2012-12-29 16:00:51 -08:00
Josh Rosen
26186e2d25
Use batching in pyspark parallelize(); fix cartesian()
2012-12-29 15:34:57 -08:00
Matei Zaharia
3f74f729a1
Merge pull request #345 from JoshRosen/fix/add-file
Fix deletion of files in current working directory by clearFiles()
2012-12-29 15:01:33 -08:00
Josh Rosen
6ee1ff2663
Fix bug in pyspark.serializers.batch; add .gitignore.
2012-12-29 22:25:34 +00:00
Patrick Wendell
518111573f
Merge pull request #8 from radlab/twitter-example
Adding a Twitter InputDStream with an example
2012-12-29 14:23:01 -08:00
Josh Rosen
c2b105af34
Add documentation for Python API.
2012-12-28 22:51:28 -08:00
Josh Rosen
7ec3595de2
Fix bug (introduced by batching) in PySpark take()
2012-12-28 22:21:16 -08:00
Josh Rosen
397e67103c
Change Utils.fetchFile() warning to SparkException.
2012-12-28 17:37:13 -08:00
Josh Rosen
d64fa72d2e
Add addFile() and addJar() to JavaSparkContext.
2012-12-28 17:00:57 -08:00
Josh Rosen
bd237d4a9d
Add synchronization to LocalScheduler.updateDependencies().
2012-12-28 17:00:57 -08:00
Josh Rosen
f1bf4f0385
Skip deletion of files in clearFiles().
This fixes an issue where Spark could delete
original files in the current working directory
that were added to the job using addFile().
There was also the potential for addFile() to
overwrite local files, which is addressed by
changing Utils.fetchFile() to log a warning
instead of overwriting a file with new contents.
This is a short-term fix; a better long-term
solution would be to remove the dependence on
storing files in the current working directory,
since we can't change the cwd from Java.
2012-12-28 17:00:57 -08:00
Josh Rosen
fbadb1cda5
Mark api.python classes as private; echo Java output to stderr.
2012-12-28 09:06:11 -08:00
Josh Rosen
665466dfff
Simplify PySpark installation.
- Bundle Py4J binaries, since it's hard to install
- Uses Spark's `run` script to launch the Py4J
gateway, inheriting the settings in spark-env.sh
With these changes, (hopefully) nothing more than
running `sbt/sbt package` will be necessary to run
PySpark.
2012-12-27 22:47:37 -08:00
Josh Rosen
ac32447cd3
Use addFile() to ship code to cluster in PySpark.
Add options to pyspark.SparkContext constructor.
2012-12-27 19:59:04 -08:00
Josh Rosen
85b8f2c64f
Add epydoc API documentation for PySpark.
2012-12-27 18:04:10 -08:00
Tathagata Das
0bc0a60d30
Modifications to make sure LocalScheduler terminates cleanly without errors when SparkContext is shut down, to minimize spurious exceptions during master failure tests.
2012-12-27 15:37:33 -08:00
Josh Rosen
2d98fff065
Add IPython support to pyspark-shell.
Suggested by / based on code from @MLnick
2012-12-27 10:17:36 -08:00
Tathagata Das
7c33f76291
Merge branch 'mesos' into dev-merge
2012-12-26 19:19:07 -08:00
Tathagata Das
836042bb9f
Merge branch 'dev-checkpoint' of github.com:radlab/spark into dev-merge
Conflicts:
core/src/main/scala/spark/ParallelCollection.scala
core/src/main/scala/spark/RDD.scala
core/src/main/scala/spark/rdd/BlockRDD.scala
core/src/main/scala/spark/rdd/CartesianRDD.scala
core/src/main/scala/spark/rdd/CoGroupedRDD.scala
core/src/main/scala/spark/rdd/CoalescedRDD.scala
core/src/main/scala/spark/rdd/FilteredRDD.scala
core/src/main/scala/spark/rdd/FlatMappedRDD.scala
core/src/main/scala/spark/rdd/GlommedRDD.scala
core/src/main/scala/spark/rdd/HadoopRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsRDD.scala
core/src/main/scala/spark/rdd/MapPartitionsWithSplitRDD.scala
core/src/main/scala/spark/rdd/MappedRDD.scala
core/src/main/scala/spark/rdd/PipedRDD.scala
core/src/main/scala/spark/rdd/SampledRDD.scala
core/src/main/scala/spark/rdd/ShuffledRDD.scala
core/src/main/scala/spark/rdd/UnionRDD.scala
core/src/main/scala/spark/scheduler/ResultTask.scala
core/src/test/scala/spark/CheckpointSuite.scala
2012-12-26 19:09:01 -08:00
Josh Rosen
1dca0c5180
Remove debug output from PythonPartitioner.
2012-12-26 18:23:06 -08:00
Josh Rosen
e2dad15621
Add support for batched serialization of Python objects in PySpark.
2012-12-26 18:16:09 -08:00
Josh Rosen
4608902fb8
Use filesystem to collect RDDs in PySpark.
Passing large volumes of data through Py4J seems
to be slow. It appears to be faster to write the
data to the local filesystem and read it back from
Python.
2012-12-24 17:20:10 -08:00
Matei Zaharia
84587a9bf3
Merge pull request #343 from markhamstra/spark-601
lookup() needn't fail when there is no partitioner
2012-12-24 15:28:05 -08:00
Josh Rosen
ccd075cf96
Reduce object overhead in PySpark shuffle and collect
2012-12-24 15:01:13 -08:00
Mark Hamstra
903f3518df
fall back to filter-map-collect when calling lookup() on an RDD without a partitioner
2012-12-24 13:18:45 -08:00
Matei Zaharia
b575cbe069
Merge pull request #342 from markhamstra/spark-645
Allow distinct() to be called without parentheses
2012-12-24 08:04:50 -08:00
Mark Hamstra
61be8566e2
Allow distinct() to be called without parentheses when using the default number of splits.
2012-12-24 02:36:47 -08:00
Patrick Wendell
bce84ceabb
Minor changes after review and general cleanup.
- Added filters to Twitter example
- Removed unused import
- Some code clean-up
2012-12-21 20:57:46 -08:00
Patrick Wendell
9ac4cb1c5f
Adding a Twitter InputDStream with an example
2012-12-21 17:18:19 -08:00