Commit graph

1589 commits

Author SHA1 Message Date
Shivaram Venkataraman aed368a970 Update Hadoop dependency to 1.0.3, as 0.20 has Sun-specific dependencies. Also
fix SequenceFileRDDFunctions to pick the right type conversion across Hadoop
versions
2013-01-07 15:57:33 -08:00
Shivaram Venkataraman f8d579a0c0 Remove dependencies on Sun JVM classes; instead, use reflection to infer
HotSpot options and total physical memory size
2013-01-07 15:57:18 -08:00
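
The reflection approach this commit describes is worth a sketch. A minimal illustration of the idea (not Spark's actual code): look up the HotSpot-specific `getTotalPhysicalMemorySize` accessor on the JMX `OperatingSystemMXBean` by name, so no `com.sun.*` class is referenced at compile time.

```scala
import java.lang.management.ManagementFactory

object PhysicalMemory {
  // Resolve the HotSpot-only accessor reflectively; on JVMs that
  // lack it, report the size as unknown instead of failing to compile.
  def totalBytes: Option[Long] =
    try {
      val bean = ManagementFactory.getOperatingSystemMXBean
      val getter = bean.getClass.getMethod("getTotalPhysicalMemorySize")
      getter.setAccessible(true)
      Some(getter.invoke(bean).asInstanceOf[Long])
    } catch {
      case _: Exception => None
    }

  def main(args: Array[String]): Unit =
    println(totalBytes.fold("unknown")(b => s"$b bytes"))
}
```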
Matei Zaharia 1941d9602d Merge branch 'master' of github.com:mesos/spark 2013-01-07 16:50:39 -05:00
Matei Zaharia 9c32f300fb Add Accumulable.setValue for easier use in Java 2013-01-07 16:50:23 -05:00
Stephen Haberman 8dc06069fe Rename RDD.tupleBy to keyBy. 2013-01-06 15:21:45 -06:00
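
For context, `keyBy(f)` pairs each element with a key derived from it, so `rdd.keyBy(_.length)` yields `(length, element)` pairs. The semantics, sketched on a plain Scala collection rather than an RDD:

```scala
// keyBy(f) turns each element x into the pair (f(x), x).
def keyBy[T, K](xs: Seq[T])(f: T => K): Seq[(K, T)] = xs.map(x => (f(x), x))

keyBy(Seq("apple", "banana", "cherry"))(_.length)
// Seq((5, "apple"), (6, "banana"), (6, "cherry"))
```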
Matei Zaharia 8fd3a70c18 Add PairRDD.keys() and values() to Java API 2013-01-05 22:46:45 -05:00
Matei Zaharia b1663752c6 Merge pull request #351 from stephenh/values
Add PairRDDFunctions.keys and values.
2013-01-05 19:15:54 -08:00
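
`keys` and `values` are simple projections of a pair RDD. The equivalent on plain Scala pairs, for illustration:

```scala
val pairs = Seq(1 -> "a", 2 -> "b", 3 -> "c")
val ks = pairs.map(_._1) // Seq(1, 2, 3): what rdd.keys returns
val vs = pairs.map(_._2) // Seq("a", "b", "c"): what rdd.values returns
```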
Matei Zaharia 0982572519 Add methods called just 'accumulator' for int/double in Java API 2013-01-05 22:11:28 -05:00
Matei Zaharia 86af64b0a6 Fix Accumulators in Java, and add a test for them 2013-01-05 20:55:17 -05:00
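
The Java friction these accumulator commits address comes largely from Scala's setter syntax: `value_=` compiles to the mangled name `value_$eq`, which is awkward to call from Java, hence a plain `setValue`. A toy sketch of the shape (a hypothetical class, not Spark's `Accumulable`):

```scala
class Box[T](private var _value: T)(implicit num: Numeric[T]) {
  def value: T = _value
  def value_=(v: T): Unit = _value = v  // Scala callers: box.value = v
  def setValue(v: T): Unit = _value = v // Java callers: box.setValue(v)
  def add(term: T): Unit = _value = num.plus(_value, term)
}
```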
Stephen Haberman 1fdb6946b5 Add RDD.tupleBy. 2013-01-05 13:07:59 -06:00
Stephen Haberman 6a0db3b449 Fix typo. 2013-01-05 12:56:17 -06:00
Matei Zaharia 7ab9f09140 Merge pull request #352 from stephenh/collect
Add RDD.collect(PartialFunction).
2013-01-05 10:17:20 -08:00
Stephen Haberman f4e6b9361f Add RDD.collect(PartialFunction). 2013-01-05 12:14:08 -06:00
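
Scala's standard collections define the same overload, and on an RDD it behaves like `rdd.filter(pf.isDefinedAt).map(pf)`: keep only the elements the partial function is defined at, then apply it. For example:

```scala
val mixed: Seq[Any] = Seq(1, "two", 3, "four")
mixed.collect { case i: Int => i * 10 } // Seq(10, 30)
```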
Stephen Haberman 8d57c78c83 Add PairRDDFunctions.keys and values. 2013-01-05 12:04:01 -06:00
Josh Rosen 33beba3965 Change PySpark RDD.take() to not call iterator(). 2013-01-03 14:52:21 -08:00
Josh Rosen ce9f1bbe20 Add pyspark script to replace the other scripts.
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Josh Rosen b58340dbd9 Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
Josh Rosen 170e451fbd Minor documentation and style fixes for PySpark. 2013-01-01 13:52:14 -08:00
Matei Zaharia 55809fbc6d Merge pull request #349 from woggling/cache-finally
Avoid stalls when computation of cached RDD throws exception
2013-01-01 08:21:33 -08:00
Matei Zaharia c593f6329e Merge pull request #348 from JoshRosen/spark-597
Raise exception when hashing Java arrays (SPARK-597)
2013-01-01 08:20:06 -08:00
Charles Reiss 58072a7340 Remove some dead comments 2013-01-01 08:07:44 -08:00
Charles Reiss 21636ee4fa Test with exception while computing cached RDD. 2013-01-01 08:07:40 -08:00
Charles Reiss feadaf72f4 Mark key as not loading in CacheTracker even when compute() fails 2013-01-01 07:57:20 -08:00
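
The stall these commits fix: if `compute()` threw while a key was marked as loading, threads waiting on that key would block forever. A minimal sketch of the mark-and-clear pattern (the real CacheTracker also stores the computed value, elided here):

```scala
val loading = new scala.collection.mutable.HashSet[String]

def getOrCompute[T](key: String)(compute: => T): T = {
  loading.synchronized {
    while (loading.contains(key)) loading.wait() // another thread is computing this key
    loading += key
  }
  try compute // may throw; the finally block must still clear the mark
  finally loading.synchronized { loading -= key; loading.notifyAll() }
}
```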
Josh Rosen f803953998 Raise exception when hashing Java arrays (SPARK-597) 2012-12-31 20:20:11 -08:00
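
The bug behind SPARK-597: JVM arrays inherit identity `hashCode`, so two arrays with equal contents hash differently and silently land in different partitions. Failing fast is the fix; a sketch of such a check (the exception type here is illustrative, not Spark's):

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
val sameHash = a.hashCode == b.hashCode // false: identity-based, despite equal contents

def checkHashable(key: Any): Unit =
  if (key.isInstanceOf[Array[_]])
    throw new IllegalArgumentException("array keys use identity hashCode; wrap them in a case class instead")
```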
Josh Rosen 6f6a6b79c4 Launch with scala by default in run-pyspark 2012-12-31 14:57:18 -08:00
Josh Rosen 099898b439 Port LR example to PySpark using numpy.
This version of the example crashes after the first iteration with
"OverflowError: math range error" because Python's math.exp()
behaves differently than Scala's; see SPARK-646.
2012-12-29 18:00:28 -08:00
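
The difference behind SPARK-646, seen from the Scala side: where Python's `math.exp(1000)` raises `OverflowError: math range error`, Scala (like Java) saturates and the iteration keeps going:

```scala
println(math.exp(1000)) // Infinity: no exception on the JVM side
```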
Josh Rosen 39dd953fd8 Add test for pyspark.RDD.saveAsTextFile(). 2012-12-29 17:06:50 -08:00
Josh Rosen 59195c68ec Update PySpark for compatibility with TaskContext. 2012-12-29 16:01:03 -08:00
Josh Rosen c5cee53f20 Merge remote-tracking branch 'origin/master' into python-api
Conflicts:
	docs/quick-start.md
2012-12-29 16:00:51 -08:00
Josh Rosen 26186e2d25 Use batching in pyspark parallelize(); fix cartesian() 2012-12-29 15:34:57 -08:00
Matei Zaharia 3f74f729a1 Merge pull request #345 from JoshRosen/fix/add-file
Fix deletion of files in current working directory by clearFiles()
2012-12-29 15:01:33 -08:00
Josh Rosen 6ee1ff2663 Fix bug in pyspark.serializers.batch; add .gitignore. 2012-12-29 22:25:34 +00:00
Josh Rosen c2b105af34 Add documentation for Python API. 2012-12-28 22:51:28 -08:00
Josh Rosen 7ec3595de2 Fix bug (introduced by batching) in PySpark take() 2012-12-28 22:21:16 -08:00
Josh Rosen 397e67103c Change Utils.fetchFile() warning to SparkException. 2012-12-28 17:37:13 -08:00
Josh Rosen d64fa72d2e Add addFile() and addJar() to JavaSparkContext. 2012-12-28 17:00:57 -08:00
Josh Rosen bd237d4a9d Add synchronization to LocalScheduler.updateDependencies(). 2012-12-28 17:00:57 -08:00
Josh Rosen f1bf4f0385 Skip deletion of files in clearFiles().
This fixes an issue where Spark could delete
original files in the current working directory
that were added to the job using addFile().

There was also the potential for addFile() to
overwrite local files, which is addressed by
changing Utils.fetchFile() to log a warning
instead of overwriting a file with new contents.

This is a short-term fix; a better long-term
solution would be to remove the dependence on
storing files in the current working directory,
since we can't change the cwd from Java.
2012-12-28 17:00:57 -08:00
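
A sketch of the overwrite guard described above (a hypothetical helper, not Spark's `Utils.fetchFile`): if the destination already exists in the working directory, keep it and warn rather than clobber it.

```scala
import java.io.File

def fetchTo(dest: File)(download: File => Unit): Unit =
  if (dest.exists())
    System.err.println(s"Warning: not overwriting existing file ${dest.getPath}")
  else
    download(dest)
```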
Josh Rosen fbadb1cda5 Mark api.python classes as private; echo Java output to stderr. 2012-12-28 09:06:11 -08:00
Josh Rosen 665466dfff Simplify PySpark installation.
- Bundle Py4J binaries, since it's hard to install
- Uses Spark's `run` script to launch the Py4J
  gateway, inheriting the settings in spark-env.sh

With these changes, (hopefully) nothing more than
running `sbt/sbt package` will be necessary to run
PySpark.
2012-12-27 22:47:37 -08:00
Josh Rosen ac32447cd3 Use addFile() to ship code to cluster in PySpark.
Add options to pyspark.SparkContext constructor.
2012-12-27 19:59:04 -08:00
Josh Rosen 85b8f2c64f Add epydoc API documentation for PySpark. 2012-12-27 18:04:10 -08:00
Josh Rosen 2d98fff065 Add IPython support to pyspark-shell.
Suggested by / based on code from @MLnick
2012-12-27 10:17:36 -08:00
Josh Rosen 1dca0c5180 Remove debug output from PythonPartitioner. 2012-12-26 18:23:06 -08:00
Josh Rosen e2dad15621 Add support for batched serialization of Python objects in PySpark. 2012-12-26 18:16:09 -08:00
Josh Rosen 4608902fb8 Use filesystem to collect RDDs in PySpark.
Passing large volumes of data through Py4J seems
to be slow.  It appears to be faster to write the
data to the local filesystem and read it back from
Python.
2012-12-24 17:20:10 -08:00
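
A sketch of the mechanism (a hypothetical helper, not the actual PythonRDD code): the JVM writes each serialized element to a temp file as a length-prefixed record and hands Python only the path, avoiding per-object Py4J round trips.

```scala
import java.io.{DataOutputStream, File, FileOutputStream}

def dumpToTempFile(records: Iterator[Array[Byte]]): String = {
  val file = File.createTempFile("pyspark-collect", ".bin")
  val out = new DataOutputStream(new FileOutputStream(file))
  try records.foreach { r => out.writeInt(r.length); out.write(r) }
  finally out.close()
  file.getAbsolutePath // the Python side reads and unpickles from this path
}
```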
Matei Zaharia 84587a9bf3 Merge pull request #343 from markhamstra/spark-601
lookup() needn't fail when there is no partitioner
2012-12-24 15:28:05 -08:00
Josh Rosen ccd075cf96 Reduce object overhead in PySpark shuffle and collect 2012-12-24 15:01:13 -08:00
Mark Hamstra 903f3518df Fall back to filter-map-collect when calling lookup() on an RDD without a partitioner 2012-12-24 13:18:45 -08:00
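
The fallback is exactly what the message says: without a partitioner there is no single partition to probe, so `lookup(key)` degrades to `rdd.filter(_._1 == key).map(_._2).collect()`. The same shape on plain pairs:

```scala
def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
  pairs.filter(_._1 == key).map(_._2) // scans everything; no partition to target
```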
Matei Zaharia b575cbe069 Merge pull request #342 from markhamstra/spark-645
Allow distinct() to be called without parentheses
2012-12-24 08:04:50 -08:00
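
A guess at the mechanics behind SPARK-645: in Scala, a method declared with a parameter list cannot be invoked with no argument list at all, even when the parameter has a default, so a defaulted `distinct(numSplits: Int = ...)` plausibly became a pair of overloads. Sketched on a stand-in class:

```scala
class FakeRDD[T](xs: Seq[T]) {
  def distinct(numPartitions: Int): Seq[T] = xs.distinct // partition count ignored in this toy
  def distinct(): Seq[T] = distinct(1)                   // zero-arg overload
}

new FakeRDD(Seq(1, 1, 2)).distinct // Seq(1, 2); the paren-less call now compiles
```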