Commit graph

1517 commits

Author SHA1 Message Date
Josh Rosen b58340dbd9 Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
Josh Rosen 170e451fbd Minor documentation and style fixes for PySpark. 2013-01-01 13:52:14 -08:00
Josh Rosen 6f6a6b79c4 Launch with scala by default in run-pyspark 2012-12-31 14:57:18 -08:00
Josh Rosen 099898b439 Port LR example to PySpark using numpy.
This version of the example crashes after the first iteration with
"OverflowError: math range error" because Python's math.exp()
behaves differently than Scala's; see SPARK-646.
2012-12-29 18:00:28 -08:00
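The overflow above is easy to reproduce: CPython's math.exp() raises OverflowError once the result exceeds a double, whereas Scala's math.exp (and numpy.exp) return infinity. A minimal sketch of the difference behind SPARK-646; the value of x is only an illustrative large margin after one gradient step.

```python
import math
import numpy as np

x = 1000.0  # illustrative large margin from the LR example

# CPython raises rather than returning inf:
try:
    math.exp(x)
except OverflowError as err:
    print("math.exp overflowed:", err)      # "math range error"

# numpy (like Scala's math.exp) saturates to infinity instead:
with np.errstate(over="ignore"):
    print(np.exp(x))                        # inf
    print(1.0 / (1.0 + np.exp(-x)))         # logistic value is still well-defined: 1.0
```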
Josh Rosen 39dd953fd8 Add test for pyspark.RDD.saveAsTextFile(). 2012-12-29 17:06:50 -08:00
Josh Rosen 59195c68ec Update PySpark for compatibility with TaskContext. 2012-12-29 16:01:03 -08:00
Josh Rosen c5cee53f20 Merge remote-tracking branch 'origin/master' into python-api
Conflicts:
	docs/quick-start.md
2012-12-29 16:00:51 -08:00
Josh Rosen 26186e2d25 Use batching in pyspark parallelize(); fix cartesian() 2012-12-29 15:34:57 -08:00
Matei Zaharia 3f74f729a1 Merge pull request #345 from JoshRosen/fix/add-file
Fix deletion of files in current working directory by clearFiles()
2012-12-29 15:01:33 -08:00
Josh Rosen 6ee1ff2663 Fix bug in pyspark.serializers.batch; add .gitignore. 2012-12-29 22:25:34 +00:00
Josh Rosen c2b105af34 Add documentation for Python API. 2012-12-28 22:51:28 -08:00
Josh Rosen 7ec3595de2 Fix bug (introduced by batching) in PySpark take() 2012-12-28 22:21:16 -08:00
Josh Rosen 397e67103c Change Utils.fetchFile() warning to SparkException. 2012-12-28 17:37:13 -08:00
Josh Rosen d64fa72d2e Add addFile() and addJar() to JavaSparkContext. 2012-12-28 17:00:57 -08:00
Josh Rosen bd237d4a9d Add synchronization to LocalScheduler.updateDependencies(). 2012-12-28 17:00:57 -08:00
Josh Rosen f1bf4f0385 Skip deletion of files in clearFiles().
This fixes an issue where Spark could delete
original files in the current working directory
that were added to the job using addFile().

There was also the potential for addFile() to
overwrite local files, which is addressed by
changing Utils.fetchFile() to log a warning
instead of overwriting a file with new contents.

This is a short-term fix; a better long-term
solution would be to remove the dependence on
storing files in the current working directory,
since we can't change the cwd from Java.
2012-12-28 17:00:57 -08:00
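The failure mode above comes from treating a file that already lives in the working directory as if it were a fetched copy. A minimal Python sketch of the guard logic described in the message; the real fix is on the Scala side (Utils.fetchFile/clearFiles), and all names here are illustrative.

```python
import logging
import os
import shutil

log = logging.getLogger("fetch_file_sketch")


def fetch_file(source_path, target_dir):
    """Copy source_path into target_dir without clobbering existing files."""
    target = os.path.join(target_dir, os.path.basename(source_path))
    if os.path.exists(target):
        if os.path.samefile(source_path, target):
            return  # already the local original; nothing to fetch or delete later
        # Mirror the short-term fix: warn instead of overwriting local content.
        log.warning("%s already exists and differs from %s; not overwriting",
                    target, source_path)
        return
    shutil.copy(source_path, target)
```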
Josh Rosen fbadb1cda5 Mark api.python classes as private; echo Java output to stderr. 2012-12-28 09:06:11 -08:00
Josh Rosen 665466dfff Simplify PySpark installation.
- Bundle Py4J binaries, since Py4J is hard to install
- Use Spark's `run` script to launch the Py4J
  gateway, inheriting the settings in spark-env.sh

With these changes, (hopefully) nothing more than
running `sbt/sbt package` will be necessary to run
PySpark.
2012-12-27 22:47:37 -08:00
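In other words, PySpark starts the JVM side through Spark's `run` script and then attaches a Py4J JavaGateway from Python. The sketch below is hypothetical (the main class, the lack of port negotiation, and the error handling are all simplified) and is not the actual PySpark launcher module.

```python
import os
import subprocess

from py4j.java_gateway import JavaGateway


def launch_gateway():
    # Start the gateway JVM via Spark's `run` script so that it inherits the
    # classpath and the spark-env.sh settings (main class is illustrative).
    spark_home = os.environ["SPARK_HOME"]
    proc = subprocess.Popen(
        [os.path.join(spark_home, "run"), "py4j.GatewayServer"])
    # Attach the Python side; assumes the server listens on Py4J's default port.
    return JavaGateway(), proc
```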
Josh Rosen ac32447cd3 Use addFile() to ship code to cluster in PySpark.
Add options to pyspark.SparkContext constructor.
2012-12-27 19:59:04 -08:00
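A hedged usage sketch of what the new constructor options enable: files listed at construction time are shipped to the cluster via addFile(). The keyword names beyond (master, job name) are assumptions about the options this commit introduces.

```python
from pyspark import SparkContext

# sparkHome/pyFiles are assumed option names; pyFiles are shipped via addFile().
sc = SparkContext("local[2]", "ExampleJob",
                  sparkHome="/path/to/spark",
                  pyFiles=["mylib.py"])
print(sc.parallelize(range(10)).count())
```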
Josh Rosen 85b8f2c64f Add epydoc API documentation for PySpark. 2012-12-27 18:04:10 -08:00
Josh Rosen 2d98fff065 Add IPython support to pyspark-shell.
Suggested by / based on code from @MLnick
2012-12-27 10:17:36 -08:00
Josh Rosen 1dca0c5180 Remove debug output from PythonPartitioner. 2012-12-26 18:23:06 -08:00
Josh Rosen e2dad15621 Add support for batched serialization of Python objects in PySpark. 2012-12-26 18:16:09 -08:00
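The batching above amortizes per-object pickling and transfer overhead by serializing fixed-size lists of elements instead of individual objects. A minimal sketch of the idea, not the actual pyspark.serializers code:

```python
import pickle
from itertools import islice


def batched(iterator, batch_size=1024):
    """Group an iterator into lists of at most batch_size elements."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def dump_stream(iterator, stream, batch_size=1024):
    # One pickle call per batch instead of one per element.
    for batch in batched(iterator, batch_size):
        pickle.dump(batch, stream, protocol=pickle.HIGHEST_PROTOCOL)
```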
Josh Rosen 4608902fb8 Use filesystem to collect RDDs in PySpark.
Passing large volumes of data through Py4J seems
to be slow.  It appears to be faster to write the
data to the local filesystem and read it back from
Python.
2012-12-24 17:20:10 -08:00
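Concretely, the approach above has the JVM write the collected, pickled elements to a local temporary file and lets Python read them back, instead of pushing every object through Py4J calls. A minimal sketch of the Python-side reader; the on-disk format (one pickle per batch) is an assumption.

```python
import pickle


def read_collected(path):
    """Read back pickled batches that the JVM side wrote to a local file."""
    items = []
    with open(path, "rb") as f:
        while True:
            try:
                batch = pickle.load(f)   # one pickle per batch, see batching above
            except EOFError:
                return items
            items.extend(batch if isinstance(batch, list) else [batch])
```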
Matei Zaharia 84587a9bf3 Merge pull request #343 from markhamstra/spark-601
lookup() needn't fail when there is no partitioner
2012-12-24 15:28:05 -08:00
Josh Rosen ccd075cf96 Reduce object overhead in PySpark shuffle and collect 2012-12-24 15:01:13 -08:00
Mark Hamstra 903f3518df fall back to filter-map-collect when calling lookup() on an RDD without a partitioner 2012-12-24 13:18:45 -08:00
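The fallback named in the message is literal: with no partitioner there is no way to route the lookup to a single partition, so the whole RDD is filtered by key and the matching values are collected. Expressed in PySpark terms (the actual change is on the Scala side):

```python
def lookup_fallback(pair_rdd, key):
    # Filter-map-collect: scan every partition, since no partitioner tells us
    # which partition would hold `key`.
    return (pair_rdd
            .filter(lambda kv: kv[0] == key)
            .map(lambda kv: kv[1])
            .collect())
```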
Matei Zaharia b575cbe069 Merge pull request #342 from markhamstra/spark-645
Allow distinct() to be called without parentheses
2012-12-24 08:04:50 -08:00
Mark Hamstra 61be8566e2 Allow distinct() to be called without parentheses when using the default number of splits. 2012-12-24 02:36:47 -08:00
Reynold Xin a6bb41c6d3 Updated Kryo version for Maven pom file. 2012-12-21 16:25:50 -08:00
Reynold Xin c68a076037 Updated Kryo documentation for Kryo version update. 2012-12-21 16:03:17 -08:00
Reynold Xin 60f7338092 Remove the call to close input stream in Kryo serializer. 2012-12-21 15:49:33 -08:00
Matei Zaharia 3334b7c6b5 Merge pull request #341 from rxin/4a3fb06ac2d11125feb08acbbd4df76d1e91b677
Kryo2 update against Spark master
2012-12-21 15:31:23 -08:00
Matei Zaharia 5e51b889fe Merge pull request #327 from rxin/spark-633
Added the ability to remove blocks in the block manager.
2012-12-20 11:33:38 -08:00
Reynold Xin 9397c5014e Let the slave notify the master of block removal. 2012-12-20 01:37:09 -08:00
Matei Zaharia e7051767f7 Merge pull request #337 from pwendell/worker-liveness-ui
SPARK-616: Logging dead workers in Web UI.
2012-12-19 15:31:32 -08:00
Reynold Xin 68c52d80ec Moved BlockManager's IdGenerator into BlockManager object. Removed some
excessive debug messages.
2012-12-19 15:27:23 -08:00
Matei Zaharia 30b47794da Merge pull request #340 from tomdz/deb-packaging-tweaks
Tweaked Debian packaging to be a bit more in line with Debian standards
2012-12-19 12:07:03 -08:00
Thomas Dudziak 5488ac67c3 Tweaked Debian packaging to be a bit more in line with Debian standards 2012-12-19 10:20:43 -08:00
Matei Zaharia 1e6e154d6d Merge pull request #338 from tomdz/repl-pom-fix
Fixed repl maven build
2012-12-18 14:03:29 -08:00
Thomas Dudziak 4af6cad37a Fixed the repl Maven build to produce artifacts with the appropriate Hadoop classifier, and extracted the repl fat-jar and Debian packaging into a separate project to make Maven happy 2012-12-18 12:08:19 -08:00
Patrick Wendell bfac06e1f6 SPARK-616: Logging dead workers in Web UI.
This patch keeps track of which workers have died and marks them
as such in the master web UI. It also handles workers that die and
re-register using different actor IDs.
2012-12-17 23:09:05 -08:00
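A tiny sketch of the bookkeeping described above: dead workers are kept and marked rather than dropped, and a re-registration under a new actor ID replaces the stale entry. This only illustrates the idea; it is not the master's actual Scala code.

```python
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    worker_id: str
    actor_id: str
    state: str = "ALIVE"


class WorkerRegistry:
    def __init__(self):
        self.workers = {}  # worker_id -> WorkerInfo

    def register(self, worker_id, actor_id):
        # A worker that died and re-registered with a new actor ID replaces
        # its stale entry instead of showing up twice in the UI.
        self.workers[worker_id] = WorkerInfo(worker_id, actor_id)

    def mark_dead(self, worker_id):
        # Keep the entry so the web UI can still list the worker as DEAD.
        if worker_id in self.workers:
            self.workers[worker_id].state = "DEAD"
```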
Matei Zaharia b82a6dd2c7 Merge pull request #332 from JoshRosen/spark-607
Add try-finally to handle MapOutputTracker timeouts
2012-12-14 11:41:16 -08:00
Reynold Xin 06f855c24d Merge branch 'spark-633' of github.com:rxin/spark into spark-633 2012-12-14 00:27:24 -08:00
Reynold Xin 8c01295b85 Fixed conflicts from merging Charles' and TD's block manager changes. 2012-12-14 00:26:36 -08:00
Matei Zaharia 1072f970cc Merge pull request #331 from woggling/deploy-exit-status
Have standalone cluster report exit codes to clients
2012-12-13 22:43:48 -08:00
Charles Reiss c528932a41 Code review cleanup. 2012-12-13 22:37:16 -08:00
Charles Reiss 0aad42b5e7 Have standalone cluster report exit codes to clients. Addresses SPARK-639. 2012-12-13 22:37:16 -08:00
Reynold Xin 0235667f73 Merge branch 'master' of github.com:mesos/spark into spark-633 2012-12-13 22:33:41 -08:00
Reynold Xin 97434f49b8 Merged TD's block manager refactoring. 2012-12-13 22:32:19 -08:00