Commit graph

48 commits

Author SHA1 Message Date
root ec31e68d5d Fixed PySpark perf regression by not using socket.makefile(), and improved
debuggability by letting "print" statements show up in the executor's stderr

Conflicts:
	core/src/main/scala/spark/api/python/PythonRDD.scala
2013-07-01 06:26:31 +00:00
Jey Kottalam c75bed0eeb Fix reporting of PySpark exceptions 2013-06-21 12:14:16 -04:00
Jey Kottalam 7c5ff733ee PySpark daemon: fix deadlock, improve error handling 2013-06-21 12:14:16 -04:00
Jey Kottalam 62c4781400 Add tests and fixes for Python daemon shutdown 2013-06-21 12:14:16 -04:00
Jey Kottalam c79a6078c3 Prefork Python worker processes 2013-06-21 12:14:16 -04:00
Jey Kottalam 40afe0d2a5 Add Python timing instrumentation 2013-06-21 12:14:16 -04:00
Jey Kottalam 9a731f5a6d Fix Python saveAsTextFile doctest to not expect order to be preserved 2013-04-02 11:59:20 -07:00
Jey Kottalam 20604001e2 Fix argv handling in Python transitive closure example 2013-04-02 11:59:07 -07:00
Josh Rosen 2c966c98fb Change numSplits to numPartitions in PySpark. 2013-02-24 13:25:09 -08:00
Mark Hamstra b7a1fb5c5d Add commutative requirement for 'reduce' to Python docstring. 2013-02-09 12:14:11 -08:00
Josh Rosen e61729113d Remove unnecessary doctest __main__ methods. 2013-02-03 21:29:40 -08:00
Josh Rosen 8fbd5380b7 Fetch fewer objects in PySpark's take() method. 2013-02-03 06:44:49 +00:00
Josh Rosen 2415c18f48 Fix reporting of PySpark doctest failures. 2013-02-03 06:44:11 +00:00
Josh Rosen e211f405bc Use spark.local.dir for PySpark temp files (SPARK-580). 2013-02-01 11:50:27 -08:00
Josh Rosen 9cc6ff9c4e Do not launch JavaGateways on workers (SPARK-674).
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded.  The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.

I also made the gateway and jvm variables private.

This change results in ~3-4x performance improvement when running the
PySpark unit tests.
2013-02-01 11:13:10 -08:00
Josh Rosen 57b64d0d19 Fix stdout redirection in PySpark. 2013-02-01 00:25:19 -08:00
Patrick Wendell 3446d5c8d6 SPARK-673: Capture and re-throw Python exceptions
This patch alters the Python <-> executor protocol to pass on
exception data when they occur in user Python code.
2013-01-31 18:06:11 -08:00
Matei Zaharia 55327a283e Merge pull request #430 from pwendell/pyspark-guide
Minor improvements to PySpark docs
2013-01-30 15:35:29 -08:00
Patrick Wendell 3f945e3b83 Make module help available in python shell.
Also, adds a line in doc explaining how to use.
2013-01-30 15:04:06 -08:00
Stephen Haberman 7dfb82a992 Replace old 'master' term with 'driver'. 2013-01-25 11:03:00 -06:00
Matei Zaharia a2f4891d1d Merge pull request #396 from JoshRosen/spark-653
Make PySpark AccumulatorParam an abstract base class
2013-01-24 13:05:03 -08:00
Josh Rosen b47d054cfc Remove use of abc.ABCMeta due to cloudpickle issue.
cloudpickle runs into issues while pickling subclasses of AccumulatorParam,
which may be related to this Python issue:

    http://bugs.python.org/issue7689

This seems hard to fix and the ABCMeta wasn't necessary, so I removed it.
2013-01-23 11:47:27 -08:00
Josh Rosen ae2ed2947d Allow PySpark's SparkFiles to be used from driver
Fix minor documentation formatting issues.
2013-01-23 10:58:50 -08:00
Josh Rosen 35168d9c89 Fix sys.path bug in PySpark SparkContext.addPyFile 2013-01-22 17:54:11 -08:00
Josh Rosen c75ae3622e Make AccumulatorParam an abstract base class. 2013-01-21 22:32:57 -08:00
Josh Rosen ef711902c1 Don't download files to master's working directory.
This should avoid exceptions caused by existing
files with different contents.

I also removed some unused code.
2013-01-21 17:34:17 -08:00
Matei Zaharia c7b5e5f1ec Merge pull request #389 from JoshRosen/python_rdd_checkpointing
Add checkpointing to the Python API
2013-01-20 17:10:44 -08:00
Josh Rosen 9f211dd3f0 Fix PythonPartitioner equality; see SPARK-654.
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
2013-01-20 15:41:42 -08:00
Josh Rosen 00d70cd660 Clean up setup code in PySpark checkpointing tests 2013-01-20 15:38:11 -08:00
Josh Rosen 5b6ea9e9a0 Update checkpointing API docs in Python/Java. 2013-01-20 15:31:41 -08:00
Josh Rosen d0ba80dc72 Add checkpointFile() and more tests to PySpark. 2013-01-20 13:59:45 -08:00
Josh Rosen 7ed1bf4b48 Add RDD checkpointing to Python API. 2013-01-20 13:19:19 -08:00
Josh Rosen 17035db159 Add __repr__ to Accumulator; fix bug in sc.accumulator 2013-01-20 11:58:57 -08:00
Josh Rosen 9f54d7e1f5 Merge pull request #387 from mateiz/python-accumulators
Add accumulators to PySpark
2013-01-20 11:00:36 -08:00
Matei Zaharia 2a8c2a6790 Minor formatting fixes 2013-01-20 10:24:53 -08:00
Matei Zaharia a23ed25f3c Add a class comment to Accumulator 2013-01-20 02:10:25 -08:00
Matei Zaharia 61b6382a35 Launch accumulator tests in run-tests 2013-01-20 01:59:07 -08:00
Matei Zaharia 8e7f098a2c Added accumulators to PySpark 2013-01-20 01:57:44 -08:00
Nick Pentreath b77f7390a5 Python ALS example 2013-01-15 09:04:32 +02:00
Josh Rosen 49c74ba2af Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON. 2013-01-10 08:10:59 -08:00
Josh Rosen d55f2b9882 Use take() instead of takeSample() in PySpark kmeans example.
This is a temporary change until we port takeSample().
2013-01-09 21:21:23 -08:00
Josh Rosen 1a64432ba5 Indicate success/failure in PySpark test script. 2013-01-09 20:30:36 -08:00
Josh Rosen b57dd0f160 Add mapPartitionsWithSplit() to PySpark. 2013-01-08 16:05:02 -08:00
Josh Rosen 33beba3965 Change PySpark RDD.take() to not call iterator(). 2013-01-03 14:52:21 -08:00
Josh Rosen ce9f1bbe20 Add pyspark script to replace the other scripts.
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Josh Rosen b58340dbd9 Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
Josh Rosen 9abdfa6633 Fix Python 2.6 compatibility in Python API. 2012-09-17 00:09:16 -07:00
Josh Rosen 886b39de55 Add Python API. 2012-08-18 22:33:51 -07:00