Matei Zaharia
618f0ecb43
Merge pull request #869 from AndreSchumacher/subtract
...
PySpark: implementing subtractByKey(), subtract() and keyBy()
2013-08-30 18:17:13 -07:00
Andre Schumacher
96571c2524
PySpark: replacing class manifest by class tag for Scala 2.10.2 inside rdd.py
2013-08-30 15:00:42 -07:00
Matei Zaharia
ab0e625d9e
Fix PySpark for assembly run and include it in dist
2013-08-29 21:19:06 -07:00
Matei Zaharia
53cd50c069
Change build and run instructions to use assemblies
...
This commit makes Spark invocation saner by using an assembly JAR to
find all of Spark's dependencies instead of adding all the JARs in
lib_managed. It also packages the examples into an assembly and uses
that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
with two better-named scripts: "run-examples" for examples, and
"spark-class" for Spark internal classes (e.g. REPL, master, etc). This
is also designed to minimize the confusion people have in trying to use
"run" to run their own classes; it's not meant to do that, but now at
least if they look at it, they can modify run-examples to do a decent
job for them.
As part of this, Bagel's examples are also now properly moved to the
examples package instead of bagel.
2013-08-29 21:19:04 -07:00
Andre Schumacher
a511c5379e
RDD sample() and takeSample() prototypes for PySpark
2013-08-28 16:46:13 -07:00
Josh Rosen
742c44eae6
Don't send SIGINT to Py4J gateway subprocess.
...
This addresses SPARK-885, a usability issue where PySpark's
Java gateway process would be killed if the user hit ctrl-c.
Note that SIGINT still won't cancel the running s
This fix is based on http://stackoverflow.com/questions/5045771
2013-08-28 16:39:44 -07:00
Andre Schumacher
457bcd3343
PySpark: implementing subtractByKey(), subtract() and keyBy()
2013-08-28 16:14:22 -07:00
Andre Schumacher
76077bf9f4
Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark
2013-08-21 17:05:58 -07:00
Andre Schumacher
c7e348faec
Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
2013-08-16 11:58:20 -07:00
Josh Rosen
7a9abb9ddc
Fix PySpark unit tests on Python 2.6.
2013-08-14 15:12:12 -07:00
Matei Zaharia
e2fdac60da
Merge pull request #802 from stayhf/SPARK-760-Python
...
Simple PageRank algorithm implementation in Python for SPARK-760
2013-08-12 21:26:59 -07:00
Matei Zaharia
d3525babee
Merge pull request #813 from AndreSchumacher/add_files_pyspark
...
Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark
2013-08-12 21:02:39 -07:00
Andre Schumacher
8fd5c7bc00
Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark
...
Now ADD_FILES uses a comma as file name separator.
2013-08-12 20:22:52 -07:00
stayhf
24f02082c7
Code update for Matei's suggestions
2013-08-11 22:54:05 +00:00
stayhf
55d9bde2fa
Simple PageRank algorithm implementation in Python for SPARK-760
2013-08-10 23:48:51 +00:00
Matei Zaharia
3c8478e1fb
Merge pull request #747 from mateiz/improved-lr
...
Update the Python logistic regression example
2013-08-06 23:25:03 -07:00
Matei Zaharia
5ac548397d
Fix string parsing and style in LR
2013-07-31 23:12:30 -07:00
Josh Rosen
b95732632b
Do not inherit master's PYTHONPATH on workers.
...
This fixes SPARK-832, an issue where PySpark
would not work when the master and workers used
different SPARK_HOME paths.
This change may potentially break code that relied
on the master's PYTHONPATH being used on workers.
To have custom PYTHONPATH additions used on the
workers, users should set a custom PYTHONPATH in
spark-env.sh rather than setting it in the shell.
2013-07-29 22:08:57 -07:00
Matei Zaharia
01f94931d5
Update the Python logistic regression example to read from a file and
...
batch input records for more efficient NumPy computations
2013-07-29 19:23:41 -07:00
Matei Zaharia
d8158ced12
Merge branch 'master' of github.com:mesos/spark
2013-07-29 02:52:02 -04:00
Matei Zaharia
feba7ee540
SPARK-815. Python parallelize() should split lists before batching
...
One unfortunate consequence of this fix is that we materialize any
collections that are given to us as generators, but this seems necessary
to get reasonable behavior on small collections. We could add a
batchSize parameter later to bypass auto-computation of batch size if
this becomes a problem (e.g. if users really want to parallelize big
generators nicely)
2013-07-29 02:51:43 -04:00
Matei Zaharia
d75c308695
Use None instead of empty string as it's slightly smaller/faster
2013-07-29 02:51:43 -04:00
Matei Zaharia
96b50e82dc
Allow python/run-tests to run from any directory
2013-07-29 02:51:43 -04:00
Matei Zaharia
b5ec355622
Optimize Python foreach() to not return as many objects
2013-07-29 02:51:43 -04:00
Matei Zaharia
b9d6783f36
Optimize Python take() to not compute entire first partition
2013-07-29 02:51:43 -04:00
Matei Zaharia
f11ad72d4e
Some fixes to Python examples (style and package name for LR)
2013-07-27 21:12:22 -04:00
Matei Zaharia
af3c9d5042
Add Apache license headers and LICENSE and NOTICE files
2013-07-16 17:21:33 -07:00
root
ec31e68d5d
Fixed PySpark perf regression by not using socket.makefile(), and improved
...
debuggability by letting "print" statements show up in the executor's stderr
Conflicts:
core/src/main/scala/spark/api/python/PythonRDD.scala
2013-07-01 06:26:31 +00:00
Jey Kottalam
c75bed0eeb
Fix reporting of PySpark exceptions
2013-06-21 12:14:16 -04:00
Jey Kottalam
7c5ff733ee
PySpark daemon: fix deadlock, improve error handling
2013-06-21 12:14:16 -04:00
Jey Kottalam
62c4781400
Add tests and fixes for Python daemon shutdown
2013-06-21 12:14:16 -04:00
Jey Kottalam
c79a6078c3
Prefork Python worker processes
2013-06-21 12:14:16 -04:00
Jey Kottalam
40afe0d2a5
Add Python timing instrumentation
2013-06-21 12:14:16 -04:00
Jey Kottalam
9a731f5a6d
Fix Python saveAsTextFile doctest to not expect order to be preserved
2013-04-02 11:59:20 -07:00
Jey Kottalam
20604001e2
Fix argv handling in Python transitive closure example
2013-04-02 11:59:07 -07:00
Josh Rosen
2c966c98fb
Change numSplits to numPartitions in PySpark.
2013-02-24 13:25:09 -08:00
Mark Hamstra
b7a1fb5c5d
Add commutative requirement for 'reduce' to Python docstring.
2013-02-09 12:14:11 -08:00
Josh Rosen
e61729113d
Remove unnecessary doctest __main__ methods.
2013-02-03 21:29:40 -08:00
Josh Rosen
8fbd5380b7
Fetch fewer objects in PySpark's take() method.
2013-02-03 06:44:49 +00:00
Josh Rosen
2415c18f48
Fix reporting of PySpark doctest failures.
2013-02-03 06:44:11 +00:00
Josh Rosen
e211f405bc
Use spark.local.dir for PySpark temp files (SPARK-580).
2013-02-01 11:50:27 -08:00
Josh Rosen
9cc6ff9c4e
Do not launch JavaGateways on workers (SPARK-674).
...
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded. The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.
I also made the gateway and jvm variables private.
This change results in ~3-4x performance improvement when running the
PySpark unit tests.
2013-02-01 11:13:10 -08:00
Josh Rosen
57b64d0d19
Fix stdout redirection in PySpark.
2013-02-01 00:25:19 -08:00
Patrick Wendell
3446d5c8d6
SPARK-673: Capture and re-throw Python exceptions
...
This patch alters the Python <-> executor protocol to pass on
exception data when they occur in user Python code.
2013-01-31 18:06:11 -08:00
Matei Zaharia
55327a283e
Merge pull request #430 from pwendell/pyspark-guide
...
Minor improvements to PySpark docs
2013-01-30 15:35:29 -08:00
Patrick Wendell
3f945e3b83
Make module help available in python shell.
...
Also, adds a line in doc explaining how to use.
2013-01-30 15:04:06 -08:00
Stephen Haberman
7dfb82a992
Replace old 'master' term with 'driver'.
2013-01-25 11:03:00 -06:00
Matei Zaharia
a2f4891d1d
Merge pull request #396 from JoshRosen/spark-653
...
Make PySpark AccumulatorParam an abstract base class
2013-01-24 13:05:03 -08:00
Josh Rosen
b47d054cfc
Remove use of abc.ABCMeta due to cloudpickle issue.
...
cloudpickle runs into issues while pickling subclasses of AccumulatorParam,
which may be related to this Python issue:
http://bugs.python.org/issue7689
This seems hard to fix and the ABCMeta wasn't necessary, so I removed it.
2013-01-23 11:47:27 -08:00
Josh Rosen
ae2ed2947d
Allow PySpark's SparkFiles to be used from driver
...
Fix minor documentation formatting issues.
2013-01-23 10:58:50 -08:00
Josh Rosen
35168d9c89
Fix sys.path bug in PySpark SparkContext.addPyFile
2013-01-22 17:54:11 -08:00
Josh Rosen
c75ae3622e
Make AccumulatorParam an abstract base class.
2013-01-21 22:32:57 -08:00
Josh Rosen
ef711902c1
Don't download files to master's working directory.
...
This should avoid exceptions caused by existing
files with different contents.
I also removed some unused code.
2013-01-21 17:34:17 -08:00
Matei Zaharia
c7b5e5f1ec
Merge pull request #389 from JoshRosen/python_rdd_checkpointing
...
Add checkpointing to the Python API
2013-01-20 17:10:44 -08:00
Josh Rosen
9f211dd3f0
Fix PythonPartitioner equality; see SPARK-654.
...
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
2013-01-20 15:41:42 -08:00
Josh Rosen
00d70cd660
Clean up setup code in PySpark checkpointing tests
2013-01-20 15:38:11 -08:00
Josh Rosen
5b6ea9e9a0
Update checkpointing API docs in Python/Java.
2013-01-20 15:31:41 -08:00
Josh Rosen
d0ba80dc72
Add checkpointFile() and more tests to PySpark.
2013-01-20 13:59:45 -08:00
Josh Rosen
7ed1bf4b48
Add RDD checkpointing to Python API.
2013-01-20 13:19:19 -08:00
Josh Rosen
17035db159
Add __repr__ to Accumulator; fix bug in sc.accumulator
2013-01-20 11:58:57 -08:00
Josh Rosen
9f54d7e1f5
Merge pull request #387 from mateiz/python-accumulators
...
Add accumulators to PySpark
2013-01-20 11:00:36 -08:00
Matei Zaharia
2a8c2a6790
Minor formatting fixes
2013-01-20 10:24:53 -08:00
Matei Zaharia
a23ed25f3c
Add a class comment to Accumulator
2013-01-20 02:10:25 -08:00
Matei Zaharia
61b6382a35
Launch accumulator tests in run-tests
2013-01-20 01:59:07 -08:00
Matei Zaharia
8e7f098a2c
Added accumulators to PySpark
2013-01-20 01:57:44 -08:00
Nick Pentreath
b77f7390a5
Python ALS example
2013-01-15 09:04:32 +02:00
Josh Rosen
49c74ba2af
Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON.
2013-01-10 08:10:59 -08:00
Josh Rosen
d55f2b9882
Use take() instead of takeSample() in PySpark kmeans example.
...
This is a temporary change until we port takeSample().
2013-01-09 21:21:23 -08:00
Josh Rosen
1a64432ba5
Indicate success/failure in PySpark test script.
2013-01-09 20:30:36 -08:00
Josh Rosen
b57dd0f160
Add mapPartitionsWithSplit() to PySpark.
2013-01-08 16:05:02 -08:00
Josh Rosen
33beba3965
Change PySpark RDD.take() to not call iterator().
2013-01-03 14:52:21 -08:00
Josh Rosen
ce9f1bbe20
Add pyspark
script to replace the other scripts.
...
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Josh Rosen
b58340dbd9
Rename top-level 'pyspark' directory to 'python'
2013-01-01 15:05:00 -08:00
Josh Rosen
9abdfa6633
Fix Python 2.6 compatibility in Python API.
2012-09-17 00:09:16 -07:00
Josh Rosen
886b39de55
Add Python API.
2012-08-18 22:33:51 -07:00