Nick Pentreath
a76f53416c
Add toString to Java RDD, and __repr__ to Python RDD
2013-12-19 14:38:20 +02:00
Shivaram Venkataraman
af0cd6bd27
Add collectPartition to JavaRDD interface.
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
2013-12-18 11:40:07 -08:00
Prashant Sharma
603af51bb5
Merge branch 'master' into akka-bug-fix
Conflicts:
core/pom.xml
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
pom.xml
project/SparkBuild.scala
streaming/pom.xml
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Josh Rosen
3787f514d9
Fix UnicodeEncodeError in PySpark saveAsTextFile().
Fixes SPARK-970.
2013-11-28 23:44:56 -08:00
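The fix above addresses a classic failure mode: writing non-ASCII text to a byte-oriented output triggers an implicit-codec UnicodeEncodeError. The 2013-era code was Python 2; a hypothetical Python 3 sketch of the underlying idea (the function name is illustrative, not PySpark's API) is:

```python
def to_output_bytes(x):
    """Convert one RDD element to the bytes written for one output line.
    Text is encoded explicitly as UTF-8 instead of relying on an
    implicit default codec, which is what raised UnicodeEncodeError."""
    if isinstance(x, bytes):          # already raw bytes: pass through
        return x
    return str(x).encode("utf-8")     # explicit UTF-8 encoding

lines = ["plain ascii", "caf\u00e9", b"raw bytes"]
encoded = [to_output_bytes(x) for x in lines]
```

Encoding at a single choke point like this means every element reaches the file sink as bytes, regardless of whether it started life as text.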
Prashant Sharma
17987778da
Merge branch 'master' into wip-scala-2.10
Conflicts:
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
core/src/main/scala/org/apache/spark/rdd/RDD.scala
python/pyspark/rdd.py
2013-11-27 14:44:12 +05:30
Josh Rosen
13122ceb8c
FramedSerializer: _dumps => dumps, _loads => loads.
2013-11-10 17:53:25 -08:00
Josh Rosen
ffa5bedf46
Send PySpark commands as bytes instead of strings.
2013-11-10 16:46:00 -08:00
Josh Rosen
cbb7f04aef
Add custom serializer support to PySpark.
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
This also fixes a bug in SparkContext.union().
2013-11-10 16:45:38 -08:00
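The serializer work above (including the later `_dumps => dumps` rename) revolves around two ideas: a pluggable per-object codec and a length-prefixed framing of objects on a stream. A self-contained sketch of both, using the stdlib `marshal` module as the codec — the class name and the 4-byte big-endian frame header are assumptions for illustration, not PySpark's actual wire format:

```python
import marshal
import struct
from io import BytesIO

class MarshalFramedSerializer:
    """Sketch of a framed serializer: each object is written as a
    4-byte big-endian length followed by its marshal-encoded payload."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)

    def dump_stream(self, iterator, stream):
        for obj in iterator:
            payload = self.dumps(obj)
            stream.write(struct.pack(">I", len(payload)))  # frame header
            stream.write(payload)

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if len(header) < 4:          # end of stream
                return
            (length,) = struct.unpack(">I", header)
            yield self.loads(stream.read(length))

ser = MarshalFramedSerializer()
buf = BytesIO()
ser.dump_stream([(1, "a"), (2, "b")], buf)
buf.seek(0)
roundtrip = list(ser.load_stream(buf))
```

Because framing is separate from encoding, swapping `marshal` for pickle (or a MsgPack decoder, as the commit body anticipates) only changes `dumps`/`loads`.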
Josh Rosen
7d68a81a8e
Remove Pickle-wrapping of Java objects in PySpark.
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
2013-11-03 11:03:02 -08:00
Prashant Sharma
026ab75661
Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10
2013-10-10 09:42:55 +05:30
Matei Zaharia
478b2b7edc
Fix PySpark docs and an overly long line of code after fdbae41e
2013-10-09 12:08:04 -07:00
Prashant Sharma
7be75682b9
Merge branch 'master' into wip-merge-master
Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
In Scala 2.10 a shorter representation is used for naming artifacts,
so the artifact names now use the shorter Scala version, which is made a property in the pom.
2013-10-08 11:29:40 +05:30
Andre Schumacher
fdbae41e88
SPARK-705: implement sortByKey() in PySpark
2013-10-07 12:16:33 -07:00
Andre Schumacher
c84946fe21
Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
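The point of the SPARK-602 fix above is that hashing an opaque pickled byte array on the JVM side cannot respect Python-level semantics such as range partitioning for sorts, so the partition ID should be decided on the Python side and merely honored by the JVM. A minimal sketch of the Python-side computation (the function name is hypothetical, and real PySpark uses a portable hash rather than the builtin):

```python
def partition_for(key, num_partitions):
    """Compute the partition ID on the Python side, so the JVM-side
    partitioner can use it directly instead of re-hashing an opaque
    pickled byte array."""
    return hash(key) % num_partitions  # Python's % yields a value in [0, n)

pairs = [("apple", 1), ("banana", 2), ("cherry", 3)]
placed = [(partition_for(k, 4), k, v) for k, v in pairs]
```

The same slot could equally be filled by a range-partitioning function, which is exactly what sorting via PySpark requires.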
Prashant Sharma
383e151fd7
Merge branch 'master' of git://github.com/mesos/spark into scala-2.10
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
project/SparkBuild.scala
2013-09-15 10:55:12 +05:30
Aaron Davidson
c1cc8c4da2
Export StorageLevel and refactor
2013-09-07 14:41:31 -07:00
Prashant Sharma
4106ae9fbf
Merged with master
2013-09-06 17:53:01 +05:30
Aaron Davidson
a63d4c7dc2
SPARK-660: Add StorageLevel support in Python
It uses reflection... I am not proud of that fact, but it at least ensures
compatibility (sans refactoring of the StorageLevel stuff).
2013-09-05 23:36:27 -07:00
Matei Zaharia
6edef9c833
Merge pull request #861 from AndreSchumacher/pyspark_sampling_function
Pyspark sampling function
2013-08-31 13:39:24 -07:00
Andre Schumacher
96571c2524
PySpark: replacing class manifest with class tag for Scala 2.10.2 inside rdd.py
2013-08-30 15:00:42 -07:00
Andre Schumacher
a511c5379e
RDD sample() and takeSample() prototypes for PySpark
2013-08-28 16:46:13 -07:00
Andre Schumacher
457bcd3343
PySpark: implementing subtractByKey(), subtract() and keyBy()
2013-08-28 16:14:22 -07:00
Andre Schumacher
76077bf9f4
Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark
2013-08-21 17:05:58 -07:00
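Statistics like the DoubleRDDFunctions methods referenced in SPARK-838 (mean, variance, stdev) can be computed distributively: each partition reduces to a small summary tuple, and the summaries merge associatively on the driver. A hedged sketch of that pattern in plain Python, not the actual PySpark implementation:

```python
from functools import reduce
from math import sqrt

def partition_stats(part):
    """Reduce one partition to (count, sum, sum of squares)."""
    n = s = s2 = 0
    for x in part:
        n += 1
        s += x
        s2 += x * x
    return (n, s, s2)

def merge(a, b):
    """Merging summaries is just component-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
n, s, s2 = reduce(merge, (partition_stats(p) for p in partitions))

mean = s / n
variance = s2 / n - mean * mean   # population variance from raw moments
stdev = sqrt(variance)
```

Only three numbers per partition cross the network, however large each partition is.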
Andre Schumacher
c7e348faec
Implementing SPARK-878 for PySpark: adding zip and egg files to the context and passing them down to workers, which add them to their sys.path
2013-08-16 11:58:20 -07:00
Josh Rosen
b95732632b
Do not inherit master's PYTHONPATH on workers.
This fixes SPARK-832, an issue where PySpark
would not work when the master and workers used
different SPARK_HOME paths.
This change may potentially break code that relied
on the master's PYTHONPATH being used on workers.
To have custom PYTHONPATH additions used on the
workers, users should set a custom PYTHONPATH in
spark-env.sh rather than setting it in the shell.
2013-07-29 22:08:57 -07:00
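Per the commit body, worker-side path additions now belong in the environment script rather than the master's shell. A hypothetical spark-env.sh fragment (the library path shown is an assumed example):

```shell
# conf/spark-env.sh on each worker (per SPARK-832): put custom PYTHONPATH
# additions here instead of exporting them in the master's shell, since
# workers no longer inherit the master's PYTHONPATH.
export PYTHONPATH="/opt/my-libs/python:$PYTHONPATH"
```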
Matei Zaharia
d75c308695
Use None instead of empty string as it's slightly smaller/faster
2013-07-29 02:51:43 -04:00
Matei Zaharia
b5ec355622
Optimize Python foreach() to not return as many objects
2013-07-29 02:51:43 -04:00
Matei Zaharia
b9d6783f36
Optimize Python take() to not compute entire first partition
2013-07-29 02:51:43 -04:00
Matei Zaharia
af3c9d5042
Add Apache license headers and LICENSE and NOTICE files
2013-07-16 17:21:33 -07:00
Jey Kottalam
9a731f5a6d
Fix Python saveAsTextFile doctest to not expect order to be preserved
2013-04-02 11:59:20 -07:00
Josh Rosen
2c966c98fb
Change numSplits to numPartitions in PySpark.
2013-02-24 13:25:09 -08:00
Mark Hamstra
b7a1fb5c5d
Add commutative requirement for 'reduce' to Python docstring.
2013-02-09 12:14:11 -08:00
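The docstring change above encodes a real correctness requirement: partition results may be combined in any order, so a reduce operator must be commutative as well as associative. A quick illustration in plain Python:

```python
from functools import reduce

# Addition is associative AND commutative: safe for a distributed reduce,
# since partial results can be merged in any order.
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])

# Subtraction is neither: the answer depends on grouping order, so it is
# not a valid reduce operator.
left = (1 - 2) - 3     # one merge order
right = 1 - (2 - 3)    # another merge order, different result
```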
Josh Rosen
8fbd5380b7
Fetch fewer objects in PySpark's take() method.
2013-02-03 06:44:49 +00:00
Josh Rosen
2415c18f48
Fix reporting of PySpark doctest failures.
2013-02-03 06:44:11 +00:00
Josh Rosen
e211f405bc
Use spark.local.dir for PySpark temp files (SPARK-580).
2013-02-01 11:50:27 -08:00
Josh Rosen
9cc6ff9c4e
Do not launch JavaGateways on workers (SPARK-674).
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded. The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.
I also made the gateway and jvm variables private.
This change results in ~3-4x performance improvement when running the
PySpark unit tests.
2013-02-01 11:13:10 -08:00
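The SPARK-674 fix described above is the classic lazy-initialization pattern: move an expensive side effect out of module import and into first real use, behind a private cached field. A hedged stand-in sketch (the class and attribute names are hypothetical; `object()` stands in for the real JVM gateway):

```python
class Context:
    """Sketch of lazy gateway initialization: the expensive gateway is
    created only when a context is actually constructed, never at
    module import time."""

    _gateway = None   # private class-level cache, as in the commit
    launches = 0      # counts expensive launches, for illustration only

    @classmethod
    def _ensure_gateway(cls):
        if cls._gateway is None:
            Context.launches += 1      # the expensive launch runs once
            cls._gateway = object()    # stand-in for the real gateway
        return cls._gateway

    def __init__(self):
        self._jvm = Context._ensure_gateway()

# Importing the module costs nothing; two contexts share one gateway.
a = Context()
b = Context()
```

On worker processes that never construct a context, the gateway is never launched at all, which is where the reported test speedup comes from.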
Matei Zaharia
c7b5e5f1ec
Merge pull request #389 from JoshRosen/python_rdd_checkpointing
Add checkpointing to the Python API
2013-01-20 17:10:44 -08:00
Josh Rosen
9f211dd3f0
Fix PythonPartitioner equality; see SPARK-654.
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
2013-01-20 15:41:42 -08:00
Josh Rosen
00d70cd660
Clean up setup code in PySpark checkpointing tests
2013-01-20 15:38:11 -08:00
Josh Rosen
5b6ea9e9a0
Update checkpointing API docs in Python/Java.
2013-01-20 15:31:41 -08:00
Josh Rosen
d0ba80dc72
Add checkpointFile() and more tests to PySpark.
2013-01-20 13:59:45 -08:00
Josh Rosen
7ed1bf4b48
Add RDD checkpointing to Python API.
2013-01-20 13:19:19 -08:00
Matei Zaharia
8e7f098a2c
Added accumulators to PySpark
2013-01-20 01:57:44 -08:00
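Accumulators, as added above, are add-only shared variables: tasks on workers may only call add(), and only the driver reads the merged value. A minimal single-process sketch of that contract (not PySpark's actual Accumulator class):

```python
class Accumulator:
    """Minimal sketch of an add-only accumulator: workers call add(),
    and only the driver reads .value after updates are merged."""

    def __init__(self, zero):
        self._value = zero

    def add(self, term):
        self._value += term

    @property
    def value(self):
        return self._value

acc = Accumulator(0)
for partition in [[1, 2], [3, 4, 5]]:
    for x in partition:        # in real PySpark this loop runs on workers
        acc.add(x)
```

Restricting workers to add() keeps merging order-independent, which matters because partition updates arrive in no particular order.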
Josh Rosen
b57dd0f160
Add mapPartitionsWithSplit() to PySpark.
2013-01-08 16:05:02 -08:00
Josh Rosen
33beba3965
Change PySpark RDD.take() to not call iterator().
2013-01-03 14:52:21 -08:00
Josh Rosen
b58340dbd9
Rename top-level 'pyspark' directory to 'python'
2013-01-01 15:05:00 -08:00