Prashant Sharma
084839ba35
Merge pull request #498 from ScrapCodes/python-api. Closes #498.
...
Python api additions
Author: Prashant Sharma <prashant.s@imaginea.com>
== Merge branch commits ==
commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Fri Jan 24 11:50:29 2014 +0530
Josh's and Patrick's review comments.
commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Thu Jan 23 17:27:17 2014 +0530
fixed doc tests
commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Thu Jan 23 16:48:43 2014 +0530
Added keys and values methods for PairFunctions in python
commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Thu Jan 23 13:51:26 2014 +0530
Added foreachPartition
commit 05f05341a187cba829ac0e6c2bdf30be49948c89
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Thu Jan 23 13:02:59 2014 +0530
Added coalesce function to python API
commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
Author: Prashant Sharma <prashant.s@imaginea.com>
Date: Thu Jan 23 12:52:44 2014 +0530
added repartition function to python API.
2014-02-06 14:58:35 -08:00
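The merged commits above add several RDD methods to the Python API; the semantics of the pair-projection methods (keys()/values()) and of foreachPartition() can be sketched over plain Python lists (hypothetical data, not real RDDs):

```python
# Hypothetical stand-ins for an RDD of (key, value) pairs and for a
# partitioned RDD; real code would call these methods on PySpark RDDs.
pairs = [("a", 1), ("b", 2), ("a", 3)]

keys = [k for k, _ in pairs]      # what rdd.keys() yields
values = [v for _, v in pairs]    # what rdd.values() yields

# foreachPartition applies a function to an iterator per partition:
partitions = [[1, 2], [3, 4]]
seen = []
for part in partitions:
    seen.append(sum(iter(part)))  # the per-partition function
```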
Josh Rosen
4cebb79c9f
Deprecate mapPartitionsWithSplit in PySpark.
...
Also, replace the last reference to it in the docs.
This fixes SPARK-1026.
2014-01-23 20:01:36 -08:00
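The method deprecated above was superseded by mapPartitionsWithIndex; its contract (the function receives the partition index plus an iterator over that partition's elements) can be sketched without Spark:

```python
# Hypothetical partitioned data standing in for an RDD's partitions.
partitions = [[1, 2], [3, 4]]

def f(index, it):
    # The user function sees the partition index and an element iterator.
    return [(index, x) for x in it]

result = [y for i, part in enumerate(partitions) for y in f(i, iter(part))]
```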
Tor Myklebust
fec01664a7
Make Python function/line appear in the UI.
2013-12-28 23:34:16 -05:00
Reynold Xin
7990c56375
Merge pull request #276 from shivaram/collectPartition
...
Add collectPartition to JavaRDD interface.
This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.
Thanks @concretevitamin for the original change and tests.
2013-12-19 13:35:09 -08:00
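The take-via-collectPartition idea described above can be sketched with a plain-list stand-in: fetch one partition at a time until enough elements have been collected (`collect_partition` here is a hypothetical helper, not the real API):

```python
# Hypothetical partitions; a real frontend would fetch serialized
# partition contents from the JVM via a collectPartition-style call.
partitions = [[1, 2], [3, 4, 5], [6]]

def collect_partition(i):
    return partitions[i]

def take(n):
    out = []
    for i in range(len(partitions)):
        if len(out) >= n:
            break                      # stop fetching once we have enough
        out.extend(collect_partition(i))
    return out[:n]
```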
Shivaram Venkataraman
d3234f9726
Make collectPartitions take an array of partitions
...
Change the implementation to use runJob instead of PartitionPruningRDD.
Also update the unit tests and the python take implementation
to use the new interface.
2013-12-19 11:40:34 -08:00
Nick Pentreath
a76f53416c
Add toString to Java RDD, and __repr__ to Python RDD
2013-12-19 14:38:20 +02:00
Shivaram Venkataraman
af0cd6bd27
Add collectPartition to JavaRDD interface.
...
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
2013-12-18 11:40:07 -08:00
Prashant Sharma
603af51bb5
Merge branch 'master' into akka-bug-fix
...
Conflicts:
core/pom.xml
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
pom.xml
project/SparkBuild.scala
streaming/pom.xml
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Josh Rosen
3787f514d9
Fix UnicodeEncodeError in PySpark saveAsTextFile().
...
Fixes SPARK-970.
2013-11-28 23:44:56 -08:00
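The bug class fixed above comes from writing unicode text to a byte-oriented sink without an explicit encoding; the sketch below shows the explicit UTF-8 encode step that avoids UnicodeEncodeError (a minimal model of the idea, not the actual fix code):

```python
# A non-ASCII line that a naive byte-oriented writer could choke on
# when the default codec is ASCII:
line = u"caf\u00e9"

# Encoding explicitly to UTF-8 sidesteps the default-codec problem:
encoded = line.encode("utf-8")
decoded = encoded.decode("utf-8")
```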
Prashant Sharma
17987778da
Merge branch 'master' into wip-scala-2.10
...
Conflicts:
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
core/src/main/scala/org/apache/spark/rdd/RDD.scala
python/pyspark/rdd.py
2013-11-27 14:44:12 +05:30
Josh Rosen
13122ceb8c
FramedSerializer: _dumps => dumps, _loads => loads.
2013-11-10 17:53:25 -08:00
Josh Rosen
ffa5bedf46
Send PySpark commands as bytes instead of strings.
2013-11-10 16:46:00 -08:00
Josh Rosen
cbb7f04aef
Add custom serializer support to PySpark.
...
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
This also fixes a bug in SparkContext.union().
2013-11-10 16:45:38 -08:00
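MarshalSerializer is built on Python's marshal module; a minimal sketch of the dumps/loads round-trip such a serializer performs (an illustrative model, not PySpark's actual class):

```python
import marshal

# A marshal-backed serializer: fast, but limited to basic Python types.
def dumps(obj):
    return marshal.dumps(obj)

def loads(data):
    return marshal.loads(data)

value = {"a": 1, "b": [2, 3]}
roundtripped = loads(dumps(value))
```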
Josh Rosen
7d68a81a8e
Remove Pickle-wrapping of Java objects in PySpark.
...
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
2013-11-03 11:03:02 -08:00
Prashant Sharma
026ab75661
Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10
2013-10-10 09:42:55 +05:30
Matei Zaharia
478b2b7edc
Fix PySpark docs and an overly long line of code after fdbae41e
2013-10-09 12:08:04 -07:00
Prashant Sharma
7be75682b9
Merge branch 'master' into wip-merge-master
...
Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
In Scala 2.10 a shorter representation is used for naming artifacts,
so the artifact names were changed to the shorter Scala version and made a property in the pom.
2013-10-08 11:29:40 +05:30
Andre Schumacher
fdbae41e88
SPARK-705: implement sortByKey() in PySpark
2013-10-07 12:16:33 -07:00
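The core semantics of sortByKey() amount to sorting (key, value) pairs by key; a plain-Python sketch with hypothetical data:

```python
pairs = [("b", 2), ("a", 1), ("c", 3)]

# Ascending sort on the key side of each pair, as sortByKey() does
# by default:
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])
```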
Andre Schumacher
c84946fe21
Fixing SPARK-602: PythonPartitioner
...
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
Prashant Sharma
383e151fd7
Merge branch 'master' of git://github.com/mesos/spark into scala-2.10
...
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
project/SparkBuild.scala
2013-09-15 10:55:12 +05:30
Aaron Davidson
c1cc8c4da2
Export StorageLevel and refactor
2013-09-07 14:41:31 -07:00
Prashant Sharma
4106ae9fbf
Merged with master
2013-09-06 17:53:01 +05:30
Aaron Davidson
a63d4c7dc2
SPARK-660: Add StorageLevel support in Python
...
It uses reflection... I am not proud of that fact, but it at least ensures
compatibility (sans refactoring of the StorageLevel stuff).
2013-09-05 23:36:27 -07:00
Matei Zaharia
6edef9c833
Merge pull request #861 from AndreSchumacher/pyspark_sampling_function
...
Pyspark sampling function
2013-08-31 13:39:24 -07:00
Andre Schumacher
96571c2524
PySpark: replacing class manifest by class tag for Scala 2.10.2 inside rdd.py
2013-08-30 15:00:42 -07:00
Andre Schumacher
a511c5379e
RDD sample() and takeSample() prototypes for PySpark
2013-08-28 16:46:13 -07:00
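The two prototypes above differ in contract: sample() keeps each element independently with some probability, while takeSample() returns a fixed-size draw. A plain-Python sketch of those contracts (seeded, with hypothetical data):

```python
import random

random.seed(42)
xs = list(range(100))

# sample()-style: each element kept independently with probability
# `fraction`, so the result size varies.
fraction = 0.1
sampled = [x for x in xs if random.random() < fraction]

# takeSample()-style: an exact-size draw without replacement.
exact = random.sample(xs, 5)
```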
Andre Schumacher
457bcd3343
PySpark: implementing subtractByKey(), subtract() and keyBy()
2013-08-28 16:14:22 -07:00
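The semantics of the three methods added above can be sketched over plain lists (hypothetical data, not real RDDs):

```python
# subtractByKey: drop pairs whose key appears in the other dataset.
a = [("a", 1), ("b", 2), ("c", 3)]
b = [("b", 9)]
other_keys = {k for k, _ in b}
subtract_by_key = [(k, v) for k, v in a if k not in other_keys]

# subtract: drop elements present in the other dataset.
xs, ys = [1, 2, 3], [2]
subtract = [x for x in xs if x not in ys]

# keyBy(f): pair each element with f(element) as its key.
key_by = [(len(s), s) for s in ["hi", "spark"]]
```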
Andre Schumacher
76077bf9f4
Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark
2013-08-21 17:05:58 -07:00
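The DoubleRDDFunctions-style methods this change exposes (mean, variance, stdev) reduce to simple statistics when sketched over a plain list:

```python
xs = [1.0, 2.0, 3.0, 4.0]
n = len(xs)

mean = sum(xs) / n                                  # rdd.mean()
variance = sum((x - mean) ** 2 for x in xs) / n     # population variance
stdev = variance ** 0.5                             # rdd.stdev()
```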
Andre Schumacher
c7e348faec
Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path
2013-08-16 11:58:20 -07:00
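Shipping .zip/.egg files to workers works because Python can import directly from such archives once they appear on sys.path; a sketch of the worker-side step (`deps.zip` is a hypothetical archive name):

```python
import sys

# A shipped archive is made importable simply by appending its path;
# Python's zipimport machinery handles the rest on first import.
archive = "deps.zip"
if archive not in sys.path:
    sys.path.append(archive)
```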
Josh Rosen
b95732632b
Do not inherit master's PYTHONPATH on workers.
...
This fixes SPARK-832, an issue where PySpark
would not work when the master and workers used
different SPARK_HOME paths.
This change may potentially break code that relied
on the master's PYTHONPATH being used on workers.
To have custom PYTHONPATH additions used on the
workers, users should set a custom PYTHONPATH in
spark-env.sh rather than setting it in the shell.
2013-07-29 22:08:57 -07:00
Matei Zaharia
d75c308695
Use None instead of empty string as it's slightly smaller/faster
2013-07-29 02:51:43 -04:00
Matei Zaharia
b5ec355622
Optimize Python foreach() to not return as many objects
2013-07-29 02:51:43 -04:00
Matei Zaharia
b9d6783f36
Optimize Python take() to not compute entire first partition
2013-07-29 02:51:43 -04:00
Matei Zaharia
af3c9d5042
Add Apache license headers and LICENSE and NOTICE files
2013-07-16 17:21:33 -07:00
Jey Kottalam
9a731f5a6d
Fix Python saveAsTextFile doctest to not expect order to be preserved
2013-04-02 11:59:20 -07:00
Josh Rosen
2c966c98fb
Change numSplits to numPartitions in PySpark.
2013-02-24 13:25:09 -08:00
Mark Hamstra
b7a1fb5c5d
Add commutative requirement for 'reduce' to Python docstring.
2013-02-09 12:14:11 -08:00
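The commutativity requirement matters because Spark's reduce() merges per-partition partial results in no guaranteed order; addition is safe, but an order-dependent operator like subtraction is not:

```python
from functools import reduce

xs = [1, 2, 3, 4]

# Addition is commutative and associative, so any merge order agrees:
total = reduce(lambda a, b: a + b, xs)

# Subtraction is neither, so different merge orders disagree:
left = (1 - 2) - 3
right = 1 - (2 - 3)
```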
Josh Rosen
8fbd5380b7
Fetch fewer objects in PySpark's take() method.
2013-02-03 06:44:49 +00:00
Josh Rosen
2415c18f48
Fix reporting of PySpark doctest failures.
2013-02-03 06:44:11 +00:00
Josh Rosen
e211f405bc
Use spark.local.dir for PySpark temp files (SPARK-580).
2013-02-01 11:50:27 -08:00
Josh Rosen
9cc6ff9c4e
Do not launch JavaGateways on workers (SPARK-674).
...
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded. The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.
I also made the gateway and jvm variables private.
This change results in ~3-4x performance improvement when running the
PySpark unit tests.
2013-02-01 11:13:10 -08:00
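The fix described above is the standard lazy-initialization pattern: create the expensive object on first use rather than at module import time. A minimal sketch with a hypothetical Gateway stand-in:

```python
class Gateway:
    """Stand-in for an expensive-to-launch JVM gateway."""
    launches = 0

    def __init__(self):
        Gateway.launches += 1

_gateway = None  # nothing launched at import time

def get_gateway():
    # Launch on first use only; later calls reuse the same instance.
    global _gateway
    if _gateway is None:
        _gateway = Gateway()
    return _gateway
```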
Matei Zaharia
c7b5e5f1ec
Merge pull request #389 from JoshRosen/python_rdd_checkpointing
...
Add checkpointing to the Python API
2013-01-20 17:10:44 -08:00
Josh Rosen
9f211dd3f0
Fix PythonPartitioner equality; see SPARK-654.
...
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
2013-01-20 15:41:42 -08:00
Josh Rosen
00d70cd660
Clean up setup code in PySpark checkpointing tests
2013-01-20 15:38:11 -08:00
Josh Rosen
5b6ea9e9a0
Update checkpointing API docs in Python/Java.
2013-01-20 15:31:41 -08:00
Josh Rosen
d0ba80dc72
Add checkpointFile() and more tests to PySpark.
2013-01-20 13:59:45 -08:00
Josh Rosen
7ed1bf4b48
Add RDD checkpointing to Python API.
2013-01-20 13:19:19 -08:00
Matei Zaharia
8e7f098a2c
Added accumulators to PySpark
2013-01-20 01:57:44 -08:00
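An accumulator is a variable that workers can only add to, while the driver reads the merged value; a minimal sketch of that add/value contract (hypothetical class, not PySpark's implementation):

```python
class Accumulator:
    """Workers call add(); only the driver reads .value."""

    def __init__(self, value):
        self.value = value

    def add(self, term):
        self.value += term

acc = Accumulator(0)
for x in [1, 2, 3]:   # stands in for per-task updates on workers
    acc.add(x)
```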
Josh Rosen
b57dd0f160
Add mapPartitionsWithSplit() to PySpark.
2013-01-08 16:05:02 -08:00
Josh Rosen
33beba3965
Change PySpark RDD.take() to not call iterator().
2013-01-03 14:52:21 -08:00
Josh Rosen
b58340dbd9
Rename top-level 'pyspark' directory to 'python'
2013-01-01 15:05:00 -08:00