https://spark-project.atlassian.net/browse/SPARK-1240
The current implementation does not appear to handle the empty-RDD case when running takeSample.
In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; in sample(), I also add a check for an invalid fraction value.
I also add several lines to the test case to cover this scenario.
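A minimal sketch of the two guards, written in plain Python over a list rather than a real RDD. The function names `sample` and `take_sample`, the error messages, and the assumed valid range of `fraction` are illustrative assumptions, not the code in this patch.

```python
import random


def sample(items, fraction, seed=None):
    """Bernoulli-sample a list, rejecting an invalid fraction up front."""
    # Assumption: for sampling without replacement, a valid fraction is in [0.0, 1.0].
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("Invalid fraction value: %s" % fraction)
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def take_sample(items, num, seed=None):
    """Return up to `num` randomly chosen elements, or [] for empty input."""
    if num < 0:
        raise ValueError("Negative number of elements requested")
    # The kind of guard described above: return early before any sampling happens.
    if len(items) == 0 or num == 0:
        return []
    rng = random.Random(seed)
    return rng.sample(items, min(num, len(items)))
```

With these guards, `take_sample([], 5)` returns an empty list instead of reaching the sampling step.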
Author: CodingCat <zhunansjtu@gmail.com>
Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty RDD
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
ece1fa4 [Prashant Sharma] Added top in python.
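As a rough illustration of how a `top` operation can work over partitioned data, the sketch below keeps the largest `num` elements of each partition and then merges the candidates. It models partitions as plain lists; the function name and structure are assumptions, not the code added by this commit.

```python
import heapq


def top(partitions, num):
    """Return the `num` largest elements across all partitions, descending."""
    # Step 1: each partition independently keeps only its own top candidates.
    per_partition = [heapq.nlargest(num, part) for part in partitions]
    # Step 2: merge the small candidate lists and pick the overall winners.
    merged = [x for part in per_partition for x in part]
    return heapq.nlargest(num, merged)
```

Only `num` elements per partition need to be combined, which keeps the merge step small regardless of partition size.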
Author: prabinb <prabin.banka@imaginea.com>
Closes #92 from prabinb/python-api-rdd and squashes the following commits:
51129ca [prabinb] Added missing Python RDD functions. Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
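The semantics of a fold-by-key can be sketched in plain Python as follows. This is a single-partition illustration only; it ignores how the zero value is applied across partitions, and the name `fold_by_key` is a hypothetical stand-in for the PySpark method.

```python
from operator import add


def fold_by_key(pairs, zero_value, func):
    """Fold the values of each key with `func`, starting from `zero_value`."""
    result = {}
    for key, value in pairs:
        # The first time a key is seen, the fold starts from the zero value.
        acc = result[key] if key in result else zero_value
        result[key] = func(acc, value)
    return result


totals = fold_by_key([("a", 1), ("b", 1), ("a", 1)], 0, add)
```

Here `totals` maps each key to the sum of its values, so `"a"` maps to 2 and `"b"` to 1.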
Author: jyotiska <jyotiska123@gmail.com>
Closes #34 from jyotiska/pyspark_code and squashes the following commits:
c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py
This change was raised earlier as part of apache/incubator-spark#486
Author: Prabin Banka <prabin.banka@imaginea.com>
Closes #76 from prabinb/python-api-zip and squashes the following commits:
b1a31a0 [Prabin Banka] Added Python RDD.zip function
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
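To illustrate why an intersection needs a shuffle, the sketch below first routes equal elements from both inputs to the same bucket by hash, then intersects bucket by bucket. It operates on plain lists; the function name, the `num_partitions` parameter, and the set-based approach are assumptions for illustration, not the implementation in these commits.

```python
def intersection(left, right, num_partitions=2):
    """Return the distinct elements present in both inputs."""

    def shuffle(items):
        # Route each element to a bucket by hash so that equal elements
        # from either input always land in the same bucket index.
        buckets = [set() for _ in range(num_partitions)]
        for x in items:
            buckets[hash(x) % num_partitions].add(x)
        return buckets

    common = []
    for left_bucket, right_bucket in zip(shuffle(left), shuffle(right)):
        # After the shuffle, each bucket pair can be intersected locally.
        common.extend(left_bucket & right_bucket)
    return common
```

Without the hash-based routing, matching elements could sit in unrelated partitions and never be compared.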
The following Python APIs are added:
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()
This change was raised earlier as part of apache/incubator-spark#486
Author: Prabin Banka <prabin.banka@imaginea.com>
Closes #75 from prabinb/python-api-backup and squashes the following commits:
cc3c6cd [Prabin Banka] Added missing Python APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:
1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
Updated doctests for mapValues and flatMapValues in rdd.py
Author: jyotiska <jyotiska123@gmail.com>
Closes #621 from jyotiska/python_spark and squashes the following commits:
716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
Add collectPartition to JavaRDD interface.
This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.
Thanks @concretevitamin for the original change and tests.
Change the implementation to use runJob instead of PartitionPruningRDD.
Also update the unit tests and the python take implementation
to use the new interface.
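A hedged sketch of how `take` can be built on a per-partition collect: fetch one partition at a time and stop as soon as enough elements have been gathered. Partitions are modeled as plain lists, and `collect_partition` here is a local stand-in, not the JavaRDD method itself.

```python
def collect_partition(partitions, index):
    """Stand-in for fetching all elements of a single partition."""
    return list(partitions[index])


def take(partitions, num, fetch=collect_partition):
    """Return the first `num` elements, scanning partitions in order."""
    items = []
    for index in range(len(partitions)):
        if len(items) >= num:
            break  # Enough elements gathered; later partitions are never fetched.
        items.extend(fetch(partitions, index))
    return items[:num]
```

Because the loop exits early, a small `take` over a large dataset touches only the leading partitions.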
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
This also fixes a bug in SparkContext.union().
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
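The idea of pluggable serializers can be sketched as classes that share a `dumps`/`loads` interface, combined with a length-prefixed framing so a reader can split the byte stream back into objects. The class bodies, the helper names `write_stream` and `read_stream`, and the 4-byte big-endian length prefix are assumptions for illustration, not the exact PySpark code.

```python
import io
import marshal
import pickle
import struct


class MarshalSerializer:
    """Serializes with the stdlib marshal module (built-in types only)."""

    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, data):
        return marshal.loads(data)


class PickleSerializer:
    """Serializes with pickle, which handles a wider range of objects."""

    def dumps(self, obj):
        return pickle.dumps(obj)

    def loads(self, data):
        return pickle.loads(data)


def write_stream(serializer, objects, stream):
    """Write each object as a 4-byte length followed by its payload."""
    for obj in objects:
        payload = serializer.dumps(obj)
        stream.write(struct.pack("!i", len(payload)))
        stream.write(payload)


def read_stream(serializer, stream):
    """Yield objects until the stream has no complete length header left."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!i", header)
        yield serializer.loads(stream.read(length))
```

Since both sides agree on the serializer in advance, the reader knows what kind of payload to expect and needs no extra wrapping around each record.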
Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
Scala 2.10 uses a shorter version string when naming artifacts,
so the artifacts now use the shorter Scala version, which is defined as a property in the pom.
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
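A sketch of why the actual partition ID matters: with range partitioning, the Python side decides the bucket from the key itself, so the bucket order follows the key order. Hashing the serialized bytes of a key would scatter keys with no regard to order. The helper names and the boundary list are hypothetical, and the buckets stand in for what the JVM-side partitioner would receive.

```python
import bisect
import pickle


def range_partition_func(boundaries):
    """Map a key to the index of the key range it falls into."""
    return lambda key: bisect.bisect_left(boundaries, key)


def partition_by(pairs, num_partitions, partition_func):
    """Place each serialized pair in the bucket chosen on the Python side."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # The partition ID is computed from the real key, not from its bytes.
        partition_id = partition_func(key) % num_partitions
        buckets[partition_id].append(pickle.dumps((key, value)))
    return buckets


buckets = partition_by(
    [(15, "b"), (5, "a"), (25, "c")], 3, range_partition_func([10, 20])
)
decoded = [[pickle.loads(blob) for blob in bucket] for bucket in buckets]
```

Each bucket in `decoded` holds only keys from its own range, so sorting within each bucket and concatenating the buckets yields a globally sorted result.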
This fixes SPARK-832, an issue where PySpark
would not work when the master and workers used
different SPARK_HOME paths.
This change may potentially break code that relied
on the master's PYTHONPATH being used on workers.
To have custom PYTHONPATH additions used on the
workers, users should set a custom PYTHONPATH in
spark-env.sh rather than setting it in the shell.
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded. The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.
I also made the gateway and jvm variables private.
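The lazy-initialization pattern described here can be sketched as follows. The class name `Context`, the `launches` counter, and the placeholder gateway object are illustrative assumptions; the real code launches a JVM gateway rather than creating a dummy object.

```python
import threading


class Context:
    _gateway = None            # Private and unset until first needed.
    _lock = threading.Lock()
    launches = 0               # Demonstration counter only.

    def __init__(self):
        # Initialization happens on construction, not at module import time.
        Context._ensure_initialized()

    @classmethod
    def _ensure_initialized(cls):
        with cls._lock:
            if cls._gateway is None:
                cls._gateway = cls._launch_gateway()

    @classmethod
    def _launch_gateway(cls):
        cls.launches += 1
        return object()        # Placeholder for the expensive resource.
```

Importing the module leaves `_gateway` as `None`; the expensive launch runs once, on the first instance, and later instances reuse it.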
This change results in a ~3-4x performance improvement when running the
PySpark unit tests.
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
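One way to account for the partitioning function in equality is to compare an identifier for it alongside the partition count, sketched below in Python. The field name `partition_func_id` and the use of an integer ID are assumptions for illustration; the actual class lives on the JVM side.

```python
class PythonPartitioner:
    def __init__(self, num_partitions, partition_func_id):
        self.num_partitions = num_partitions
        # Hypothetical identifier for the Python-side partitioning function.
        self.partition_func_id = partition_func_id

    def __eq__(self, other):
        # Two partitioners match only if they split data the same way:
        # same number of partitions AND same partitioning function.
        return (
            isinstance(other, PythonPartitioner)
            and self.num_partitions == other.num_partitions
            and self.partition_func_id == other.partition_func_id
        )

    def __hash__(self):
        return hash((self.num_partitions, self.partition_func_id))
```

If only the partition count were compared, two datasets partitioned by different functions could be wrongly treated as co-partitioned.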