Since Python does not have a library for a max heap, and the usual tricks such as inverting values do not work in all cases,
we have our own implementation of a max heap.
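A minimal sketch of the idea, illustrative only and not the actual PySpark implementation: a size-bounded max-heap built by wrapping items so their ordering is reversed, which also works for keys (e.g. strings) that cannot simply be negated.

```python
import functools
import heapq

# Illustrative sketch only: a size-bounded max-heap used to keep the k
# smallest elements (what takeOrdered needs). heapq is a min-heap, so items
# are wrapped to reverse comparisons instead of negating values, which only
# works for numbers.
@functools.total_ordering
class _Reversed(object):
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return self.value == other.value
    def __lt__(self, other):
        return other.value < self.value  # reversed ordering

def k_smallest(iterator, k):
    heap = []  # holds _Reversed items; heap[0] wraps the current maximum
    for item in iterator:
        if len(heap) < k:
            heapq.heappush(heap, _Reversed(item))
        elif item < heap[0].value:  # smaller than the current maximum
            heapq.heapreplace(heap, _Reversed(item))
    return sorted(w.value for w in heap)

print(k_smallest(["pear", "apple", "kiwi", "fig", "plum"], 3))
# ['apple', 'fig', 'kiwi']
```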
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
35f86ba [Prashant Sharma] code review
2b1124d [Prashant Sharma] fixed tests
e8a08e2 [Prashant Sharma] Code review comments.
49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:
f316266 [Prashant Sharma] Minor change in comment.
58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
Doctest added for map in rdd.py
Author: Jyotiska NK <jyotiska123@gmail.com>
Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:
a38527f [Jyotiska NK] Added doctest for map function in rdd.py
Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.
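A short usage sketch of the new methods (assumes pyspark is installed; not part of the patch itself):

```python
from pyspark import SparkContext

sc = SparkContext("local", "minmax-demo")
rdd = sc.parallelize([2.0, 5.0, 1.0, 7.0])
print(rdd.min())    # 1.0
print(rdd.max())    # 7.0
print(rdd.stats())  # StatCounter now reports min and max alongside count, mean, stdev
sc.stop()
```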
Author: Dan McClary <dan.mcclary@gmail.com>
Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
fd3fd4b [Dan McClary] fixed error, updated test
82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
1e7056d [Dan McClary] added underscore to getBucket
37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
29981f2 [Dan McClary] fixed indentation on doctest comment
eaf89d9 [Dan McClary] added correct doctest for histogram
4916016 [Dan McClary] added histogram method, added max and min to statscounter
https://spark-project.atlassian.net/browse/SPARK-1240
It seems that the current implementation does not handle the empty RDD case when running takeSample.
In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty array when the RDD is empty; also in sample(), I add a check for invalid fraction values.
In the test cases, I also add several lines covering this case.
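A plain-Python sketch of the two checks described above (the function names and exact behavior here are assumptions for illustration, not Spark's code):

```python
import random

def take_sample(items, num, seed=None):
    if num < 0:
        raise ValueError("num must be non-negative")
    if len(items) == 0:          # empty-RDD case: return an empty result
        return []
    rng = random.Random(seed)
    return rng.sample(items, min(num, len(items)))

def sample(items, fraction, seed=None):
    if not 0.0 <= fraction <= 1.0:   # invalid-fraction check
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

print(take_sample([], 3))               # []
print(sample(range(10), 0.5, seed=7))   # roughly half of the elements
```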
Author: CodingCat <zhunansjtu@gmail.com>
Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty RDD
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
ece1fa4 [Prashant Sharma] Added top in python.
Author: prabinb <prabin.banka@imaginea.com>
Closes #92 from prabinb/python-api-rdd and squashes the following commits:
51129ca [prabinb] Added missing Python RDD functions. Added __repr__ function to the StorageLevel class. Added doctest for RDD.getStorageLevel().
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
Author: jyotiska <jyotiska123@gmail.com>
Closes #34 from jyotiska/pyspark_code and squashes the following commits:
c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py
This was raised earlier as part of apache/incubator-spark#486.
Author: Prabin Banka <prabin.banka@imaginea.com>
Closes #76 from prabinb/python-api-zip and squashes the following commits:
b1a31a0 [Prabin Banka] Added Python RDD.zip function
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>
Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
The following Python APIs are added:
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()
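A hedged usage sketch of the APIs listed above (the pool name and job-group id are arbitrary examples, not values from the patch):

```python
from pyspark import SparkContext

sc = SparkContext("local", "api-demo")
rdd = sc.parallelize(range(10))
print(rdd.id())                                    # unique id of this RDD
sc.setJobGroup("nightly-load", "example job group")
sc.setLocalProperty("spark.scheduler.pool", "default")
print(sc.getLocalProperty("spark.scheduler.pool"))
print(sc.sparkUser())                              # user running the SparkContext
sc.stop()
```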
This was raised earlier as part of apache/incubator-spark#486.
Author: Prabin Banka <prabin.banka@imaginea.com>
Closes #75 from prabinb/python-api-backup and squashes the following commits:
cc3c6cd [Prabin Banka] Added missing Python APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:
1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
Updated doctests for mapValues and flatMapValues in rdd.py
Author: jyotiska <jyotiska123@gmail.com>
Closes #621 from jyotiska/python_spark and squashes the following commits:
716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
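The added doctests are along these lines (the input data and outputs mirror PySpark's documented behavior for mapValues and flatMapValues):

```python
from pyspark import SparkContext

sc = SparkContext("local", "doctest-demo")
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
print(x.mapValues(len).collect())
# [('a', 3), ('b', 1)]
y = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
print(y.flatMapValues(lambda v: v).collect())
# [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
sc.stop()
```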
Add collectPartition to JavaRDD interface.
This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.
Thanks @concretevitamin for the original change and tests.
Change the implementation to use runJob instead of PartitionPruningRDD.
Also update the unit tests and the python take implementation
to use the new interface.
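A plain-Python sketch (not Spark's code) of the pattern this enables: implement `take` by collecting one partition at a time until enough elements have been gathered.

```python
# Partitions are modeled as plain lists; reading partitions[i] stands in for
# collectPartition(i) on the serialized data.
def take(partitions, n):
    collected = []
    for i in range(len(partitions)):
        if len(collected) >= n:
            break
        collected.extend(partitions[i])
    return collected[:n]

print(take([[1, 2], [3, 4, 5], [6]], 4))   # [1, 2, 3, 4] -- only two partitions fetched
```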
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
This also fixes a bug in SparkContext.union().
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
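A short usage sketch of selecting the Marshal-based serializer when constructing a SparkContext (this mirrors the example in PySpark's own documentation; assumes pyspark is installed):

```python
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

sc = SparkContext("local", "marshal-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(10)).map(lambda x: 2 * x).collect())
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
sc.stop()
```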
Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
In Scala 2.10, a shorter representation is used for naming artifacts,
so the artifact names now use the shorter Scala version, which is defined as a property in the pom.
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
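A purely illustrative contrast of the two approaches in plain Python (an assumption about the mechanics rather than Spark's code): hashing the pickled key scatters keys arbitrarily, while sorting needs each key routed to a specific partition index.

```python
import pickle

# Old behavior (illustrative): partition chosen by hashing the pickled key,
# so the partition index carries no ordering information.
def hash_partition(key, num_partitions):
    return hash(pickle.dumps(key)) % num_partitions

# What sorting needs (illustrative): the actual partition id computed on the
# Python side, e.g. from range boundaries, passed through unchanged.
def range_partition(key, boundaries):
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

print([hash_partition(k, 2) for k in ["a", "b", "c", "d"]])       # arbitrary
print([range_partition(k, ["c"]) for k in ["a", "b", "c", "d"]])  # [0, 0, 1, 1]
```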
This fixes SPARK-832, an issue where PySpark
would not work when the master and workers used
different SPARK_HOME paths.
This change may potentially break code that relied
on the master's PYTHONPATH being used on workers.
To have custom PYTHONPATH additions used on the
workers, users should set a custom PYTHONPATH in
spark-env.sh rather than setting it in the shell.
The problem was that the gateway was being initialized whenever the
pyspark.context module was loaded. The fix uses lazy initialization
that occurs only when SparkContext instances are actually constructed.
I also made the gateway and jvm variables private.
This change results in ~3-4x performance improvement when running the
PySpark unit tests.
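A minimal sketch of the lazy-initialization pattern described above (illustrative only; launch_gateway is stubbed here so the example runs on its own and stands in for starting the real Py4J gateway):

```python
def launch_gateway():
    # Stand-in for launching the Py4J gateway / JVM.
    return object()

_gateway = None  # module-level handle, but nothing is started at import time

def _ensure_gateway():
    """Create the gateway only on first use."""
    global _gateway
    if _gateway is None:
        _gateway = launch_gateway()
    return _gateway

class SparkContext(object):
    def __init__(self, master, app_name):
        # The gateway and JVM handles stay private and are initialized only
        # when a SparkContext is actually constructed.
        self._gateway = _ensure_gateway()

if __name__ == "__main__":
    sc = SparkContext("local", "demo")   # gateway starts here, not at import
```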
PythonPartitioner did not take the Python-side partitioning function
into account when checking for equality, which might cause problems
in the future.
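An illustrative Python sketch of the fix described above; the real PythonPartitioner is Scala code, and `python_func_id` here is a stand-in for however the Python-side partitioning function is identified.

```python
class PythonPartitioner(object):
    def __init__(self, num_partitions, python_func_id):
        self.num_partitions = num_partitions
        self.python_func_id = python_func_id

    def __eq__(self, other):
        # Equality must also compare the Python-side partitioning function,
        # not just the number of partitions.
        return (isinstance(other, PythonPartitioner)
                and self.num_partitions == other.num_partitions
                and self.python_func_id == other.python_func_id)

    def __hash__(self):
        return hash((self.num_partitions, self.python_func_id))

print(PythonPartitioner(4, 1) == PythonPartitioner(4, 2))   # False: different functions
```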