## What changes were proposed in this pull request?
This PR enables using a grouped aggregate pandas UDFs as window functions. The semantics is the same as using SQL aggregation function as window functions.
```
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql import Window
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> pandas_udf("double", PandasUDFType.GROUPED_AGG)
... def mean_udf(v):
... return v.mean()
>>> w = Window.partitionBy('id')
>>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
+---+----+------+
| id| v|mean_v|
+---+----+------+
| 1| 1.0| 1.5|
| 1| 2.0| 1.5|
| 2| 3.0| 6.0|
| 2| 5.0| 6.0|
| 2|10.0| 6.0|
+---+----+------+
```
The scope of this PR is somewhat limited in terms of:
(1) Only supports unbounded window, which acts essentially as group by.
(2) Only supports aggregation functions, not "transform" like window functions (n -> n mapping)
Both of these are left as future work. Especially, (1) needs careful thinking w.r.t. how to pass rolling window data to python efficiently. (2) is a bit easier but does require more changes therefore I think it's better to leave it as a separate PR.
## How was this patch tested?
WindowPandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes#21082 from icexelloss/SPARK-22239-window-udf.
## What changes were proposed in this pull request?
Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s
## How was this patch tested?
Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way?
## License
This is my original work, licensed in the same way as spark
Author: e-dorigatti <emilio.dorigatti@gmail.com>
Author: edorigatti <emilio.dorigatti@gmail.com>
Closes#21383 from e-dorigatti/fix_spark_23754.
The exit() builtin is only for interactive use. applications should use sys.exit().
## What changes were proposed in this pull request?
All usage of the builtin `exit()` function is replaced by `sys.exit()`.
## How was this patch tested?
I ran `python/run-tests`.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Benjamin Peterson <benjamin@python.org>
Closes#20682 from benjaminp/sys-exit.
## What changes were proposed in this pull request?
Rename the public APIs and names of pandas udfs.
- `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF`
- `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF`
- `PANDAS GROUP AGG UDF` -> `GROUPED AGG PANDAS UDF`
## How was this patch tested?
The existing tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes#20428 from gatorsmile/renamePandasUDFs.
## What changes were proposed in this pull request?
Add support for using pandas UDFs with groupby().agg().
This PR introduces a new type of pandas UDF - group aggregate pandas UDF. This type of UDF defines a transformation of multiple pandas Series -> a scalar value. Group aggregate pandas UDFs can be used with groupby().agg(). Note group aggregate pandas UDF doesn't support partial aggregation, i.e., a full shuffle is required.
This PR doesn't support group aggregate pandas UDFs that return ArrayType, StructType or MapType. Support for these types is left for future PR.
## How was this patch tested?
GroupbyAggPandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes#19872 from icexelloss/SPARK-22274-groupby-agg.
## What changes were proposed in this pull request?
Pipe action convert objects into strings using a way that was affected by the default encoding setting of Python environment.
This patch fixed the problem. The detailed description is added here:
https://issues.apache.org/jira/browse/SPARK-20947
## How was this patch tested?
Run the following statement in pyspark-shell, and it will NOT raise exception if this patch is applied:
```python
sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
```
Author: 王晓哲 <wxz@linkdoc.com>
Closes#18277 from chaoslawful/fix_pipe_encoding_error.
## What changes were proposed in this pull request?
* Add a "function type" argument to pandas_udf.
* Add a new public enum class `PandasUdfType` in pyspark.sql.functions
* Refactor udf related code from pyspark.sql.functions to pyspark.sql.udf
* Merge "PythonUdfType" and "PythonEvalType" into a single enum class "PythonEvalType"
Example:
```
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf('double', PandasUDFType.SCALAR):
def plus_one(v):
return v + 1
```
## Design doc
https://docs.google.com/document/d/1KlLaa-xJ3oz28xlEJqXyCAHU3dwFYkFs_ixcUXrJNTc/edit
## How was this patch tested?
Added PandasUDFTests
## TODO:
* [x] Implement proper enum type for `PandasUDFType`
* [x] Update documentation
* [x] Add more tests in PandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes#19630 from icexelloss/spark-22409-pandas-udf-type.
## What changes were proposed in this pull request?
This modification increases the timeout for `serveIterator` (which is not dynamically configurable). This fixes timeout issues in pyspark when using `collect` and similar functions, in cases where Python may take more than a couple seconds to connect.
See https://issues.apache.org/jira/browse/SPARK-21551
## How was this patch tested?
Ran the tests.
cc rxin
Author: peay <peay@protonmail.com>
Closes#18752 from peay/spark-21551.
## What changes were proposed in this pull request?
At example of repartitionAndSortWithinPartitions at rdd.py, third argument should be True or False.
I proposed fix of example code.
## How was this patch tested?
* I rename test_repartitionAndSortWithinPartitions to test_repartitionAndSortWIthinPartitions_asc to specify boolean argument.
* I added test_repartitionAndSortWithinPartitions_desc to test False pattern at third argument.
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: chie8842 <chie8842@gmail.com>
Closes#18586 from chie8842/SPARK-21358.
## What changes were proposed in this pull request?
**Context**
While reviewing https://github.com/apache/spark/pull/17227, I realised here we type-dispatch per record. The PR itself is fine in terms of performance as is but this prints a prefix, `"obj"` in exception message as below:
```
from pyspark.sql.types import *
schema = StructType([StructField('s', IntegerType(), nullable=False)])
spark.createDataFrame([["1"]], schema)
...
TypeError: obj.s: IntegerType can not accept object '1' in type <type 'str'>
```
I suggested to get rid of this but during investigating this, I realised my approach might bring a performance regression as it is a hot path.
Only for SPARK-19507 and https://github.com/apache/spark/pull/17227, It needs more changes to cleanly get rid of the prefix and I rather decided to fix both issues together.
**Propersal**
This PR tried to
- get rid of per-record type dispatch as we do in many code paths in Scala so that it improves the performance (roughly ~25% improvement) - SPARK-21296
This was tested with a simple code `spark.createDataFrame(range(1000000), "int")`. However, I am quite sure the actual improvement in practice is larger than this, in particular, when the schema is complicated.
- improve error message in exception describing field information as prose - SPARK-19507
## How was this patch tested?
Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
Benchmark - codes: https://gist.github.com/HyukjinKwon/c3397469c56cb26c2d7dd521ed0bc5a3
Error message - codes: https://gist.github.com/HyukjinKwon/b1b2c7f65865444c4a8836435100e398
**Before**
Benchmark:
- Results: https://gist.github.com/HyukjinKwon/4a291dab45542106301a0c1abcdca924
Error message
- Results: https://gist.github.com/HyukjinKwon/57b1916395794ce924faa32b14a3fe19
**After**
Benchmark
- Results: https://gist.github.com/HyukjinKwon/21496feecc4a920e50c4e455f836266e
Error message
- Results: https://gist.github.com/HyukjinKwon/7a494e4557fe32a652ce1236e504a395Closes#17227
Author: hyukjinkwon <gurwls223@gmail.com>
Author: David Gingrich <david@textio.com>
Closes#18521 from HyukjinKwon/python-type-dispatch.
## What changes were proposed in this pull request?
Improve combineByKey documentation:
* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners
## How was this patch tested?
Doctest.
## Legal
This is my original work and I license the work to the project under the project’s open source license.
Author: David Gingrich <david@textio.com>
Closes#17545 from dgingrich/topic-spark-20232-combinebykey-docs.
## What changes were proposed in this pull request?
This PR proposes to use the correct deserializer, `BatchedSerializer` for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, it is passing `UTF8Deserializer` as is not `BatchedSerializer` from the copied one.
with the file, `text.txt` below:
```
a
b
d
e
f
g
h
i
j
k
l
```
- Before
```python
>>> sc.textFile('text.txt').repartition(1).collect()
```
```
UTF8Deserializer(True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/rdd.py", line 811, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
yield self.loads(stream)
File ".../spark/python/pyspark/serializers.py", line 544, in loads
return s.decode("utf-8") if self.use_unicode else s
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
```
- After
```python
>>> sc.textFile('text.txt').repartition(1).collect()
```
```
[u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
```
## How was this patch tested?
Unit test in `python/pyspark/tests.py`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17282 from HyukjinKwon/SPARK-19872.
## What changes were proposed in this pull request?
self.environment will be propagated to executor. Should set PYTHONHASHSEED as long as the python version is greater than 3.3
## How was this patch tested?
Manually tested it.
Author: Jeff Zhang <zjffdu@apache.org>
Closes#11211 from zjffdu/SPARK-13330.
## What changes were proposed in this pull request?
There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()
row = next(it)
df2 = df.repartition(1000) # create many empty partitions which increase materialization time so causing timeout
it2 = df2.toLocalIterator()
row = next(it2)
The cause of this issue is, we open a socket to serve the data from JVM side. We set timeout for connection and reading through the socket in Python side. In Python we use a generator to read the data, so we only begin to connect the socket once we start to ask data from it. If we don't consume it immediately, there is connection timeout.
In the other side, the materialization time for RDD partitions is unpredictable. So we can't set a timeout for reading data through the socket. Otherwise, it is very possibly to fail.
## How was this patch tested?
Added tests into PySpark.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#16263 from viirya/fix-pyspark-localiterator.
## What changes were proposed in this pull request?
Expose RDD's localCheckpoint() and associated functions in PySpark.
## How was this patch tested?
I added a UnitTest in python/pyspark/tests.py which passes.
I certify that this is my original work, and I license it to the project under the project's open source license.
Gabriel HUANG
Developer at Cardabel (http://cardabel.com/)
Author: Gabriel Huang <gabi.xiaohuang@gmail.com>
Closes#15811 from gabrielhuang/pyspark-localcheckpoint.
## What changes were proposed in this pull request?
I found the documentation for the sample method to be confusing, this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Closes#15815 from anabranch/SPARK-18365.
## What changes were proposed in this pull request?
This change is a followup for #15389 which calls `_to_java_object_rdd()` to solve this issue. Due to the concern of the possible expensive cost of the call, we can choose to decrease the batch size to solve this issue too.
Simple benchmark:
import time
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2)
start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)
Before: 419.447577953
_to_java_object_rdd(): 421.916361094
decreasing the batch size: 423.712255955
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#15445 from viirya/repartition-batch-size.
## What changes were proposed in this pull request?
Quoted from JIRA description:
Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.
Please reference the following code for a reproducible example of this issue:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2) # start with 2 even partitions
l = a.repartition(num_partitions).glom().map(len).collect() # get length of each partition
min(l), max(l), sum(l)/len(l), len(l) # skewed!
In Scala's `repartition` code, we will distribute elements evenly across output partitions. However, the RDD from Python is serialized as a single binary data, so the distribution fails. We need to convert the RDD in Python to java object before repartitioning.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#15389 from viirya/pyspark-rdd-repartition.
## What changes were proposed in this pull request?
doc change only
## How was this patch tested?
doc change only
Author: Mortada Mehyar <mortada.mehyar@gmail.com>
Closes#14253 from mortada/histogram_typos.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. Also add a note about default value in one place. Copy some extended docs from scala for GBT
## How was this patch tested?
Built docs locally.
Author: Holden Karau <holden@us.ibm.com>
Closes#12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
## What changes were proposed in this pull request?
This fix tries to address the issue in PySpark where `spark.python.worker.memory`
could only be configured with a lower case unit (`k`, `m`, `g`, `t`). This fix
allows the upper case unit (`K`, `M`, `G`, `T`) to be used as well. This is to
conform to the JVM memory string as is specified in the documentation .
## How was this patch tested?
This fix adds additional test to cover the changes.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#12163 from yongtang/SPARK-14368.
## What changes were proposed in this pull request?
RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).
This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.
The JDBC server has been updated to use DataFrame.toIterator.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12114 from davies/local_iterator.
## What changes were proposed in this pull request?
When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear.
## How was the this patch tested?
by existing unit tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11342 from cloud-fan/python-clean.
Clarify that reduce functions need to be commutative, and fold functions do not
See https://github.com/apache/spark/pull/11091
Author: Sean Owen <sowen@cloudera.com>
Closes#11217 from srowen/SPARK-13339.
rxin srowen
I work out note message for rdd.take function, please help to review.
If it's fine, I can apply to all other function later.
Author: Tommy YU <tummyyu@163.com>
Closes#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
Fix order of arguments that Pyspark RDD.fold passes to its op - should be (acc, obj) like other implementations.
Obviously, this is a potentially breaking change, so can only happen for 2.x
CC davies
Author: Sean Owen <sowen@cloudera.com>
Closes#10771 from srowen/SPARK-7683.
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#10092 from gatorsmile/persistStorageLevel.
from the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combinedByKey), but as far as I can tell from the Pyspark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8831 from JoshRosen/remove-ability-to-disable-spilling.
JIRA: https://issues.apache.org/jira/browse/SPARK-10642
When calling `rdd.lookup()` on a RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#8796 from viirya/fix-pyrdd-lookup.
Spark has an option called spark.localExecution.enabled; according to the docs:
> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
This pull request simply brings #7484 up to date.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#7585 from rxin/remove-local-exec and squashes the following commits:
84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.
This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
As an illustrative example, submit the following to `spark-submit`:
```
from pyspark import SparkConf, SparkContext
import collections
def updateCounter(acc, val):
print 'update acc:', acc
print 'update val:', val
acc[val] += 1
return acc
def comboCounter(acc1, acc2):
print 'combo acc1:', acc1
print 'combo acc2:', acc2
acc1.update(acc2)
return acc1
def main():
conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
sc = SparkContext(conf = conf)
print '======= AGGREGATING with ONE PARTITION ======='
print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)
print '======= AGGREGATING with TWO PARTITIONS ======='
print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)
if __name__ == "__main__":
main()
```
One probably expects this to output the following:
```
Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
```
But it instead outputs this (regardless of the number of partitions):
```
Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
```
This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
Author: Nicholas Hwang <moogling@gmail.com>
Closes#7378 from njhwang/master and squashes the following commits:
659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1.
90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
This will allow problems with piped commands to be detected.
This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).
Author: Scott Taylor <github@megatron.me.uk>
Closes#6262 from megatron-me-uk/patch-2 and squashes the following commits:
04ae1d5 [Scott Taylor] Remove spurious empty line
98fa101 [Scott Taylor] fix blank line style error
574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
eb4801c [Scott Taylor] fix fail_condition
b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
a307d13 [Scott Taylor] update rdd tests to test pipe modes
34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
a0c0161 [Scott Taylor] fix generator issue
8a9ef9c [Scott Taylor] make check_return_code an iterator
0486ae3 [Scott Taylor] style fixes
8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
4153b02 [Scott Taylor] fix list.sort returns None
491d3fc [Scott Taylor] Pass a function handle to assertRaises
3344a21 [Scott Taylor] wrap assertRaises with QuietTest
3ab8c7a [Scott Taylor] remove whitespace for style
cc1a73d [Scott Taylor] fix style issues in pipe test
8db4073 [Scott Taylor] Add a test for rdd pipe functions
1b3dc4e [Scott Taylor] fix missing space around operator style
0974f98 [Scott Taylor] add space between words in multiline string
45f4977 [Scott Taylor] fix line too long style error
5745d85 [Scott Taylor] Remove space to fix style
f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
Capture the AnalysisException in SQL, hide the long java stack trace, only show the error message.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#7135 from davies/ananylis and squashes the following commits:
dad7ae7 [Davies Liu] add comment
ec0c0e8 [Davies Liu] Update utils.py
cdd7edd [Davies Liu] add doc
7b044c2 [Davies Liu] fix python 3
f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
Method "_load_from_socket" in rdd.py cannot load data from jvm socket when ipv6 is used. The current method only works well with ipv4. New modification should work around both two protocols.
Author: Ai He <ai.he@ussuning.com>
Author: AiHe <ai.he@ussuning.com>
Closes#6338 from AiHe/pyspark-networking-issue and squashes the following commits:
d4fc9c4 [Ai He] handle code review 2
e75c5c8 [Ai He] handle code review
5644953 [AiHe] solve python rdd socket connection problem to jvm
A minor change but one which is (presumably) visible on the public api docs webpage.
Author: Scott Taylor <github@megatron.me.uk>
Closes#6942 from megatron-me-uk/patch-3 and squashes the following commits:
fbed000 [Scott Taylor] test the absolute error in approx doctests
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.
Author: zsxwing <zsxwing@gmail.com>
Closes#6826 from zsxwing/python-emptyRDD and squashes the following commits:
b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
Document current limitation of rdd.fold.
This does not resolve SPARK-6416 but just documents the issue.
CC JoshRosen
Author: Sean Owen <sowen@cloudera.com>
Closes#6231 from srowen/SPARK-6416 and squashes the following commits:
9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
da40d84 [Sean Owen] Document current limitation of rdd.fold.
This PR revert #5404, change to pass the version of python in driver into JVM, check it in worker before deserializing closure, then it can works with different major version of Python.
Author: Davies Liu <davies@databricks.com>
Closes#6203 from davies/py_version and squashes the following commits:
b8fb76e [Davies Liu] fix test
6ce5096 [Davies Liu] use string for version
47c6278 [Davies Liu] check python version of worker with driver
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in countApproxDistinct