spark-instrumented-optimizer

schintap a61911c50c [SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs

### What changes were proposed in this pull request?
UnionRDD of PairRDDs causing a bug. The fix is to check for instance type before proceeding

### Why are the changes needed?
Changes are needed to avoid users running into issues with union rdd operation with any other type other than JavaRDD.

### Does this PR introduce _any_ user-facing change?
Yes

Before:
SparkSession available as 'spark'.
>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
    at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
    at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
    at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

After:
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
>>> unionRDD1.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]

### How was this patch tested?
Tested with the reproduced piece of code above manually

Closes #28603 from redsanket/SPARK-31788.

Authored-by: schintap <schintap@verizonmedia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

2020-05-25 10:29:08 +09:00
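The idea described in the commit message, checking the instance type before filling the array, can be illustrated without a Spark installation. The sketch below is not the actual patch: `JavaRDD`, `JavaPairRDD`, and `JavaDoubleRDD` are plain-Python stand-ins for the JVM classes, and `TypedArray` and `build_union_array` are hypothetical names that only mimic how a typed Java array rejects elements of the wrong class.

```python
# Plain-Python stand-ins for the JVM wrapper classes (hypothetical, for illustration only).
class JavaRDD:
    pass


class JavaPairRDD:
    pass


class JavaDoubleRDD:
    pass


class TypedArray:
    """Mimics a typed Java array: assignment fails for elements of another class."""

    def __init__(self, element_cls, size):
        self.element_cls = element_cls
        self._items = [None] * size

    def __setitem__(self, index, value):
        if not isinstance(value, self.element_cls):
            raise TypeError(
                "Cannot convert %s to %s"
                % (type(value).__name__, self.element_cls.__name__)
            )
        self._items[index] = value

    def __len__(self):
        return len(self._items)


# Assumption: these are the element classes worth distinguishing.
SUPPORTED_CLASSES = (JavaRDD, JavaPairRDD, JavaDoubleRDD)


def build_union_array(jrdds):
    """Pick the array element class from the first item instead of assuming JavaRDD."""
    first = jrdds[0]
    for candidate in SUPPORTED_CLASSES:
        if isinstance(first, candidate):
            element_cls = candidate
            break
    else:
        raise TypeError("Unsupported RDD class: %s" % type(first).__name__)

    array = TypedArray(element_cls, len(jrdds))
    for i, jrdd in enumerate(jrdds):
        array[i] = jrdd
    return array


if __name__ == "__main__":
    pairs = [JavaPairRDD(), JavaPairRDD()]
    print(build_union_array(pairs).element_cls.__name__)
```

Hard-coding the element class as `JavaRDD` in this model reproduces the "Cannot convert" failure from the "Before" session, while selecting the class from the first element lets a list of pair RDDs through.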
ml	[SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters	2020-05-18 20:25:02 +09:00
mllib	[SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters	2020-05-18 20:25:02 +09:00
resource	[SPARK-31748][PYTHON] Document resource module in PySpark doc and rename/move classes	2020-05-19 17:09:37 -07:00
sql	[SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters	2020-05-18 20:25:02 +09:00
streaming	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3	2019-09-09 10:19:40 -05:00
testing	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
tests	[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs	2020-05-25 10:29:08 +09:00
__init__.py	[SPARK-31767][PYTHON][CORE] Remove ResourceInformation in pyspark module's namespace	2020-05-19 22:36:36 -07:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
broadcast.py	[SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0	2019-10-03 19:20:51 +09:00
cloudpickle.py	[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8	2019-10-22 16:18:34 +09:00
conf.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
context.py	[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs	2020-05-25 10:29:08 +09:00
daemon.py	[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon	2019-07-31 09:10:24 +09:00
files.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
find_spark_home.py	[SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake	2020-04-09 11:04:35 +09:00
heapq3.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
java_gateway.py	[SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests	2020-04-23 10:20:39 +09:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
rdd.py	[SPARK-31748][PYTHON] Document resource module in PySpark doc and rename/move classes	2020-05-19 17:09:37 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove deprecation warnings	2019-12-10 11:08:13 -08:00
serializers.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
shell.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
shuffle.py	[SPARK-25696] The storage memory displayed on spark Application UI is…	2018-12-10 18:27:01 -06:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
taskcontext.py	[SPARK-31344][CORE] Polish implementation of barrier() and allGather()	2020-04-16 21:23:32 -07:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests	2020-04-23 10:20:39 +09:00
version.py	[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT	2020-02-25 19:44:31 -08:00
worker.py	[SPARK-31748][PYTHON] Document resource module in PySpark doc and rename/move classes	2020-05-19 17:09:37 -07:00