spark-instrumented-optimizer/python/pyspark/tests
HyukjinKwon 29c51d682b [SPARK-31788][CORE][DSTREAM][PYTHON] Recover the support of union for different types of RDD and DStreams
### What changes were proposed in this pull request?

This PR manually specifies the class of the input array used in `(SparkContext|StreamingContext).union`, fixing a regression introduced by SPARK-25737. For example:

```python
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
sc.union([pairRDD1, pairRDD1]).collect()
```

In the current master and `branch-3.0`, this fails with:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

whereas in Spark 2.4.5 the same snippet works and returns:

```
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
```

The previous code assumed the element class of the input array is always `JavaRDD` or `JavaDStream`; however, it can be a different class such as `JavaPairRDD`.

This fix is based on redsanket's initial approach, and will be co-authored.
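Below is a minimal sketch of the idea, assuming a hypothetical helper name `_union_typed`; the actual change lives inside `SparkContext.union` (and similarly `StreamingContext.union`), but it illustrates how choosing the concrete element class for py4j's typed array avoids the failed conversion:

```python
from py4j.java_gateway import is_instance_of


def _union_typed(sc, rdds):
    # Illustrative helper: pick the concrete JVM class of the first RDD
    # instead of always assuming org.apache.spark.api.java.JavaRDD.
    gw, jvm = sc._gateway, sc._jvm
    candidates = [
        jvm.org.apache.spark.api.java.JavaRDD,
        jvm.org.apache.spark.api.java.JavaPairRDD,
        jvm.org.apache.spark.api.java.JavaDoubleRDD,
    ]
    cls = next((c for c in candidates if is_instance_of(gw, rdds[0]._jrdd, c)), None)
    if cls is None:
        raise TypeError(
            "Unsupported Java RDD class %s"
            % rdds[0]._jrdd.getClass().getCanonicalName())

    # Creating the array with the matching class lets py4j assign each element
    # without attempting (and failing) a JavaPairRDD -> JavaRDD conversion.
    jrdds = gw.new_array(cls, len(rdds))
    for i in range(len(rdds)):
        jrdds[i] = rdds[i]._jrdd
    return jrdds  # handed to the JVM-side union afterwards
```

With the array element type matching `JavaPairRDD`, the `sc.union([pairRDD1, pairRDD1])` example above succeeds again.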

### Why are the changes needed?

To fix a regression relative to Spark 2.4.5.

### Does this PR introduce _any_ user-facing change?

No. The regression exists only in unreleased branches; this change fixes it before release.

### How was this patch tested?

Manually tested, and a unit test was added (sketched below).
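For reference, here is a self-contained sketch of that kind of regression test; the class name and assertion below are illustrative rather than the exact code added to `test_rdd.py`:

```python
import unittest

from pyspark import SparkContext


class UnionPairRDDTest(unittest.TestCase):
    """Sketch of a regression test for SPARK-31788 (illustrative names)."""

    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext("local[2]", "test_union_pair_rdd")

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_union_pair_rdd(self):
        # SPARK-31788: union must accept RDDs backed by JavaPairRDD (e.g. from zip).
        rdd = self.sc.parallelize([1, 2])
        pair_rdd = rdd.zip(rdd)
        result = self.sc.union([pair_rdd, pair_rdd]).collect()
        self.assertEqual(result, [(1, 1), (2, 2), (1, 1), (2, 2)])


if __name__ == "__main__":
    unittest.main()
```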

Closes #28648 from HyukjinKwon/SPARK-31788.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-01 09:43:03 +09:00
| File | Latest commit | Date |
|------|---------------|------|
| __init__.py | [SPARK-26036][PYTHON] Break large tests.py files into smaller files | 2018-11-15 12:30:52 +08:00 |
| test_appsubmit.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_broadcast.py | [SPARK-28486][CORE][PYTHON] Map PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC | 2019-08-05 20:18:53 +09:00 |
| test_conf.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_context.py | [SPARK-30969][CORE] Remove resource coordination support from Standalone | 2020-03-02 11:23:07 -08:00 |
| test_daemon.py | [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 | 2019-08-03 10:31:15 +09:00 |
| test_join.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_pin_thread.py | [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's | 2019-11-08 06:44:58 +09:00 |
| test_profiler.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_rdd.py | [SPARK-31788][CORE][DSTREAM][PYTHON] Recover the support of union for different types of RDD and DStreams | 2020-06-01 09:43:03 +09:00 |
| test_rddbarrier.py | [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier | 2019-10-23 13:46:09 +02:00 |
| test_readwrite.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_serializers.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_shuffle.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_taskcontext.py | [SPARK-30969][CORE] Remove resource coordination support from Standalone | 2020-03-02 11:23:07 -08:00 |
| test_util.py | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| test_worker.py | [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests | 2020-04-23 10:20:39 +09:00 |