spark-instrumented-optimizer/python/pyspark/tests
Weichen Xu ee1de66fe4 [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
### What changes were proposed in this pull request?
I add a new API in pyspark RDD class:

def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)

This API do the same thing with `rdd.collect`, but it can specify the job group when do collect.
The purpose of adding this API is, if we use:

```
sc.setJobGroup("group-id...")
rdd.collect()
```
The `setJobGroup` API in pyspark won't work correctly. This related to a bug discussed in
https://issues.apache.org/jira/browse/SPARK-31549

Note:

This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and as a step to migrate to  `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

- `PYSPARK_PIN_THREAD` is unstable at this moment that affects whole PySpark applications.
- It is impossible to make it runtime configuration as it has to be set before JVM is launched.
- There is a thread leak issue between Python and JVM. We should address but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle this after Spark 3.0 due to stability.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should remove this API out ideally. I will target to deprecate this API in Spark 3.1.

### Why are the changes needed?
Fix bug.

### Does this PR introduce any user-facing change?
A develop API in pyspark: `pyspark.RDD. collectWithJobGroup`

### How was this patch tested?
Unit test.

Closes #28395 from WeichenXu123/collect_with_job_group.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 10:08:16 +09:00
..
__init__.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_appsubmit.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_broadcast.py [SPARK-28486][CORE][PYTHON] Map PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC 2019-08-05 20:18:53 +09:00
test_conf.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_context.py [SPARK-30969][CORE] Remove resource coordination support from Standalone 2020-03-02 11:23:07 -08:00
test_daemon.py [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 2019-08-03 10:31:15 +09:00
test_join.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_pin_thread.py [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's 2019-11-08 06:44:58 +09:00
test_profiler.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_rdd.py [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group 2020-05-01 10:08:16 +09:00
test_rddbarrier.py [SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier 2019-10-23 13:46:09 +02:00
test_readwrite.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_serializers.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_shuffle.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_taskcontext.py [SPARK-30969][CORE] Remove resource coordination support from Standalone 2020-03-02 11:23:07 -08:00
test_util.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_worker.py [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests 2020-04-23 10:20:39 +09:00