ee1de66fe4
### What changes were proposed in this pull request?

This PR adds a new API to the PySpark `RDD` class:

`def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)`

This API does the same thing as `rdd.collect`, but lets the caller specify the job group under which the collect runs. The motivation is that the following pattern:

```
sc.setJobGroup("group-id...")
rdd.collect()
```

does not work correctly in PySpark: the `setJobGroup` call does not reliably apply to the subsequent job. This is related to the bug discussed in https://issues.apache.org/jira/browse/SPARK-31549.

Note: this PR is a temporary workaround for `PYSPARK_PIN_THREAD`, and a step toward migrating to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

- `PYSPARK_PIN_THREAD` is currently unstable in a way that affects whole PySpark applications.
- It cannot be made a runtime configuration because it has to be set before the JVM is launched.
- There is a thread leak issue between Python and the JVM. It should be addressed, but it is not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle it after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, this API should ideally be removed. I plan to deprecate it in Spark 3.1.

### Why are the changes needed?

To fix the bug described above.

### Does this PR introduce any user-facing change?

Yes, a developer API in PySpark: `pyspark.RDD.collectWithJobGroup`.

### How was this patch tested?

Unit tests.

Closes #28395 from WeichenXu123/collect_with_job_group.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
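To illustrate why `setJobGroup` followed by `collect` can misbehave, here is a minimal, self-contained sketch. It does **not** use the real PySpark internals: `FakeJVMThreadPool`, `FakeRDD`, and the method names in snake_case are hypothetical stand-ins. The sketch models the underlying issue: `setJobGroup` stores the group in thread-local state on whichever JVM thread happens to serve that Py4J call, while a later `collect` may be served by a different thread, so the group is lost. Carrying the group id with the collect call itself, as `collectWithJobGroup` does, avoids the mismatch.

```python
class FakeJVMThreadPool:
    """Simulates Py4J dispatching each Python call to an arbitrary JVM thread,
    each with its own thread-local storage (hypothetical model, not real Py4J)."""
    def __init__(self, n_threads=2):
        self._locals = [dict() for _ in range(n_threads)]
        self._next = 0

    def call(self, fn):
        # Round-robin: consecutive calls may land on different "threads".
        tls = self._locals[self._next]
        self._next = (self._next + 1) % len(self._locals)
        return fn(tls)

class FakeRDD:
    def __init__(self, data, pool):
        self._data = data
        self._pool = pool

    def set_job_group(self, group_id):
        # Stores the group on whichever "JVM thread" serves this call.
        self._pool.call(lambda tls: tls.__setitem__("job_group", group_id))

    def collect(self):
        # Reads the group from whichever "JVM thread" serves *this* call --
        # often not the one that set_job_group wrote to.
        return self._pool.call(
            lambda tls: (tls.get("job_group"), list(self._data)))

    def collect_with_job_group(self, group_id):
        # The group id travels with the collect call itself, so there is
        # no dependence on which thread serves the call.
        return self._pool.call(lambda tls: (group_id, list(self._data)))

pool = FakeJVMThreadPool()
rdd = FakeRDD([1, 2, 3], pool)

rdd.set_job_group("g1")
lost_group, _ = rdd.collect()  # group is None here: served by the other thread
print("collect() saw group:", lost_group)

group_seen, data = rdd.collect_with_job_group("g1")
print("collect_with_job_group() saw group:", group_seen, "data:", data)
```

The design point this illustrates is why a combined call is a workable stopgap until `PYSPARK_PIN_THREAD` (which pins Python threads to JVM threads) is stable enough to be the default.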