Commit graph

17 commits

Author SHA1 Message Date
Takuya UESHIN 115b8a180f [SPARK-36062][PYTHON] Try to capture faulthanlder when a Python worker crashes
### What changes were proposed in this pull request?

Try to capture the error message from the `faulthandler` when the Python worker crashes.

### Why are the changes needed?

Currently, we just see an error message saying `"exited unexpectedly (crashed)"` when the UDFs causes the Python worker to crash by like segmentation fault.
We should take advantage of [`faulthandler`](https://docs.python.org/3/library/faulthandler.html) and try to capture the error message from the `faulthandler`.

### Does this PR introduce _any_ user-facing change?

Yes, when a Spark config `spark.python.worker.faulthandler.enabled` is `true`, the stack trace will be seen in the error message when the Python worker crashes.

```py
>>> def f():
...   import ctypes
...   ctypes.string_at(0)
...
>>> sc.parallelize([1]).map(lambda x: f()).count()
```

```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x000000010965b5c0 (most recent call first):
  File "/.../ctypes/__init__.py", line 525 in string_at
  File "<stdin>", line 3 in f
  File "<stdin>", line 1 in <lambda>
...
```

### How was this patch tested?

Added some tests, and manually.

Closes #33273 from ueshin/issues/SPARK-36062/faulthandler.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 11:30:39 +09:00
Hyukjin Kwon 20750a3f9e [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception
### What changes were proposed in this pull request?

This PR proposes to use a proper built-in exceptions instead of the plain `Exception` in Python.

While I am here, I fixed another minor issue at `DataFrams.schema` together:

```diff
- except AttributeError as e:
-     raise Exception(
-         "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+     raise ValueError(
+         "Unable to parse datatype from schema. %s" % e) from e
```

Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.

### Why are the changes needed?

For users to expect the proper exceptions.

### Does this PR introduce _any_ user-facing change?

Yeah, the exception classes became different but should be compatible because previous exception was plain `Exception` which other exceptions inherit.

### How was this patch tested?

Existing unittests should cover,

Closes #31238

Closes #32650 from HyukjinKwon/SPARK-32194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 11:54:40 +09:00
Bryan Cutler aeb3649fb9 [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests
### What changes were proposed in this pull request?

This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.

- https://docs.python.org/3/library/unittest.html#deprecated-aliases

### Why are the changes needed?

For consistency and eventual removal of deprecated APIs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 10:34:40 +09:00
lrz 27bb40b629 [SPARK-33339][PYTHON] Pyspark application will hang due to non Exception error
### What changes were proposed in this pull request?

When a system.exit exception occurs during the process, the python worker exits abnormally, and then the executor task is still waiting for the worker for reading from socket, causing it to hang.
The system.exit exception may be caused by the user's error code, but spark should at least throw an error to remind the user, not get stuck
we can run a simple test to reproduce this case:

```
from pyspark.sql import SparkSession
def err(line):
  raise SystemExit
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
spark.stop()
```

### Why are the changes needed?

to make sure pyspark application won't hang if there's non-Exception error in python worker

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added a new test and also manually tested the case above

Closes #30248 from li36909/pyspark.

Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 19:39:18 +09:00
HyukjinKwon e11a24c1ba [SPARK-33371][PYTHON] Update setup.py and tests for Python 3.9
### What changes were proposed in this pull request?

This PR proposes to fix PySpark to officially support Python 3.9. The main codes already work. We should just note that we support Python 3.9.

Also, this PR fixes some minor fixes into the test codes.
- `Thread.isAlive` is removed in Python 3.9, and `Thread.is_alive` exists in Python 3.6+, see https://docs.python.org/3/whatsnew/3.9.html#removed
- Fixed `TaskContextTestsWithWorkerReuse.test_barrier_with_python_worker_reuse` and `TaskContextTests.test_barrier` to be less flaky. This becomes more flaky in Python 3.9 for some reasons.

NOTE that PyArrow does not support Python 3.9 yet.

### Why are the changes needed?

To officially support Python 3.9.

### Does this PR introduce _any_ user-facing change?

Yes, it officially supports Python 3.9.

### How was this patch tested?

Manually ran the tests:

```
$  ./run-tests --python-executable=python
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
python python_implementation is CPython
python version is: Python 3.9.0
Starting test(python): pyspark.ml.tests.test_base
Starting test(python): pyspark.ml.tests.test_evaluation
Starting test(python): pyspark.ml.tests.test_algorithms
Starting test(python): pyspark.ml.tests.test_feature
Finished test(python): pyspark.ml.tests.test_base (12s)
Starting test(python): pyspark.ml.tests.test_image
Finished test(python): pyspark.ml.tests.test_evaluation (15s)
Starting test(python): pyspark.ml.tests.test_linalg
Finished test(python): pyspark.ml.tests.test_feature (25s)
Starting test(python): pyspark.ml.tests.test_param
Finished test(python): pyspark.ml.tests.test_image (17s)
Starting test(python): pyspark.ml.tests.test_persistence
Finished test(python): pyspark.ml.tests.test_param (17s)
Starting test(python): pyspark.ml.tests.test_pipeline
Finished test(python): pyspark.ml.tests.test_linalg (30s)
Starting test(python): pyspark.ml.tests.test_stat
Finished test(python): pyspark.ml.tests.test_pipeline (6s)
Starting test(python): pyspark.ml.tests.test_training_summary
Finished test(python): pyspark.ml.tests.test_stat (12s)
Starting test(python): pyspark.ml.tests.test_tuning
Finished test(python): pyspark.ml.tests.test_algorithms (68s)
Starting test(python): pyspark.ml.tests.test_wrapper
Finished test(python): pyspark.ml.tests.test_persistence (51s)
Starting test(python): pyspark.mllib.tests.test_algorithms
Finished test(python): pyspark.ml.tests.test_training_summary (33s)
Starting test(python): pyspark.mllib.tests.test_feature
Finished test(python): pyspark.ml.tests.test_wrapper (19s)
Starting test(python): pyspark.mllib.tests.test_linalg
Finished test(python): pyspark.mllib.tests.test_feature (26s)
Starting test(python): pyspark.mllib.tests.test_stat
Finished test(python): pyspark.mllib.tests.test_stat (22s)
Starting test(python): pyspark.mllib.tests.test_streaming_algorithms
Finished test(python): pyspark.mllib.tests.test_algorithms (53s)
Starting test(python): pyspark.mllib.tests.test_util
Finished test(python): pyspark.mllib.tests.test_linalg (54s)
Starting test(python): pyspark.sql.tests.test_arrow
Finished test(python): pyspark.sql.tests.test_arrow (0s) ... 61 tests were skipped
Starting test(python): pyspark.sql.tests.test_catalog
Finished test(python): pyspark.mllib.tests.test_util (11s)
Starting test(python): pyspark.sql.tests.test_column
Finished test(python): pyspark.sql.tests.test_catalog (16s)
Starting test(python): pyspark.sql.tests.test_conf
Finished test(python): pyspark.sql.tests.test_column (17s)
Starting test(python): pyspark.sql.tests.test_context
Finished test(python): pyspark.sql.tests.test_context (6s) ... 3 tests were skipped
Starting test(python): pyspark.sql.tests.test_dataframe
Finished test(python): pyspark.sql.tests.test_conf (11s)
Starting test(python): pyspark.sql.tests.test_datasources
Finished test(python): pyspark.sql.tests.test_datasources (19s)
Starting test(python): pyspark.sql.tests.test_functions
Finished test(python): pyspark.sql.tests.test_dataframe (35s) ... 3 tests were skipped
Starting test(python): pyspark.sql.tests.test_group
Finished test(python): pyspark.sql.tests.test_functions (32s)
Starting test(python): pyspark.sql.tests.test_pandas_cogrouped_map
Finished test(python): pyspark.sql.tests.test_pandas_cogrouped_map (1s) ... 15 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_grouped_map
Finished test(python): pyspark.sql.tests.test_group (19s)
Starting test(python): pyspark.sql.tests.test_pandas_map
Finished test(python): pyspark.sql.tests.test_pandas_grouped_map (0s) ... 21 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf
Finished test(python): pyspark.sql.tests.test_pandas_map (0s) ... 6 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg
Finished test(python): pyspark.sql.tests.test_pandas_udf (0s) ... 6 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_scalar
Finished test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg (0s) ... 13 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_typehints
Finished test(python): pyspark.sql.tests.test_pandas_udf_scalar (0s) ... 50 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_window
Finished test(python): pyspark.sql.tests.test_pandas_udf_typehints (0s) ... 10 tests were skipped
Starting test(python): pyspark.sql.tests.test_readwriter
Finished test(python): pyspark.sql.tests.test_pandas_udf_window (0s) ... 14 tests were skipped
Starting test(python): pyspark.sql.tests.test_serde
Finished test(python): pyspark.sql.tests.test_serde (19s)
Starting test(python): pyspark.sql.tests.test_session
Finished test(python): pyspark.mllib.tests.test_streaming_algorithms (120s)
Starting test(python): pyspark.sql.tests.test_streaming
Finished test(python): pyspark.sql.tests.test_readwriter (25s)
Starting test(python): pyspark.sql.tests.test_types
Finished test(python): pyspark.ml.tests.test_tuning (208s)
Starting test(python): pyspark.sql.tests.test_udf
Finished test(python): pyspark.sql.tests.test_session (31s)
Starting test(python): pyspark.sql.tests.test_utils
Finished test(python): pyspark.sql.tests.test_streaming (35s)
Starting test(python): pyspark.streaming.tests.test_context
Finished test(python): pyspark.sql.tests.test_types (34s)
Starting test(python): pyspark.streaming.tests.test_dstream
Finished test(python): pyspark.sql.tests.test_utils (14s)
Starting test(python): pyspark.streaming.tests.test_kinesis
Finished test(python): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
Starting test(python): pyspark.streaming.tests.test_listener
Finished test(python): pyspark.streaming.tests.test_listener (11s)
Starting test(python): pyspark.tests.test_appsubmit
Finished test(python): pyspark.sql.tests.test_udf (39s)
Starting test(python): pyspark.tests.test_broadcast
Finished test(python): pyspark.streaming.tests.test_context (23s)
Starting test(python): pyspark.tests.test_conf
Finished test(python): pyspark.tests.test_conf (15s)
Starting test(python): pyspark.tests.test_context
Finished test(python): pyspark.tests.test_broadcast (33s)
Starting test(python): pyspark.tests.test_daemon
Finished test(python): pyspark.tests.test_daemon (5s)
Starting test(python): pyspark.tests.test_install_spark
Finished test(python): pyspark.tests.test_context (44s)
Starting test(python): pyspark.tests.test_join
Finished test(python): pyspark.tests.test_appsubmit (68s)
Starting test(python): pyspark.tests.test_profiler
Finished test(python): pyspark.tests.test_join (7s)
Starting test(python): pyspark.tests.test_rdd
Finished test(python): pyspark.tests.test_profiler (9s)
Starting test(python): pyspark.tests.test_rddbarrier
Finished test(python): pyspark.tests.test_rddbarrier (7s)
Starting test(python): pyspark.tests.test_readwrite
Finished test(python): pyspark.streaming.tests.test_dstream (107s)
Starting test(python): pyspark.tests.test_serializers
Finished test(python): pyspark.tests.test_serializers (8s)
Starting test(python): pyspark.tests.test_shuffle
Finished test(python): pyspark.tests.test_readwrite (14s)
Starting test(python): pyspark.tests.test_taskcontext
Finished test(python): pyspark.tests.test_install_spark (65s)
Starting test(python): pyspark.tests.test_util
Finished test(python): pyspark.tests.test_shuffle (8s)
Starting test(python): pyspark.tests.test_worker
Finished test(python): pyspark.tests.test_util (5s)
Starting test(python): pyspark.accumulators
Finished test(python): pyspark.accumulators (5s)
Starting test(python): pyspark.broadcast
Finished test(python): pyspark.broadcast (6s)
Starting test(python): pyspark.conf
Finished test(python): pyspark.tests.test_worker (14s)
Starting test(python): pyspark.context
Finished test(python): pyspark.conf (4s)
Starting test(python): pyspark.ml.classification
Finished test(python): pyspark.tests.test_rdd (60s)
Starting test(python): pyspark.ml.clustering
Finished test(python): pyspark.context (21s)
Starting test(python): pyspark.ml.evaluation
Finished test(python): pyspark.tests.test_taskcontext (69s)
Starting test(python): pyspark.ml.feature
Finished test(python): pyspark.ml.evaluation (26s)
Starting test(python): pyspark.ml.fpm
Finished test(python): pyspark.ml.clustering (45s)
Starting test(python): pyspark.ml.functions
Finished test(python): pyspark.ml.fpm (24s)
Starting test(python): pyspark.ml.image
Finished test(python): pyspark.ml.functions (17s)
Starting test(python): pyspark.ml.linalg.__init__
Finished test(python): pyspark.ml.linalg.__init__ (0s)
Starting test(python): pyspark.ml.recommendation
Finished test(python): pyspark.ml.classification (74s)
Starting test(python): pyspark.ml.regression
Finished test(python): pyspark.ml.image (8s)
Starting test(python): pyspark.ml.stat
Finished test(python): pyspark.ml.stat (29s)
Starting test(python): pyspark.ml.tuning
Finished test(python): pyspark.ml.regression (53s)
Starting test(python): pyspark.mllib.classification
Finished test(python): pyspark.ml.tuning (35s)
Starting test(python): pyspark.mllib.clustering
Finished test(python): pyspark.ml.feature (103s)
Starting test(python): pyspark.mllib.evaluation
Finished test(python): pyspark.mllib.classification (33s)
Starting test(python): pyspark.mllib.feature
Finished test(python): pyspark.mllib.evaluation (21s)
Starting test(python): pyspark.mllib.fpm
Finished test(python): pyspark.ml.recommendation (103s)
Starting test(python): pyspark.mllib.linalg.__init__
Finished test(python): pyspark.mllib.linalg.__init__ (1s)
Starting test(python): pyspark.mllib.linalg.distributed
Finished test(python): pyspark.mllib.feature (26s)
Starting test(python): pyspark.mllib.random
Finished test(python): pyspark.mllib.fpm (23s)
Starting test(python): pyspark.mllib.recommendation
Finished test(python): pyspark.mllib.clustering (50s)
Starting test(python): pyspark.mllib.regression
Finished test(python): pyspark.mllib.random (13s)
Starting test(python): pyspark.mllib.stat.KernelDensity
Finished test(python): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python): pyspark.mllib.stat._statistics
Finished test(python): pyspark.mllib.linalg.distributed (42s)
Starting test(python): pyspark.mllib.tree
Finished test(python): pyspark.mllib.stat._statistics (19s)
Starting test(python): pyspark.mllib.util
Finished test(python): pyspark.mllib.regression (33s)
Starting test(python): pyspark.profiler
Finished test(python): pyspark.mllib.recommendation (36s)
Starting test(python): pyspark.rdd
Finished test(python): pyspark.profiler (9s)
Starting test(python): pyspark.resource.tests.test_resources
Finished test(python): pyspark.mllib.tree (19s)
Starting test(python): pyspark.serializers
Finished test(python): pyspark.mllib.util (21s)
Starting test(python): pyspark.shuffle
Finished test(python): pyspark.resource.tests.test_resources (9s)
Starting test(python): pyspark.sql.avro.functions
Finished test(python): pyspark.shuffle (1s)
Starting test(python): pyspark.sql.catalog
Finished test(python): pyspark.rdd (22s)
Starting test(python): pyspark.sql.column
Finished test(python): pyspark.serializers (12s)
Starting test(python): pyspark.sql.conf
Finished test(python): pyspark.sql.conf (6s)
Starting test(python): pyspark.sql.context
Finished test(python): pyspark.sql.catalog (14s)
Starting test(python): pyspark.sql.dataframe
Finished test(python): pyspark.sql.avro.functions (15s)
Starting test(python): pyspark.sql.functions
Finished test(python): pyspark.sql.column (24s)
Starting test(python): pyspark.sql.group
Finished test(python): pyspark.sql.context (20s)
Starting test(python): pyspark.sql.pandas.conversion
Finished test(python): pyspark.sql.pandas.conversion (13s)
Starting test(python): pyspark.sql.pandas.group_ops
Finished test(python): pyspark.sql.group (36s)
Starting test(python): pyspark.sql.pandas.map_ops
Finished test(python): pyspark.sql.pandas.group_ops (21s)
Starting test(python): pyspark.sql.pandas.serializers
Finished test(python): pyspark.sql.pandas.serializers (0s)
Starting test(python): pyspark.sql.pandas.typehints
Finished test(python): pyspark.sql.pandas.typehints (0s)
Starting test(python): pyspark.sql.pandas.types
Finished test(python): pyspark.sql.pandas.types (0s)
Starting test(python): pyspark.sql.pandas.utils
Finished test(python): pyspark.sql.pandas.utils (0s)
Starting test(python): pyspark.sql.readwriter
Finished test(python): pyspark.sql.dataframe (56s)
Starting test(python): pyspark.sql.session
Finished test(python): pyspark.sql.functions (57s)
Starting test(python): pyspark.sql.streaming
Finished test(python): pyspark.sql.pandas.map_ops (12s)
Starting test(python): pyspark.sql.types
Finished test(python): pyspark.sql.types (10s)
Starting test(python): pyspark.sql.udf
Finished test(python): pyspark.sql.streaming (16s)
Starting test(python): pyspark.sql.window
Finished test(python): pyspark.sql.session (19s)
Starting test(python): pyspark.streaming.util
Finished test(python): pyspark.streaming.util (0s)
Starting test(python): pyspark.util
Finished test(python): pyspark.util (0s)
Finished test(python): pyspark.sql.readwriter (24s)
Finished test(python): pyspark.sql.udf (13s)
Finished test(python): pyspark.sql.window (14s)
Tests passed in 780 seconds

```

Closes #30277 from HyukjinKwon/SPARK-33371.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-06 15:05:37 -08:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
Fokko Driesprong 9fcf0ea718 [SPARK-32319][PYSPARK] Disallow the use of unused imports
Disallow the use of unused imports:

- Unnecessary increases the memory footprint of the application
- Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :)

```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```

After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```

### What changes were proposed in this pull request?

Removing unused imports from the Python files to keep everything nice and tidy.

### Why are the changes needed?

Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Adding the rule to the existing Flake8 checks.

Closes #29121 from Fokko/SPARK-32319.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-08 08:51:57 -07:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Thomas Graves 95aec091e4 [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests
### What changes were proposed in this pull request?

As part of the Stage level scheduling features, add the Python api's to set resource profiles.
This also adds the functionality to properly apply the pyspark memory configuration when specified in the ResourceProfile. The pyspark memory configuration is being passed in the task local properties. This was an easy way to get it to the PythonRunner that needs it. I modeled this off how the barrier task scheduling is passing the addresses. As part of this I added in the JavaRDD api's because those are needed by python.

### Why are the changes needed?

python api for this feature

### Does this PR introduce any user-facing change?

Yes adds the java and python apis for user to specify a ResourceProfile to use stage level scheduling.

### How was this patch tested?

unit tests and manually tested on yarn. Tests also run to verify it errors properly on standalone and local mode where its not yet supported.

Closes #28085 from tgravescs/SPARK-29641-pr-base.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 10:20:39 +09:00
HyukjinKwon 0823aec463 [SPARK-30480][PYTHON][TESTS] Increases the memory limit being tested in 'WorkerMemoryTest.test_memory_limit'
### What changes were proposed in this pull request?

This PR proposes to increase the memory in `WorkerMemoryTest.test_memory_limit` in order to make the test pass with PyPy.

The test is currently failed only in PyPy as below in some PRs unexpectedly:

```
Current mem limits: 18446744073709551615 of max 18446744073709551615

Setting mem limits to 1048576 of max 1048576

RPython traceback:
  File "pypy_module_pypyjit_interp_jit.c", line 289, in portal_5
  File "pypy_interpreter_pyopcode.c", line 3468, in handle_bytecode__AccessDirect_None
  File "pypy_interpreter_pyopcode.c", line 5558, in dispatch_bytecode__AccessDirect_None
out of memory: couldn't allocate the next arena
ERROR
```

It seems related to how PyPy allocates the memory and GC works PyPy-specifically. There seems nothing wrong in this configuration implementation itself in PySpark side.

I roughly tested in higher PyPy versions on Ubuntu (PyPy v7.3.0) and this test seems passing fine so I suspect this might be an issue in old PyPy behaviours.

The change only increases the limit so it would not affect actual memory allocations. It just needs to test if the limit is properly set in worker sides. For clarification, the memory is unlimited in the machine if not set.

### Why are the changes needed?

To make the tests pass and unblock other PRs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually and Jenkins should test it out.

Closes #27186 from HyukjinKwon/SPARK-30480.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-13 18:47:15 +09:00
HyukjinKwon d0983af38f Revert "[SPARK-30480][PYSPARK][TESTS] Fix 'test_memory_limit' on pyspark test"
This reverts commit afd70a0f6f.
2020-01-10 22:35:54 +09:00
Jungtaek Lim (HeartSaVioR) afd70a0f6f [SPARK-30480][PYSPARK][TESTS] Fix 'test_memory_limit' on pyspark test
### What changes were proposed in this pull request?

This patch increases the memory limit in the test 'test_memory_limit' from 1m to 8m.
Credit to srowen and HyukjinKwon to provide the idea of suspicion and guide how to fix.

### Why are the changes needed?

We observed consistent Pyspark test failures on multiple PRs (#26955, #26201, #27064) which block the PR builds whenever the test is included.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins builds passed in WIP PR (#27159)

Closes #27162 from HeartSaVioR/SPARK-30480.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-10 15:30:54 +09:00
Xianjin YE 8c8016a152 [SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2
### What changes were proposed in this pull request?

This PR allows non-ascii string as an exception message in Python 2 by explicitly en/decoding in case of `str` in Python 2.

### Why are the changes needed?

Previously PySpark will hang when the `UnicodeDecodeError` occurs and the real exception cannot be passed to the JVM side.

See the reproducer as below:

```python
def f():
    raise Exception("中")
spark = SparkSession.builder.master('local').getOrCreate()
spark.sparkContext.parallelize([1]).map(lambda x: f()).count()
```

### Does this PR introduce any user-facing change?

User may not observe hanging for the similar cases.

### How was this patch tested?

Added a new test and manually checking.

This pr is based on #18324, credits should also go to dataknocker.
To make lint-python happy for python3, it also includes a followup fix for #25814

Closes #25847 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-21 08:09:19 +09:00
HyukjinKwon 7c05f61514 [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark
## What changes were proposed in this pull request?

Currently, pretty skipped message added by f7435bec6a mechanism seems not working when xmlrunner is installed apparently.

This PR fixes two things:

1. When `xmlrunner` is installed, seems `xmlrunner` does not respect `vervosity` level in unittests (default is level 1).

    So the output looks as below

    ```
    Running tests...
     ----------------------------------------------------------------------
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
    ----------------------------------------------------------------------
    ```

    So it is not caught by our message detection mechanism.

2. If we manually set the `vervocity` level to `xmlrunner`, it prints messages as below:

    ```
    test_mixed_udf (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s)
    test_mixed_udf_and_sql (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s)
    ...
    ```

    This is different in our Jenkins machine:

    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.'
    ...
    ```

    Note that last `SKIP` is different. This PR fixes the regular expression to catch `SKIP` case as well.

## How was this patch tested?

Manually tested.

**Before:**

```
Starting test(python2.7): pyspark....
Finished test(python2.7): pyspark.... (0s)
...
Tests passed in 562 seconds

========================================================================
...
```

**After:**

```
Starting test(python2.7): pyspark....
Finished test(python2.7): pyspark.... (48s) ... 93 tests were skipped
...
Tests passed in 560 seconds

Skipped tests pyspark.... with python2.7:
      pyspark...(...) ... SKIP (0.000s)
...

========================================================================
...
```

Closes #24927 from HyukjinKwon/SPARK-28130.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-06-24 09:58:17 +09:00
Hyukjin Kwon 3a17c6a06b [SPARK-26743][PYTHON] Adds a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/21977 added a feature to limit Python worker resource limit.
This PR is kind of a followup of it. It proposes to add a test that checks the actual resource limit set by 'spark.executor.pyspark.memory'.

## How was this patch tested?

Unit tests were added.

Closes #23663 from HyukjinKwon/test_rlimit.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-28 10:02:27 +08:00
Yuanjian Li dbbba80b3c [SPARK-26549][PYSPARK] Fix for python worker reuse take no effect for parallelize lazy iterable range
## What changes were proposed in this pull request?

During the follow-up work(#23435) for PySpark worker reuse scenario, we found that the worker reuse takes no effect for `sc.parallelize(xrange(...))`. It happened because of the specialize rdd.parallelize logic for xrange(introduced in #3264) generated data by lazy iterable range, which don't need to use the passed-in iterator. But this will break the end of stream checking in python worker and finally cause worker reuse takes no effect. See more details in [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description.

We fix this by force using the passed-in iterator.

## How was this patch tested?
New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-09 11:55:12 +08:00
hyukjinkwon 03306a6df3 [SPARK-26036][PYTHON] Break large tests.py files into smaller files
## What changes were proposed in this pull request?

This PR continues to break down a big large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/tests.py` into ...:

```
pyspark
...
├── testing
...
│   └── utils.py
├── tests
│   ├── __init__.py
│   ├── test_appsubmit.py
│   ├── test_broadcast.py
│   ├── test_conf.py
│   ├── test_context.py
│   ├── test_daemon.py
│   ├── test_join.py
│   ├── test_profiler.py
│   ├── test_rdd.py
│   ├── test_readwrite.py
│   ├── test_serializers.py
│   ├── test_shuffle.py
│   ├── test_taskcontext.py
│   ├── test_util.py
│   └── test_worker.py
...
```

## How was this patch tested?

Existing tests should cover.

`cd python` and .`/run-tests-with-coverage`. Manually checked they are actually being ran.

Each test (not officially) can be ran via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

Closes #23033 from HyukjinKwon/SPARK-26036.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-15 12:30:52 +08:00