## What changes were proposed in this pull request?
This PR is a small follow up that puts some logic and functions into smaller scope and make it localized, and deduplicate.
## How was this patch tested?
Manually tested. Jenkins tests as well.
Closes#23200 from HyukjinKwon/followup-SPARK-26034-SPARK-26033.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Currently, some of PySpark tests sill assume the tests could be ran in Python 2.6 by importing `unittest2`. For instance:
```python
if sys.version_info[:2] <= (2, 6):
try:
import unittest2 as unittest
except ImportError:
sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
sys.exit(1)
else:
import unittest
```
While I am here, I removed some of unused imports and reordered imports per PEP 8.
We officially dropped Python 2.6 support a while ago and started to discuss about Python 2 drop. It's better to remove them out.
## How was this patch tested?
Manually tests, and existing tests via Jenkins.
Closes#23077 from HyukjinKwon/SPARK-26105.
Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain.
The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│ ├── __init__.py
...
│ ├── tests
│ │ ├── __init__.py
│ │ ├── test_algorithms.py
│ │ ├── test_base.py
│ │ ├── test_evaluation.py
│ │ ├── test_feature.py
│ │ ├── test_image.py
│ │ ├── test_linalg.py
│ │ ├── test_param.py
│ │ ├── test_persistence.py
│ │ ├── test_pipeline.py
│ │ ├── test_stat.py
│ │ ├── test_training_summary.py
│ │ ├── test_tuning.py
│ │ └── test_wrapper.py
...
├── testing
...
│ ├── mlutils.py
...
```
## How was this patch tested?
Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all passing with Python 2.7 and Python 3.6.
Closes#23063 from BryanCutler/python-test-breakup-ml-SPARK-26033.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR breaks down the large mllib/tests.py file that contains all Python MLlib unit tests into several smaller test files to be easier to read and maintain.
The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── mllib
│ ├── __init__.py
...
│ ├── tests
│ │ ├── __init__.py
│ │ ├── test_algorithms.py
│ │ ├── test_feature.py
│ │ ├── test_linalg.py
│ │ ├── test_stat.py
│ │ ├── test_streaming_algorithms.py
│ │ └── test_util.py
...
├── testing
...
│ ├── mllibutils.py
...
```
## How was this patch tested?
Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-mllib` to verify all passing with Python 2.7 and Python 3.6. Also installed scipy to include optional tests in test_linalg.
Closes#23056 from BryanCutler/python-test-breakup-mllib-SPARK-26034.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR continues to break down a big large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy.
Basically this PR proposes to break down `pyspark/streaming/tests.py` into ...:
```
pyspark
├── __init__.py
...
├── streaming
│ ├── __init__.py
...
│ ├── tests
│ │ ├── __init__.py
│ │ ├── test_context.py
│ │ ├── test_dstream.py
│ │ ├── test_kinesis.py
│ │ └── test_listener.py
...
├── testing
...
│ ├── streamingutils.py
...
```
## How was this patch tested?
Existing tests should cover.
`cd python` and .`/run-tests-with-coverage`. Manually checked they are actually being ran.
Each test (not officially) can be ran via:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```
Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.
Closes#23034 from HyukjinKwon/SPARK-26035.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR continues to break down a big large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy.
Basically this PR proposes to break down `pyspark/tests.py` into ...:
```
pyspark
...
├── testing
...
│ └── utils.py
├── tests
│ ├── __init__.py
│ ├── test_appsubmit.py
│ ├── test_broadcast.py
│ ├── test_conf.py
│ ├── test_context.py
│ ├── test_daemon.py
│ ├── test_join.py
│ ├── test_profiler.py
│ ├── test_rdd.py
│ ├── test_readwrite.py
│ ├── test_serializers.py
│ ├── test_shuffle.py
│ ├── test_taskcontext.py
│ ├── test_util.py
│ └── test_worker.py
...
```
## How was this patch tested?
Existing tests should cover.
`cd python` and .`/run-tests-with-coverage`. Manually checked they are actually being ran.
Each test (not officially) can be ran via:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```
Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.
Closes#23033 from HyukjinKwon/SPARK-26036.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This is the official first attempt to break huge single `tests.py` file - I did it locally before few times and gave up for some reasons. Now, currently it really makes the unittests super hard to read and difficult to check. To me, it even bothers me to to scroll down the big file. It's one single 7000 lines file!
This is not only readability issue. Since one big test takes most of tests time, the tests don't run in parallel fully - although it will costs to start and stop the context.
We could pick up one example and follow. Given my investigation, the current style looks closer to NumPy structure and looks easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy.
Basically this PR proposes to break down `pyspark/sql/tests.py` into ...:
```bash
pyspark
...
├── sql
...
│ ├── tests # Includes all tests broken down from 'pyspark/sql/tests.py'
│ │ │ # Each matchs to module in 'pyspark/sql'. Additionally, some logical group can
│ │ │ # be added. For instance, 'test_arrow.py', 'test_datasources.py' ...
│ │ ├── __init__.py
│ │ ├── test_appsubmit.py
│ │ ├── test_arrow.py
│ │ ├── test_catalog.py
│ │ ├── test_column.py
│ │ ├── test_conf.py
│ │ ├── test_context.py
│ │ ├── test_dataframe.py
│ │ ├── test_datasources.py
│ │ ├── test_functions.py
│ │ ├── test_group.py
│ │ ├── test_pandas_udf.py
│ │ ├── test_pandas_udf_grouped_agg.py
│ │ ├── test_pandas_udf_grouped_map.py
│ │ ├── test_pandas_udf_scalar.py
│ │ ├── test_pandas_udf_window.py
│ │ ├── test_readwriter.py
│ │ ├── test_serde.py
│ │ ├── test_session.py
│ │ ├── test_streaming.py
│ │ ├── test_types.py
│ │ ├── test_udf.py
│ │ └── test_utils.py
...
├── testing # Includes testing utils that can be used in unittests.
│ ├── __init__.py
│ └── sqlutils.py
...
```
## How was this patch tested?
Existing tests should cover.
`cd python` and `./run-tests-with-coverage`. Manually checked they are actually being ran.
Each test (not officially) can be ran via:
```
SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar
```
Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.
Closes#23021 from HyukjinKwon/SPARK-25344.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>