[SPARK-26252][PYTHON] Add support to run specific unittests and/or doctests in python/run-tests script

## What changes were proposed in this pull request?

This PR proposes adding a developer option, `--testnames`, to our testing script to allow running a specific set of unittests and doctests.

**1. Run unittests in a class**

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'
```
```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow ArrowTests']
Starting test(python2.7): pyspark.sql.tests.test_arrow ArrowTests
Starting test(pypy): pyspark.sql.tests.test_arrow ArrowTests
Finished test(python2.7): pyspark.sql.tests.test_arrow ArrowTests (14s)
Finished test(pypy): pyspark.sql.tests.test_arrow ArrowTests (14s) ... 22 tests were skipped
Tests passed in 14 seconds

Skipped tests in pyspark.sql.tests.test_arrow ArrowTests with pypy:
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_fallback_disabled (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_fallback_enabled (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped
...
```

**2. Run a single unittest in a class**

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion'
```
```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion']
Starting test(pypy): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion
Starting test(python2.7): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion
Finished test(pypy): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion (0s) ... 1 tests were skipped
Finished test(python2.7): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion (8s)
Tests passed in 8 seconds

Skipped tests in pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion with pypy:
    test_null_conversion (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

**3. Run doctests in a single PySpark module**

```bash
./run-tests --testnames pyspark.sql.dataframe
```

```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.dataframe']
Starting test(pypy): pyspark.sql.dataframe
Starting test(python2.7): pyspark.sql.dataframe
Finished test(python2.7): pyspark.sql.dataframe (47s)
Finished test(pypy): pyspark.sql.dataframe (48s)
Tests passed in 48 seconds
```

Of course, you can mix them:

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests,pyspark.sql.dataframe'
```

```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow ArrowTests', 'pyspark.sql.dataframe']
Starting test(pypy): pyspark.sql.dataframe
Starting test(pypy): pyspark.sql.tests.test_arrow ArrowTests
Starting test(python2.7): pyspark.sql.dataframe
Starting test(python2.7): pyspark.sql.tests.test_arrow ArrowTests
Finished test(pypy): pyspark.sql.tests.test_arrow ArrowTests (0s) ... 22 tests were skipped
Finished test(python2.7): pyspark.sql.tests.test_arrow ArrowTests (18s)
Finished test(python2.7): pyspark.sql.dataframe (50s)
Finished test(pypy): pyspark.sql.dataframe (52s)
Tests passed in 52 seconds

Skipped tests in pyspark.sql.tests.test_arrow ArrowTests with pypy:
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_fallback_disabled (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

You can also use all the other options together with it (except `--modules`, which is ignored):

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion' --python-executables=python
```

```
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion']
Starting test(python): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion
Finished test(python): pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion (12s)
Tests passed in 12 seconds
```

See help below:

```bash
./run-tests --help
```

```
Usage: run-tests [options]

Options:
...
  Developer Options:
    --testnames=TESTNAMES
                        A comma-separated list of specific modules, classes
                        and functions of doctest or unittest to test. For
                        example, 'pyspark.sql.foo' to run the module as
                        unittests or doctests, 'pyspark.sql.tests FooTests' to
                        run the specific class of unittests,
                        'pyspark.sql.tests FooTests.test_foo' to run the
                        specific unittest in the class. '--modules' option is
                        ignored if they are given.
```

I intentionally grouped it as a developer option to be more conservative.
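
For reference, each `--testnames` entry is handed to `bin/pyspark` after being split on whitespace, so a class or method selector such as `ArrowTests.test_null_conversion` arrives as its own argument (see the `test_name.split()` change in the diff below). Here is a minimal sketch of roughly what happens per test goal, assuming `SPARK_HOME` is set; `launch_test` is a hypothetical helper for illustration, not the actual function in `run-tests.py`, and the environment handling is simplified:

```python
import os
import subprocess

SPARK_HOME = os.environ.get("SPARK_HOME", ".")

def launch_test(test_name):
    # A '--testnames' entry like 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion'
    # is split into ['pyspark.sql.tests.test_arrow', 'ArrowTests.test_null_conversion'],
    # so the class/method selector is passed to the test module as a separate argument.
    cmd = [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split()
    # SPARK_TESTING=1 switches bin/pyspark into test mode, as in the removed
    # 'SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests' hint below.
    env = dict(os.environ, SPARK_TESTING="1")
    return subprocess.call(cmd, env=env)
```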

## How was this patch tested?

Manually tested. Negative cases were also tested:

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion1' --python-executables=python
```

```
...
AttributeError: type object 'ArrowTests' has no attribute 'test_null_conversion1'
...
```

```bash
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowT' --python-executables=python
```

```
...
AttributeError: 'module' object has no attribute 'ArrowT'
...
```

```bash
./run-tests --testnames 'pyspark.sql.tests.test_ar' --python-executables=python
```
```
...
/.../python2.7: No module named pyspark.sql.tests.test_ar
```

Closes #23203 from HyukjinKwon/SPARK-26252.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

@@ -50,8 +50,6 @@ export SPARK_CONF_DIR="$COVERAGE_DIR/conf"
 # This environment variable enables the coverage.
 export COVERAGE_PROCESS_START="$FWDIR/.coveragerc"
-# If you'd like to run a specific unittest class, you could do such as
-# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
 ./run-tests "$@"
 # Don't run coverage for the coverage command itself

@@ -19,7 +19,7 @@
 from __future__ import print_function
 import logging
-from optparse import OptionParser
+from optparse import OptionParser, OptionGroup
 import os
 import re
 import shutil
@@ -99,7 +99,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
     try:
         per_test_output = tempfile.TemporaryFile()
         retcode = subprocess.Popen(
-            [os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
+            [os.path.join(SPARK_HOME, "bin/pyspark")] + test_name.split(),
             stderr=per_test_output, stdout=per_test_output, env=env).wait()
         shutil.rmtree(tmp_dir, ignore_errors=True)
     except:
@@ -190,6 +190,20 @@ def parse_opts():
         help="Enable additional debug logging"
     )
+    group = OptionGroup(parser, "Developer Options")
+    group.add_option(
+        "--testnames", type="string",
+        default=None,
+        help=(
+            "A comma-separated list of specific modules, classes and functions of doctest "
+            "or unittest to test. "
+            "For example, 'pyspark.sql.foo' to run the module as unittests or doctests, "
+            "'pyspark.sql.tests FooTests' to run the specific class of unittests, "
+            "'pyspark.sql.tests FooTests.test_foo' to run the specific unittest in the class. "
+            "'--modules' option is ignored if they are given.")
+    )
+    parser.add_option_group(group)
     (opts, args) = parser.parse_args()
     if args:
         parser.error("Unsupported arguments: %s" % ' '.join(args))
@@ -213,25 +227,31 @@ def _check_coverage(python_exec):
 def main():
     opts = parse_opts()
-    if (opts.verbose):
+    if opts.verbose:
         log_level = logging.DEBUG
     else:
         log_level = logging.INFO
+    should_test_modules = opts.testnames is None
     logging.basicConfig(stream=sys.stdout, level=log_level, format="%(message)s")
     LOGGER.info("Running PySpark tests. Output is in %s", LOG_FILE)
     if os.path.exists(LOG_FILE):
         os.remove(LOG_FILE)
     python_execs = opts.python_executables.split(',')
-    modules_to_test = []
-    for module_name in opts.modules.split(','):
-        if module_name in python_modules:
-            modules_to_test.append(python_modules[module_name])
-        else:
-            print("Error: unrecognized module '%s'. Supported modules: %s" %
-                  (module_name, ", ".join(python_modules)))
-            sys.exit(-1)
     LOGGER.info("Will test against the following Python executables: %s", python_execs)
-    LOGGER.info("Will test the following Python modules: %s", [x.name for x in modules_to_test])
+    if should_test_modules:
+        modules_to_test = []
+        for module_name in opts.modules.split(','):
+            if module_name in python_modules:
+                modules_to_test.append(python_modules[module_name])
+            else:
+                print("Error: unrecognized module '%s'. Supported modules: %s" %
+                      (module_name, ", ".join(python_modules)))
+                sys.exit(-1)
+        LOGGER.info("Will test the following Python modules: %s", [x.name for x in modules_to_test])
+    else:
+        testnames_to_test = opts.testnames.split(',')
+        LOGGER.info("Will test the following Python tests: %s", testnames_to_test)
     task_queue = Queue.PriorityQueue()
     for python_exec in python_execs:
@@ -246,16 +266,20 @@ def main():
         LOGGER.debug("%s python_implementation is %s", python_exec, python_implementation)
         LOGGER.debug("%s version is: %s", python_exec, subprocess_check_output(
             [python_exec, "--version"], stderr=subprocess.STDOUT, universal_newlines=True).strip())
-        for module in modules_to_test:
-            if python_implementation not in module.blacklisted_python_implementations:
-                for test_goal in module.python_test_goals:
-                    heavy_tests = ['pyspark.streaming.tests', 'pyspark.mllib.tests',
-                                   'pyspark.tests', 'pyspark.sql.tests', 'pyspark.ml.tests']
-                    if any(map(lambda prefix: test_goal.startswith(prefix), heavy_tests)):
-                        priority = 0
-                    else:
-                        priority = 100
-                    task_queue.put((priority, (python_exec, test_goal)))
+        if should_test_modules:
+            for module in modules_to_test:
+                if python_implementation not in module.blacklisted_python_implementations:
+                    for test_goal in module.python_test_goals:
+                        heavy_tests = ['pyspark.streaming.tests', 'pyspark.mllib.tests',
+                                       'pyspark.tests', 'pyspark.sql.tests', 'pyspark.ml.tests']
+                        if any(map(lambda prefix: test_goal.startswith(prefix), heavy_tests)):
+                            priority = 0
+                        else:
+                            priority = 100
+                        task_queue.put((priority, (python_exec, test_goal)))
+        else:
+            for test_goal in testnames_to_test:
+                task_queue.put((0, (python_exec, test_goal)))
     # Create the target directory before starting tasks to avoid races.
     target_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'target'))