## What changes were proposed in this pull request?
### Background
The test script that generates coverage information has already been merged
into Spark (https://github.com/apache/spark/pull/20204),
so the coverage report and site can be generated by, for example:
```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```
This works like the `run-tests` script in `./python`.
### Proposed change
The next step is to host this coverage report via `github.io` automatically
by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).
This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.
In short, this PR aims to run the coverage in
[spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).
In that specific job, it clones the site repository and overwrites its contents with the up-to-date PySpark test coverage from the latest commit, for instance as below:
```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git
# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*
# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/
# Move into the cloned repository and point HEAD at a temporary branch.
cd pyspark-coverage-site
git symbolic-ref HEAD refs/heads/latest_branch
# Add all the files.
git add -A
# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"
# Delete the old branch.
git branch -D gh-pages
# Rename the temporary branch to gh-pages.
git branch -m gh-pages
# Finally, force update to our repository.
git push -f origin gh-pages
```
This way, a single up-to-date coverage report is shown on the `github.io` page. The commands above were manually tested.
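The same sequence could also be driven from a Python script on the Jenkins side. The following is only a minimal sketch of that idea, not the code in this PR: the function names `coverage_publish_steps` and `publish` are hypothetical, and the sketch assumes the git steps are meant to run inside the cloned `pyspark-coverage-site` directory.

```python
import shlex
import subprocess

SITE_DIR = "pyspark-coverage-site"
SITE_URL = "https://github.com/spark-test/pyspark-coverage-site.git"


def coverage_publish_steps(htmlcov_dir):
    """Return (shell command, working directory) pairs, in execution order."""
    src = shlex.quote(htmlcov_dir.rstrip("/"))
    return [
        # Clone the site, drop the old HTML files, copy in the new ones.
        ("git clone " + SITE_URL, None),
        ("rm -fr %s/*" % SITE_DIR, None),
        ("cp -r %s/* %s/" % (src, SITE_DIR), None),
        # The remaining steps run inside the clone (an assumption of this sketch).
        ("git symbolic-ref HEAD refs/heads/latest_branch", SITE_DIR),
        ("git add -A", SITE_DIR),
        ('git commit -am "Coverage report at latest commit in Apache Spark"', SITE_DIR),
        ("git branch -D gh-pages", SITE_DIR),
        ("git branch -m gh-pages", SITE_DIR),
        ("git push -f origin gh-pages", SITE_DIR),
    ]


def publish(htmlcov_dir, dry_run=True):
    """Print the steps, or execute them when dry_run is False."""
    for command, cwd in coverage_publish_steps(htmlcov_dir):
        if dry_run:
            print("[%s] %s" % (cwd or ".", command))
        else:
            # check=True stops at the first failing step.
            subprocess.run(command, shell=True, cwd=cwd, check=True)
```

Calling `publish(...)` with the directory that holds the generated HTML prints the nine steps without touching any repository; passing `dry_run=False` would execute them.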
### TODOs
- [x] Write a draft - HyukjinKwon
- [x] `pip install coverage` for all Python implementations (pypy, python2, python3) on Jenkins workers - shaneknapp
- [x] Set a hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature.
  This should be set in both the PR builder and `spark-master-test-sbt-hadoop-2.7` so that other PRs can later test and fix bugs - shaneknapp
- [x] Set an environment variable that identifies `spark-master-test-sbt-hadoop-2.7` so that this specific build can report and update the coverage site - shaneknapp
- [x] Make the PR builder's tests pass - HyukjinKwon
- [x] Fix flaky tests related to coverage - HyukjinKwon
  - 6 consecutive passes out of 7 runs
This PR will be co-authored by me and shaneknapp.
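The two Jenkins-related TODO items (a hidden `SPARK_TEST_KEY` and an environment variable marking the target job) suggest a simple gate around the publishing step. The snippet below is a hedged sketch only: the marker variable name `COVERAGE_SITE_JOB` is hypothetical (the description does not name it), and embedding the credential in an HTTPS remote URL is an assumption about how the key might be used.

```python
from urllib.parse import quote

# Hypothetical name: the description only says "an environment variable".
JOB_MARKER_VAR = "COVERAGE_SITE_JOB"


def should_publish_coverage(env):
    """Publish only in the marked job, and only when the key is available."""
    return bool(env.get(JOB_MARKER_VAR)) and bool(env.get("SPARK_TEST_KEY"))


def authenticated_push_url(env, user="spark-test", repo="pyspark-coverage-site"):
    """Build an HTTPS remote URL carrying the credential (assumed usage)."""
    # Percent-encode the key so characters such as '@' or '/' stay valid in a URL.
    key = quote(env["SPARK_TEST_KEY"], safe="")
    return "https://%s:%s@github.com/%s/%s.git" % (user, key, user, repo)
```

In a real script `env` would be `os.environ`; taking it as a parameter keeps the check easy to test and makes it obvious that PR builds without the marker variable skip publishing.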
## How was this patch tested?
It will be tested via Jenkins.
Closes #23117 from HyukjinKwon/SPARK-7721.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR continues to break a large file down into smaller files. See https://github.com/apache/spark/pull/23021. It aims to follow the layout of https://github.com/numpy/numpy/tree/master/numpy.
Specifically, this PR proposes to break `pyspark/streaming/tests.py` down into:
```
pyspark
├── __init__.py
...
├── streaming
│ ├── __init__.py
...
│ ├── tests
│ │ ├── __init__.py
│ │ ├── test_context.py
│ │ ├── test_dstream.py
│ │ ├── test_kinesis.py
│ │ └── test_listener.py
...
├── testing
...
│ ├── streamingutils.py
...
```
## How was this patch tested?
Existing tests should cover this.
Ran `cd python` and `./run-tests-with-coverage`, and manually checked that the tests are actually being run.
Each test can (unofficially) be run via:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.streaming.tests.test_context
```
Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.
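For repeated runs, the invocation and the macOS workaround can be wrapped in a small helper. This is a sketch under stated assumptions, not part of the patch: the helper names are hypothetical, and the module name in the usage note is derived from the directory tree above.

```python
import os
import subprocess
import sys


def build_test_invocation(module, platform=sys.platform,
                          python_major=sys.version_info[0]):
    """Return (extra environment variables, argv) for one test module."""
    env = {"SPARK_TESTING": "1"}
    # The fork-safety workaround is only relevant on macOS with Python 3.
    if platform == "darwin" and python_major >= 3:
        env["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
    return env, ["./bin/pyspark", module]


def run_test_module(module, spark_home="."):
    """Run the module from the Spark checkout and return its exit code."""
    extra_env, argv = build_test_invocation(module)
    # Layer the extra variables on top of the current environment.
    env = dict(os.environ, **extra_env)
    return subprocess.call(argv, cwd=spark_home, env=env)
```

For example, `run_test_module("pyspark.streaming.tests.test_dstream")` would run one of the split-out modules, assuming the current directory is the root of a built Spark checkout.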
Closes #23034 from HyukjinKwon/SPARK-26035.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>