spark-instrumented-optimizer/python/pyspark/mllib
Huaxin Gao 75c2c53e93 [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests
### What changes were proposed in this pull request?
The test creates 10 batches of data to train the model and expects the error on the test data to improve as the model is trained. If the difference between the 2nd error and the 10th error does not exceed 2, the assertion fails:
```
FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction
    eventually(condition, timeout=180.0)
  File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually
    lastValue = condition()
  File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2
```
I have seen this quite a few times on Jenkins but was not able to reproduce it locally. These are the ten errors I got:
```
4.517395047937127
4.894265404350079
3.0392090466559876
1.8786361640757654
0.8973106042078115
0.3715780507684368
0.20815690742907672
0.17333033743125845
0.15686783249863873
0.12584413600569616
```
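
To make the failing check concrete, here is a minimal sketch (not the actual test code) that applies the assertion from `test_train_prediction` to the ten errors above; in the real test the errors are computed from predictions on the test data while the model trains.

```python
# A minimal sketch of the check, using the ten locally observed errors above.
# In the real test these values are computed per batch while the streaming
# model trains; here they are hard-coded for illustration only.
errors = [
    4.517395047937127, 4.894265404350079, 3.0392090466559876,
    1.8786361640757654, 0.8973106042078115, 0.3715780507684368,
    0.20815690742907672, 0.17333033743125845, 0.15686783249863873,
    0.12584413600569616,
]

# Same shape as the assertion in the traceback: the 2nd error must exceed
# the final error by more than 2.
improvement = errors[1] - errors[-1]
assert improvement > 2, "%s not greater than 2" % improvement
```

With these local values the difference is about 4.77, so the check passes; the Jenkins run above only reached about 1.67.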
I am thinking of using 15 batches of data instead of 10, so the model is trained for longer. Hopefully the difference between the 2nd error and the 15th error will then always be larger than 2 on Jenkins. These are the 15 errors I got locally:
```
4.517395047937127
4.894265404350079
3.0392090466559876
1.8786361640757658
0.8973106042078115
0.3715780507684368
0.20815690742907672
0.17333033743125845
0.15686783249863873
0.12584413600569616
0.11883853835108477
0.09400261862100823
0.08887491447353497
0.05984929624986607
0.07583948141520978
```
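
The change itself would be small. Below is a rough sketch, assuming the test builds its training batches in a loop over `LinearDataGenerator.generateLinearInput`; the generator arguments and variable names are placeholders for illustration, not copied from the test.

```python
from pyspark import SparkContext
from pyspark.mllib.util import LinearDataGenerator

sc = SparkContext.getOrCreate()

# Build 15 training batches instead of 10 so the streaming model keeps
# improving for longer before the 2nd and last errors are compared.
# The generateLinearInput arguments (intercept, weights, xMean, xVariance,
# nPoints, seed, eps) below are placeholder values.
batches = []
for i in range(15):  # previously range(10)
    batch = LinearDataGenerator.generateLinearInput(
        0.0, [10.0], [0.0], [1.0 / 3.0], 100, 42 + i, 0.1)
    batches.append(sc.parallelize(batch))
```

Each batch is then fed to the streaming model as before; only the number of batches changes.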

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested

Closes #29380 from huaxingao/flaky_test.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-08-06 13:54:15 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| `linalg` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `stat` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `tests` | [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests | 2020-08-06 13:54:15 -07:00 |
| `__init__.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `classification.py` | [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation | 2019-07-05 10:08:22 -07:00 |
| `clustering.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `common.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `evaluation.py` | [SPARK-29489][ML][PYSPARK] ml.evaluation support log-loss | 2019-10-18 17:57:13 +08:00 |
| `feature.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `fpm.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `random.py` | [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation | 2019-07-05 10:08:22 -07:00 |
| `recommendation.py` | [SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter | 2019-03-23 11:26:09 -05:00 |
| `regression.py` | [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis | 2019-01-17 19:40:39 -06:00 |
| `tree.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |
| `util.py` | [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 | 2020-07-14 11:22:44 +09:00 |