### What changes were proposed in this pull request?
In LogisticRegression and LinearRegression, if set maxIter=n, the model.summary.totalIterations returns n+1 if the training procedure does not drop out. This is because we use ```objectiveHistory.length``` as totalIterations, but ```objectiveHistory``` contains init sate, thus ```objectiveHistory.length``` is 1 larger than number of training iterations.
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new tests and also modify existing tests
Closes#28786 from huaxingao/summary_iter.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark
### Why are the changes needed?
Currently we have
```
since("2.0.0")
def evaluate(self, dataset):
if not isinstance(dataset, DataFrame):
raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
java_blr_summary = self._call_java("evaluate", dataset)
return BinaryLogisticRegressionSummary(java_blr_summary)
```
we should return LogisticRegressionSummary for multiclass logistic regression
### Does this PR introduce _any_ user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python
### How was this patch tested?
unit test
Closes#28503 from huaxingao/lr_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
Currently, pretty skipped message added by f7435bec6a mechanism seems not working when xmlrunner is installed apparently.
This PR fixes two things:
1. When `xmlrunner` is installed, seems `xmlrunner` does not respect `vervosity` level in unittests (default is level 1).
So the output looks as below
```
Running tests...
----------------------------------------------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
----------------------------------------------------------------------
```
So it is not caught by our message detection mechanism.
2. If we manually set the `vervocity` level to `xmlrunner`, it prints messages as below:
```
test_mixed_udf (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s)
test_mixed_udf_and_sql (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s)
...
```
This is different in our Jenkins machine:
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.'
...
```
Note that last `SKIP` is different. This PR fixes the regular expression to catch `SKIP` case as well.
## How was this patch tested?
Manually tested.
**Before:**
```
Starting test(python2.7): pyspark....
Finished test(python2.7): pyspark.... (0s)
...
Tests passed in 562 seconds
========================================================================
...
```
**After:**
```
Starting test(python2.7): pyspark....
Finished test(python2.7): pyspark.... (48s) ... 93 tests were skipped
...
Tests passed in 560 seconds
Skipped tests pyspark.... with python2.7:
pyspark...(...) ... SKIP (0.000s)
...
========================================================================
...
```
Closes#24927 from HyukjinKwon/SPARK-28130.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Currently, some of PySpark tests sill assume the tests could be ran in Python 2.6 by importing `unittest2`. For instance:
```python
if sys.version_info[:2] <= (2, 6):
try:
import unittest2 as unittest
except ImportError:
sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
sys.exit(1)
else:
import unittest
```
While I am here, I removed some of unused imports and reordered imports per PEP 8.
We officially dropped Python 2.6 support a while ago and started to discuss about Python 2 drop. It's better to remove them out.
## How was this patch tested?
Manually tests, and existing tests via Jenkins.
Closes#23077 from HyukjinKwon/SPARK-26105.
Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain.
The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│ ├── __init__.py
...
│ ├── tests
│ │ ├── __init__.py
│ │ ├── test_algorithms.py
│ │ ├── test_base.py
│ │ ├── test_evaluation.py
│ │ ├── test_feature.py
│ │ ├── test_image.py
│ │ ├── test_linalg.py
│ │ ├── test_param.py
│ │ ├── test_persistence.py
│ │ ├── test_pipeline.py
│ │ ├── test_stat.py
│ │ ├── test_training_summary.py
│ │ ├── test_tuning.py
│ │ └── test_wrapper.py
...
├── testing
...
│ ├── mlutils.py
...
```
## How was this patch tested?
Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all passing with Python 2.7 and Python 3.6.
Closes#23063 from BryanCutler/python-test-breakup-ml-SPARK-26033.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>