Commit graph

604 commits

Author SHA1 Message Date
Huaxin Gao 786d773585 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
make the doc look better

### Does this PR introduce _any_ user-facing change?
yes, API doc change

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:59 -07:00
Dongjoon Hyun d841679ecc [MINOR][3.2] Remove unused numpy import
### What changes were proposed in this pull request?

This fixed Python linter failure.

### Why are the changes needed?

```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action Linter job.

Closes #33841 from dongjoon-hyun/unused_import.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 09:52:54 +09:00
Gengliang Wang 5463caac0d Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on master branch later

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Ci tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:39:29 -07:00
Dominik Gehl 29c0b1e7b1 [SPARK-36225][PYTHON][DOCS] Use DataFrame in python docstrings
### What changes were proposed in this pull request?
Changing references to Dataset in python docstrings to DataFrame

### Why are the changes needed?
no Dataset class in pyspark

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change only

Closes #33438 from dominikgehl/feature/SPARK-36225.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae1c20ee0d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:58:19 +09:00
Hyukjin Kwon e9cc81151d [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
This PR proposes to use Python 3.9 in documentation and linter at GitHub Actions. This PR also contains the fixes for mypy check (introduced by Python 3.9 upgrade)

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py💯 error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

Python 3.6 is deprecated at SPARK-35938

No. Maybe static analysis, etc. by some type hints but they are really non-breaking..

I manually checked by GitHub Actions build in forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a71dd6af2f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 11:41:53 +09:00
Ruifeng Zheng 37f70422b5 [SPARK-35678][ML][FOLLOWUP] Revert changes in ANN
### What changes were proposed in this pull request?
revert changes related to ANN

### Why are the changes needed?
using the new `softmax` may cause flaky failure

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
reverted testsuite

Closes #33049 from zhengruifeng/revert_softmax_ann.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:02:28 +09:00
Ruifeng Zheng a66738823c [SPARK-35678][ML][FOLLOWUP] softmax support offset and step
### What changes were proposed in this pull request?
softmax support offset and step, then we can use it in ANN and NB

### Why are the changes needed?
to simplify impl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuite

Closes #32991 from zhengruifeng/softmax_support_offset_step.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
2021-06-23 21:03:18 -05:00
Hyukjin Kwon 6d309914df [SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when starting the thread, and use inheritable thread in the current codebase
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.

### Why are the changes needed?

To mimic the JVM behaviour about thread lifecycle

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object.
This is a small difference that wouldn't likely affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-20 11:48:38 +09:00
Hyukjin Kwon 20750a3f9e [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception
### What changes were proposed in this pull request?

This PR proposes to use a proper built-in exceptions instead of the plain `Exception` in Python.

While I am here, I fixed another minor issue at `DataFrams.schema` together:

```diff
- except AttributeError as e:
-     raise Exception(
-         "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+     raise ValueError(
+         "Unable to parse datatype from schema. %s" % e) from e
```

Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.

### Why are the changes needed?

For users to expect the proper exceptions.

### Does this PR introduce _any_ user-facing change?

Yeah, the exception classes became different but should be compatible because previous exception was plain `Exception` which other exceptions inherit.

### How was this patch tested?

Existing unittests should cover,

Closes #31238

Closes #32650 from HyukjinKwon/SPARK-32194.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 11:54:40 +09:00
Sean Owen a37cce95c2 [MINOR][DOCS] Add required imports to CV, train validation split Pyspark ML examples
### What changes were proposed in this pull request?

Add required imports to Pyspark ML examples in CrossValidator, TrainValidationSplit

### Why are the changes needed?

The examples pass doctests because of previous imports, but as they appear in Pyspark documentation, are incomplete. The additional imports are required to make the example work.

### Does this PR introduce _any_ user-facing change?

No, docs only change.

### How was this patch tested?

Existing tests.

Closes #32554 from srowen/TuningImports.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-05-15 08:13:54 -05:00
Ruifeng Zheng f7704ece40 [SPARK-35392][ML][PYTHON] Fix flaky tests in ml/clustering.py and ml/feature.py
### What changes were proposed in this pull request?

This PR removes the check of `summary.logLikelihood` in  ml/clustering.py - this GMM test is quite flaky. It fails easily e.g., if:
- change number of partitions;
- just change the way to compute the sum of weights;
- change the underlying BLAS impl

Also uses more permissive precision on `Word2Vec` test case.

### Why are the changes needed?

To recover the build and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test cases.

Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.

Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-13 22:23:51 +09:00
Sean Owen a189be8754 [MINOR][DOCS] Avoid some python docs where first sentence has "e.g." or similar
### What changes were proposed in this pull request?

Avoid some python docs where first sentence has "e.g." or similar as the period causes the docs to show only half of the first sentence as the summary.

### Why are the changes needed?

See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated.

### Does this PR introduce _any_ user-facing change?

Only changes docs.

### How was this patch tested?

Manual testing of docs.

Closes #32508 from srowen/TruncatedPythonDesc.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 10:38:59 +09:00
Yikun Jiang 44b7931936 [SPARK-35176][PYTHON] Standardize input validation error type
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas

### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.

[1] https://docs.python.org/3/library/exceptions.html#TypeError

Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.

### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.

### How was this patch tested?
Existing test case and UT

Closes #32368 from Yikun/SPARK-35176.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 15:34:24 +09:00
Ludovic Henry 5b77ebb57b [SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib
### What changes were proposed in this pull request?

Following https://github.com/apache/spark/pull/30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package.

The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focus on accelerating the linear algebra routines in use in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return the exact same results as the reference implementation.

Under the hood, it reimplements the necessary algorithms in pure autovectorization-friendly Java 8, as well as takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available.

A table summarising which version gets loaded in which case:

```
|                       | BLAS.nativeBLAS                                    | BLAS.javaBLAS                                      |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
| with -Pnetlib-lgpl    | 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a     | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
|                       |     wrapper for com.github.fommil:all              |    (JDK16+, relies on the Vector API, requires     |
|                       | 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    |     `--add-modules=jdk.incubator.vector` on JDK16) |
|                       |    relies on the Foreign Linker API, requires      | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
|                       |    `--add-modules=jdk.incubator.foreign            | 3. dev.ludovic.netlib.blas.JavaBLAS                |
|                       |     -Dforeign.restricted=warn`)                    | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
|                       | 3. fails to load, falls back to BLAS.javaBLAS in   |     wrapper for com.github.fommil:core             |
|                       |     org.apache.spark.ml.linalg.BLAS                |                                                    |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
| without -Pnetlib-lgpl | 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
|                       |    relies on the Foreign Linker API, requires      |    (JDK16+, relies on the Vector API, requires     |
|                       |    `--add-modules=jdk.incubator.foreign            |     `--add-modules=jdk.incubator.vector` on JDK16) |
|                       |     -Dforeign.restricted=warn`)                    | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
|                       | 2. fails to load, falls back to BLAS.javaBLAS in   | 3. dev.ludovic.netlib.blas.JavaBLAS                |
|                       |     org.apache.spark.ml.linalg.BLAS                | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
|                       |                                                    |     wrapper for com.github.fommil:core             |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
```

### Why are the changes needed?

Accelerates linear algebra operations when the pure-java fallback method is in use. Transparently falls back to native implementation (OpenBLAS, MKL) when available.

### Does this PR introduce _any_ user-facing change?

No, all changes are transparent to the user.

### How was this patch tested?

The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite.

[1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`:
#### JDK8:
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java8BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 223            232           8        448.0           2.2       1.0X
[info] java                                                221            228           7        453.0           2.2       1.0X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 122            128           4        821.2           1.2       1.0X
[info] java                                                122            128           4        822.3           1.2       1.0X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 109            112           2        921.4           1.1       1.0X
[info] java                                                 70             74           3       1423.5           0.7       1.5X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             98           2       1046.1           1.0       1.0X
[info] java                                                 47             49           2       2121.7           0.5       2.0X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 184            195           8        544.3           1.8       1.0X
[info] java                                                185            196           7        539.5           1.9       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  99            104           4       1011.9           1.0       1.0X
[info] java                                                 99            104           4       1010.4           1.0       1.0X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        947.2           1.1       1.0X
[info] java                                                  0              0           0       1584.8           0.6       1.7X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        867.4           1.2       1.0X
[info] java                                                  1              1           0        865.0           1.2       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        485.9           2.1       1.0X
[info] java                                                  1              1           0        486.8           2.1       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1843.0           0.5       1.0X
[info] java                                                  0              0           0       2690.6           0.4       1.5X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1214.7           0.8       1.0X
[info] java                                                  0              0           0       2536.8           0.4       2.1X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1895.9           0.5       1.0X
[info] java                                                  0              0           0       2961.1           0.3       1.6X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1223.4           0.8       1.0X
[info] java                                                  0              0           0       3091.4           0.3       2.5X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 560            575          20       1787.1           0.6       1.0X
[info] java                                                226            232           5       4432.4           0.2       2.5X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 570            586          23       1755.2           0.6       1.0X
[info] java                                                227            232           4       4410.1           0.2       2.5X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 863            879          17       1158.4           0.9       1.0X
[info] java                                                227            231           3       4407.9           0.2       3.8X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1282           1305          23        780.0           1.3       1.0X
[info] java                                                227            232           4       4413.4           0.2       5.7X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 538            548           8       1858.6           0.5       1.0X
[info] java                                                221            226           3       4521.1           0.2       2.4X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 549            558          10       1819.9           0.5       1.0X
[info] java                                                222            229           7       4503.5           0.2       2.5X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 838            852          12       1193.0           0.8       1.0X
[info] java                                                222            229           5       4500.5           0.2       3.8X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 905            919          18       1104.8           0.9       1.0X
[info] java                                                221            228           5       4521.3           0.2       4.1X
```

#### JDK11:
```
[info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java11BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 195            204          10        512.7           2.0       1.0X
[info] java                                                195            202           7        512.4           2.0       1.0X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 108            113           4        923.3           1.1       1.0X
[info] java                                                102            107           4        984.4           1.0       1.1X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 107            110           3        938.1           1.1       1.0X
[info] java                                                 69             72           3       1447.1           0.7       1.5X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             98           2       1046.5           1.0       1.0X
[info] java                                                 43             45           2       2317.1           0.4       2.2X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 155            168           8        644.2           1.6       1.0X
[info] java                                                158            169           8        632.8           1.6       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  85             90           4       1178.1           0.8       1.0X
[info] java                                                 86             90           4       1167.7           0.9       1.0X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       1182.1           0.8       1.0X
[info] java                                                  0              0           0       1432.1           0.7       1.2X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        898.7           1.1       1.0X
[info] java                                                  1              1           0        891.5           1.1       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        495.4           2.0       1.0X
[info] java                                                  1              1           0        495.7           2.0       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2271.6           0.4       1.0X
[info] java                                                  0              0           0       3648.1           0.3       1.6X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1229.3           0.8       1.0X
[info] java                                                  0              0           0       2711.3           0.4       2.2X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2677.5           0.4       1.0X
[info] java                                                  0              0           0       3288.2           0.3       1.2X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1233.0           0.8       1.0X
[info] java                                                  0              0           0       2766.3           0.4       2.2X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 520            536          16       1923.6           0.5       1.0X
[info] java                                                214            221           7       4669.5           0.2       2.4X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 593            612          17       1686.5           0.6       1.0X
[info] java                                                215            219           3       4643.3           0.2       2.8X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 853            870          16       1172.8           0.9       1.0X
[info] java                                                215            218           3       4659.7           0.2       4.0X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1350           1370          23        740.8           1.3       1.0X
[info] java                                                215            219           4       4656.6           0.2       6.3X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 460            468           6       2173.2           0.5       1.0X
[info] java                                                210            213           2       4752.7           0.2       2.2X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 535            544           8       1869.3           0.5       1.0X
[info] java                                                210            215           5       4761.8           0.2       2.5X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 843            853          11       1186.8           0.8       1.0X
[info] java                                                209            214           4       4793.4           0.2       4.0X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 891            904          15       1122.0           0.9       1.0X
[info] java                                                209            214           4       4777.2           0.2       4.3X
```

#### JDK16:
```
[info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.VectorizedBLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 194            199           7        515.7           1.9       1.0X
[info] java                                                181            186           3        551.1           1.8       1.1X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 109            115           4        915.0           1.1       1.0X
[info] java                                                 88             92           3       1138.8           0.9       1.2X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 108            110           2        922.6           1.1       1.0X
[info] java                                                 54             56           2       1839.2           0.5       2.0X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             97           2       1046.1           1.0       1.0X
[info] java                                                 29             30           1       3393.4           0.3       3.2X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 156            165           5        643.0           1.6       1.0X
[info] java                                                150            159           5        667.1           1.5       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  85             91           6       1171.0           0.9       1.0X
[info] java                                                 75             79           3       1340.6           0.7       1.1X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        917.0           1.1       1.0X
[info] java                                                  0              0           0       8147.2           0.1       8.9X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        859.3           1.2       1.0X
[info] java                                                  1              1           0        859.3           1.2       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        482.1           2.1       1.0X
[info] java                                                  1              1           0        482.6           2.1       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2214.2           0.5       1.0X
[info] java                                                  0              0           0       7975.8           0.1       3.6X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1231.4           0.8       1.0X
[info] java                                                  0              0           0       8680.9           0.1       7.0X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2684.3           0.4       1.0X
[info] java                                                  0              0           0      18527.1           0.1       6.9X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1235.4           0.8       1.0X
[info] java                                                  0              0           0      17347.9           0.1      14.0X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 530            552          18       1887.5           0.5       1.0X
[info] java                                                 58             64           3      17143.9           0.1       9.1X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 598            620          17       1671.1           0.6       1.0X
[info] java                                                 58             64           3      17196.6           0.1      10.3X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 834            847          14       1199.4           0.8       1.0X
[info] java                                                 57             63           4      17486.9           0.1      14.6X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1338           1366          22        747.3           1.3       1.0X
[info] java                                                 58             63           3      17356.6           0.1      23.2X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 489            501           9       2045.5           0.5       1.0X
[info] java                                                 36             38           2      27721.9           0.0      13.6X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 478            488           9       2094.0           0.5       1.0X
[info] java                                                 36             38           2      27813.2           0.0      13.3X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 825            837          10       1211.6           0.8       1.0X
[info] java                                                 35             38           2      28433.1           0.0      23.5X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 900            918          15       1111.6           0.9       1.0X
[info] java                                                 36             38           2      28073.0           0.0      25.3X
```

[2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas

Closes #32253 from luhenry/master.

Authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-27 14:00:59 -05:00
Julien Lafaye 592230e47b [MINOR][DOCS][ML] Explicit return type of array_to_vector utility function
There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector

In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
The documentation is ambiguous & can lead to the false conclusion that instances of
pyspark.mllib.linalg.DenseVector will be returned.
Conversion from ml versions to mllib versions can easly be achieved with
mlutils.convertVectorColumnsToML helper.

### What changes were proposed in this pull request?
Make documentation more explicit

### Why are the changes needed?
The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test were run as only the documentation was changed

Closes #32255 from jlafaye/master.

Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-27 09:08:26 -05:00
Ruifeng Zheng 1f150b9392 [SPARK-35024][ML] Refactor LinearSVC - support virtual centering
### What changes were proposed in this pull request?
1, remove existing agg, and use a new agg supporting virtual centering
2, add related testsuites

### Why are the changes needed?
centering vectors should accelerate convergence, and generate solution more close to R

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated testsuites and added testsuites

Closes #32124 from zhengruifeng/svc_agg_refactor.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-04-25 13:16:46 +08:00
harupy b6350f5bb0 [SPARK-35142][PYTHON][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel
### What changes were proposed in this pull request?

Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #32245 from harupy/SPARK-35142.

Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-04-21 16:29:10 +08:00
HyukjinKwon 2ca76a57be [MINOR][DOCS] Use ASCII characters when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR replaces the non-ASCII characters to ASCII characters when possible in PySpark documentation

### Why are the changes needed?

To avoid unnecessarily using other non-ASCII characters which could lead to the issue such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Found via (Mac OS):

```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```

Closes #32048 from HyukjinKwon/minor-fix.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-04 09:49:36 +03:00
Sean Owen ed641fbad6 [MINOR][DOCS][ML] Doc 'mode' as a supported Imputer strategy in Pyspark
### What changes were proposed in this pull request?

Document `mode` as a supported Imputer strategy in Pyspark docs.

### Why are the changes needed?

Support was added in 3.1, and documented in Scala, but some Python docs were missed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #31883 from srowen/ImputerModeDocs.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-20 01:16:49 -05:00
Sean Owen 2f30cdebb1 [SPARK-34642][DOCS][ML] Fix TypeError in Pyspark Linear Regression docs
### What changes were proposed in this pull request?

Fix a call to setParams in the Linear Regression docs example in Pyspark to avoid a TypeError.

### Why are the changes needed?

The example is slightly wrong and we should not show an error in the docs.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Existing tests

Closes #31760 from srowen/SPARK-34642.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-06 07:32:01 -08:00
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with  its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
Liang-Chi Hsieh 1fbd576410 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document
### What changes were proposed in this pull request?

This follows up #31160 to update score function in the document.

### Why are the changes needed?

Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me the sklearn's naming. It is good to have it but I think it is nice if we have formal score function name with sklearn's ones.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, only doc change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-10 09:24:25 +09:00
Ruifeng Zheng 2c4e4f8412 [SPARK-34189][ML] w2v findSynonyms optimization
### What changes were proposed in this pull request?
1, use Guavaording instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector

### Why are the changes needed?
this pr is about 30% faster than existing impl

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
existing testsuites

Closes #31276 from zhengruifeng/w2v_findSynonyms_opt.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-01-27 10:08:53 +08:00
Ruifeng Zheng d8cbef1abf [SPARK-34093][ML] param maxDepth should check upper bound
### What changes were proposed in this pull request?
update the ParamValidators of `maxDepth`

### Why are the changes needed?
current impl of tree models only support maxDepth<=30

### Does this PR introduce _any_ user-facing change?
If `maxDepth`>30, fail quickly

### How was this patch tested?
existing testsuites

Closes #31163 from zhengruifeng/param_maxDepth_upbound.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-18 11:36:10 -06:00
zero323 098f2268e4 [SPARK-33730][PYTHON] Standardize warning types
### What changes were proposed in this pull request?

This PR:

- Adds as small  hierarchy of warnings to be used in PySpark applications. These extend built-in classes and top level `PySparkWarning`.
- Replaces `DeprecationWarnings` (intended for developers) with PySpark specific subclasses of `FutureWarning` (intended for end users).

### Why are the changes needed?

- To be more precise and add users additional control (in addition to standard module level filters) over PySpark warnings handling.
- Correct semantics (at the moment we use `DeprecationWarning` in user-facing API, but it is intended "for warnings about deprecated features when those warnings are intended for other Python developers").

### Does this PR introduce _any_ user-facing change?

Yes. Code can raise different type of warning than before.

### How was this patch tested?

Existing tests.

Closes #30985 from zero323/SPARK-33730.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 09:32:55 +09:00
Huaxin Gao f3548837c6 [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector

### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.

### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures",  numTopFeatures=100)
```

Or

numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures",  numTopFeatures=100)
```

### How was this patch tested?
Add Unit test

Closes #31160 from huaxingao/UnivariateSelector.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-01-16 11:09:23 +08:00
Fokko Driesprong e4d1c10760 [SPARK-32320][PYSPARK] Remove mutable default arguments
This is bad practice, and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/

```
fokkodriesprongFan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
```

```
fokkodriesprongFan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
```

### What changes were proposed in this pull request?

Removing the mutable default arguments.

### Why are the changes needed?

Removing the mutable default arguments, and changing the signature to `Optional[...]`.

### Does this PR introduce _any_ user-facing change?

No 👍

### How was this patch tested?

Using the Flake8 bugbear code analysis plugin.

Closes #29122 from Fokko/SPARK-32320.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-08 09:35:36 +08:00
Weichen Xu 7e759b2d95 [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator
### What changes were proposed in this pull request?
make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

### Why are the changes needed?
Currently, pyspark support third-party library to define python backend estimator/evaluator, i.e., estimator that inherit `Estimator` instead of `JavaEstimator`, and only can be used in pyspark.

CrossValidator and TrainValidateSplit support tuning these python backend estimator,
but cannot support saving/load, becase CrossValidator and TrainValidateSplit writer implementation is use JavaMLWriter, which require to convert nested estimator and evaluator into java instance.

OneVsRest saving/load now only support java backend classifier due to similar issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30471 from WeichenXu123/support_pyio_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-12-04 08:35:50 +08:00
Liang-Chi Hsieh 3b2ff16ee6 [SPARK-33636][PYTHON][ML][FOLLOWUP] Update since tag of labelsArray in StringIndexer
### What changes were proposed in this pull request?

This is to update `labelsArray`'s since tag.

### Why are the changes needed?

The original change was backported to branch-3.0 for 3.0.2 version. So it is better to update the since tag to reflect the fact.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A. Just tag change.

Closes #30582 from viirya/SPARK-33636-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 14:34:44 +09:00
Liang-Chi Hsieh 0880989755 [SPARK-22798][PYTHON][ML][FOLLOWUP] Add labelsArray to PySpark StringIndexer
### What changes were proposed in this pull request?

This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.

### Why are the changes needed?

`labelsArray` is for multi-column case for `StringIndexer`. We should provide this accessor at PySpark side too.

### Does this PR introduce _any_ user-facing change?

Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.

### How was this patch tested?

Unit test.

Closes #30579 from viirya/SPARK-22798-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 10:57:14 +09:00
Weichen Xu 80161238fe [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-01 09:36:42 +08:00
Bryan Cutler aeb3649fb9 [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests
### What changes were proposed in this pull request?

This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.

- https://docs.python.org/3/library/unittest.html#deprecated-aliases

### Why are the changes needed?

For consistency and eventual removal of deprecated APIs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 10:34:40 +09:00
Weichen Xu 596fbc1d29 [SPARK-33556][ML] Add array_to_vector function for dataframe column
### What changes were proposed in this pull request?

Add array_to_vector function for dataframe column

### Why are the changes needed?
Utility function for array to vector conversion.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
scala unit test & doctest.

Closes #30498 from WeichenXu123/array_to_vec.

Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 09:52:19 +09:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
zero323 665817bd4f [SPARK-33457][PYTHON] Adjust mypy configuration
### What changes were proposed in this pull request?

This pull request:

- Adds following flags to the main mypy configuration:
  - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional)
  - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional)
  - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls)

These flags are enabled only for public API and disabled for tests and internal modules.

Additionally, these PR fixes missing annotations.

### Why are the changes needed?

Primary reason to propose this changes is to use standard configuration as used by typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example https://github.com/apache/spark/pull/29122#pullrequestreview-513112882

Additionally, it will allow us to detect cases where annotations have unintentionally omitted.

### Does this PR introduce _any_ user-facing change?

Annotations only.

### How was this patch tested?

`dev/lint-python`.

Closes #30382 from zero323/SPARK-33457.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-25 09:27:04 +09:00
Ruifeng Zheng 116b7b72a1 [SPARK-33466][ML][PYTHON] Imputer support mode(most_frequent) strategy
### What changes were proposed in this pull request?
impl a new strategy `mode`: replace missing using the most frequent value along each column.

### Why are the changes needed?
it is highly scalable, and had been a function in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time.

### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added

### How was this patch tested?
updated testsuites

Closes #30397 from zhengruifeng/imputer_max_freq.

Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-20 11:35:34 -06:00
zhengruifeng 689c294102 [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR
### What changes were proposed in this pull request?
use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors;

### Why are the changes needed?
the performance gain is mainly related to the nnz of block.

### Does this PR introduce _any_ user-facing change?
yes, param blockSize -> blockSizeInMB in master

### How was this patch tested?
updated testsuites

Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.

Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-11-18 23:02:31 +08:00
zhengruifeng a2887164bc [SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC
### What changes were proposed in this pull request?
1, use `maxBlockSizeInMB` instead of `blockSize`(#rows) to control the stacking of vectors;
2, infer an appropriate `maxBlockSizeInMB` if set 0;

### Why are the changes needed?
the performance gain is mainly related to the nnz of block.

f2jBLAS |   |   |   |   |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Duration(millisecond) | branch 3.0 Impl | blockSizeInMB=0.0625 | blockSizeInMB=0.125 | blockSizeInMB=0.25 | blockSizeInMB=0.5 | blockSizeInMB=1 | blockSizeInMB=2 | blockSizeInMB=4 | blockSizeInMB=8 | blockSizeInMB=16 | blockSizeInMB=32 | blockSizeInMB=64 | blockSizeInMB=128
epsilon(100%) | 326481 | 26143 | 25710 | 24726 | 25395 | 25840 | 26846 | 25927 | 27431 | 26190 | 26056 | 26347 | 27204
epsilon3000(67%) | 455247 | 35893 | 34366 | 34985 | 38387 | 38901 | 40426 | 40044 | 39161 | 38767 | 39965 | 39523 | 39108
epsilon4000(50%) | 306390 | 42256 | 41164 | 43748 | 48638 | 50892 | 50986 | 51091 | 51072 | 51289 | 51652 | 53312 | 52146
epsilon5000(40%) | 307619 | 43639 | 42992 | 44743 | 50800 | 51939 | 51871 | 52190 | 53850 | 52607 | 51062 | 52509 | 51570
epsilon10000(20%) | 310070 | 58371 | 55921 | 56317 | 56618 | 53694 | 52131 | 51768 | 51728 | 52233 | 51881 | 51653 | 52440
epsilon20000(10%) | 316565 | 109193 | 95121 | 82764 | 69653 | 60764 | 56066 | 53371 | 52822 | 52872 | 52769 | 52527 | 53508
epsilon200000(1%) | 336181 | 1569721 | 1069355 | 673718 | 375043 | 218230 | 145393 | 110926 | 94327 | 87039 | 83926 | 81890 | 81787
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  | Speedup |   |   |   |   |   |   |   |   |   |   |   |  
epsilon(100%) | 1 | 12.48827602 | 12.69859977 | **13.20395535** | 12.85611341 | 12.63471362 | 12.16125307 | 12.59231689 | 11.90189931 | 12.46586483 | 12.5299739 | 12.39158158 | 12.00121306
epsilon3000(67%) | 1 | 12.68344803 | **13.2470174** | 13.01263399 | 11.85940553 | 11.70270687 | 11.26124276 | 11.36866946 | 11.62500958 | 11.74315784 | 11.39114225 | 11.51853351 | 11.64076404
epsilon4000(50%) | 1 | 7.250804619 | **7.443154212** | 7.003520161 | 6.299395534 | 6.020396133 | 6.00929667 | 5.996946625 | 5.999177632 | 5.973795551 | 5.931812902 | 5.747111345 | 5.875618456
epsilon5000(40%) | 1 | 7.049176196 | **7.155261444** | 6.875243055 | 6.055492126 | 5.92269778 | 5.930462108 | 5.894213451 | 5.712516249 | 5.847491779 | 6.024421292 | 5.858405226 | 5.965076595
epsilon10000(20%) | 1 | 5.312055644 | 5.544786395 | 5.505797539 | 5.4765269 | 5.774760681 | 5.947900481 | 5.98960748 | 5.994239097 | 5.93628549 | 5.976561747 | **6.002942714** | 5.912852784
epsilon20000(10%) | 1 | 2.899132728 | 3.328024306 | 3.824911797 | 4.544886796 | 5.209745902 | 5.64629187 | 5.931404695 | 5.993052137 | 5.987384627 | 5.999071425 | **6.026710073** | 5.916218136
epsilon200000(1%) | 1 | 0.214166084 | 0.314377358 | 0.498993644 | 0.896379882 | 1.540489392 | 2.312222734 | 3.03067811 | 3.563995463 | 3.862417997 | 4.005683578 | 4.105275369 | **4.110445425**

OpenBLAS |   |   |   |   |   |   |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Duration(millisecond) | branch 3.0 Impl | blockSizeInMB=0.0625 | blockSizeInMB=0.125 | blockSizeInMB=0.25 | blockSizeInMB=0.5 | blockSizeInMB=1 | blockSizeInMB=2 | blockSizeInMB=4 | blockSizeInMB=8 | blockSizeInMB=16 | blockSizeInMB=32 | blockSizeInMB=64 | blockSizeInMB=128
epsilon(100%) | 299119 | 26047 | 25049 | 25239 | 28001 | 35138 | 36438 | 36279 | 36114 | 35111 | 35428 | 36295 | 35197
epsilon3000(67%) | 439798 | 33321 | 34423 | 34336 | 38906 | 51756 | 54138 | 54085 | 53412 | 54766 | 54425 | 54221 | 54842
epsilon4000(50%) | 302963 | 42960 | 40678 | 43483 | 48254 | 50888 | 54990 | 52647 | 51947 | 51843 | 52891 | 53410 | 52020
epsilon5000(40%) | 303569 | 44225 | 44961 | 45065 | 51768 | 52776 | 51930 | 53587 | 53104 | 51833 | 52138 | 52574 | 53756
epsilon10000(20%) | 307403 | 58447 | 55993 | 56757 | 56694 | 54038 | 52734 | 52073 | 52051 | 52150 | 51986 | 52407 | 52390
epsilon20000(10%) | 313344 | 107580 | 94679 | 83329 | 70226 | 60996 | 57130 | 55461 | 54641 | 52712 | 52541 | 53101 | 53312
epsilon200000(1%) | 334679 | 1642726 | 1073148 | 654481 | 364974 | 213881 | 140248 | 107579 | 91757 | 85090 | 81940 | 80492 | 80250
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  |   |   |   |   |   |   |   |   |   |   |   |   |  
  | Speedup |   |   |   |   |   |   |   |   |   |   |   |  
epsilon(100%) | 1 | 11.48381771 | **11.94135494** | 11.85146004 | 10.68243991 | 8.512692811 | 8.208985125 | 8.244962651 | 8.282632774 | 8.519238985 | 8.443011178 | 8.241328007 | 8.498423161
epsilon3000(67%) | 1 | 13.19882356 | 12.7762833 | **12.80865564** | 11.30411762 | 8.497526857 | 8.123646976 | 8.131607655 | 8.234067251 | 8.030493372 | 8.080808452 | 8.111211523 | 8.01936472
epsilon4000(50%) | 1 | 7.052211359 | **7.44783421** | 6.967389555 | 6.278505409 | 5.953525389 | 5.509419895 | 5.754610899 | 5.832155851 | 5.843855487 | 5.728063376 | 5.672402172 | 5.823971549
epsilon5000(40%) | 1 | **6.86419446** | 6.751829363 | 6.736247642 | 5.864027971 | 5.752027437 | 5.845734643 | 5.664974714 | 5.716499699 | 5.856674319 | 5.822413595 | 5.774127896 | 5.647164968
epsilon10000(20%) | 1 | 5.259517169 | 5.490025539 | 5.416124883 | 5.422143437 | 5.688645028 | 5.829313157 | 5.903308816 | 5.905803923 | 5.894592522 | **5.913188166** | 5.865685882 | 5.867589235
epsilon20000(10%) | 1 | 2.912660346 | 3.309540658 | 3.760323537 | 4.461937174 | 5.137123746 | 5.48475407 | 5.649807973 | 5.734594901 | 5.944452876 | **5.963799699** | 5.900905821 | 5.87755102
epsilon200000(1%) | 1 | 0.203733915 | 0.311866583 | 0.511365494 | 0.916994087 | 1.564790701 | 2.38633706 | 3.111006795 | 3.647449241 | 3.933235398 | 4.084439834 | 4.157916315 | **4.170454829**

### Does this PR introduce _any_ user-facing change?
yes, param `blockSize` -> `blockSizeInMB` in master

### How was this patch tested?
added testsuites and performance test (result attached in [ticket](https://issues.apache.org/jira/browse/SPARK-32907))

Closes #30009 from zhengruifeng/adaptively_blockify_linear_svc_II.

Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-11-12 19:14:07 +08:00
Ruifeng Zheng 6244407ce6 Revert "[WIP] Test (#30327)"
This reverts commit 61ee5d8a4e.

### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 11:32:12 +09:00
WeichenXu 61ee5d8a4e
[WIP] Test (#30327)
* resend

* address comments

* directly gen new Iter

* directly gen new Iter

* update blockify strategy

* address comments

* try to fix 2.13

* try to fix scala 2.13

* use 1.0 as the default value for gemv

* update

Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
2020-11-12 10:20:33 +08:00
zero323 122c8999cb [SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns PrefixSpan.findFrequentSequentialPatterns
### What changes were proposed in this pull request?

Changes

    pyspark.sql.dataframe.DataFrame

to

    :py:class:`pyspark.sql.DataFrame`

### Why are the changes needed?

Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).

### Does this PR introduce _any_ user-facing change?

User will see shorter reference with a link.

### How was this patch tested?

`dev/lint-python` and manual check of the rendered docs.

Closes #30313 from zero323/SPARK-33251-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-11-10 09:17:00 -08:00
zero323 090962cd42 [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*)
### What changes were proposed in this pull request?

This PR proposes migration of `pyspark.ml` to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

`dev/lint-python` and manual inspection.

Closes #30285 from zero323/SPARK-33251.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 09:33:48 +09:00
Alessandro Patti 4a33cd928d [SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors
### What changes were proposed in this pull request?

Increase tolerance for two tests that fail in some environments and fail in others (flaky? Pass/fail is constant within the same environment)

### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
    predictions[0]
Expected:
    Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
    Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This path changes a test target. Just executed the tests to verify they pass.

Closes #30104 from AlessandroPatti/apatti/rounding-errors.

Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-21 18:14:21 -07:00
zero323 72da6f86cf [SPARK-33002][PYTHON] Remove non-API annotations
### What changes were proposed in this pull request?

This PR:

- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.

### Why are the changes needed?

Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and additional MyPy checks:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```

Closes #29879 from zero323/SPARK-33002.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 19:53:59 +09:00
zero323 31a16fbb40 [SPARK-32714][PYTHON] Initial pyspark-stubs port
### What changes were proposed in this pull request?

This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.

### How was this patch tested?

- [x] MyPy tests of the PySpark source
    ```
    mypy --no-incremental --config python/mypy.ini python/pyspark
    ```
- [x] MyPy tests of Spark examples
    ```
   MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
    ```
- [x] Existing Flake8 linter

- [x] Existing unit tests

Tested against:

- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`

Closes #29591 from zero323/SPARK-32681.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:15:36 +09:00
zhengruifeng 432afac07e [SPARK-32907][ML] adaptively blockify instances - revert blockify gmm
### What changes were proposed in this pull request?
revert blockify gmm

### Why are the changes needed?
WeichenXu123  and I thought we should use memory size instead of number of rows to blockify instance; then if a buffer's size is large and determined by number of rows, we should discard it.
In GMM, we found that the pre-allocated memory maybe too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discuss and thought it is better to revert blockify GMM.

### Does this PR introduce _any_ user-facing change?
blockSize added in master branch will be removed

### How was this patch tested?
existing testsuites

Closes #29782 from zhengruifeng/unblockify_gmm.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-09-23 15:54:56 +08:00
zero323 779f0a84ea [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods
### What changes were proposed in this pull request?

This PR adjusts signatures of methods decorated with `keyword_only` to indicate  using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).

__Note__:

For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579

### Why are the changes needed?

Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that.

In practice we could probably capture `locals` and drop `keyword_only` completel, i.e:

```python
keyword_only
def __init__(self, *, featuresCol="features"):
    ...
    kwargs = self._input_kwargs
    self.setParams(**kwargs)
```

could be replaced with

```python
def __init__(self, *, featuresCol="features"):
    kwargs = locals()
    del kwargs["self"]
    ...
    self.setParams(**kwargs)
```

### Does this PR introduce _any_ user-facing change?

Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments.

For example with ` LinearSVC` will change from

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
    self,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2):
File:      /path/to/python/pyspark/ml/classification.py
Type:      function
```

to

```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__   (
    self,
    *,
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    maxIter=100,
    regParam=0.0,
    tol=1e-06,
    rawPredictionCol='rawPrediction',
    fitIntercept=True,
    standardization=True,
    threshold=0.0,
    weightCol=None,
    aggregationDepth=2,
    blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction",                  maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",                  fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,                  aggregationDepth=2, blockSize=1):
File:      ~/Workspace/spark/python/pyspark/ml/classification.py
Type:      function
```

### How was this patch tested?

Existing tests.

Closes #29799 from zero323/SPARK-32933.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:28:33 +09:00
zero323 c918909c1a [SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword
### What changes were proposed in this pull request?

Replace `__metaclass__` fields with `metaclass` keyword in the class statements.

### Why are the changes needed?

`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.

```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```

and classes, which suppose to be abstract, are not

```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```

### Does this PR introduce _any_ user-facing change?

Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix.

### How was this patch tested?

Existing tests.

Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:22:11 +09:00
Fokko Driesprong a1e459ed9f [SPARK-32719][PYTHON] Add Flake8 check missing imports
https://issues.apache.org/jira/browse/SPARK-32719

### What changes were proposed in this pull request?

Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard).

### Why are the changes needed?

To make sure that the quality of the Python code is up to standard.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit-tests and Flake8 static analysis

Closes #29563 from Fokko/fd-add-check-missing-imports.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:23:31 +09:00
Louiszr a0bd273bb0 [SPARK-32092][ML][PYSPARK][FOLLOWUP] Fixed CrossValidatorModel.copy() to copy models instead of list
### What changes were proposed in this pull request?

Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.

### Why are the changes needed?

`copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.

Closes #29553 from Louiszr/fix-cv-copy.

Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-08-28 10:15:16 -07:00