Commit graph

142 commits

Author SHA1 Message Date
zhengruifeng 9dacdd38b3 [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
### What changes were proposed in this pull request?
Binarizer support multi-column by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`

### Why are the changes needed?
similar algs in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`

### Does this PR introduce any user-facing change?
yes, add setter/getter of `thresholds`/`inputCols`/`outputCols`

### How was this patch tested?
added suites

Closes #26064 from zhengruifeng/binarizer_multicols.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-16 18:32:07 +08:00
Huaxin Gao 2399134456 [SPARK-29143][PYTHON][ML] Pyspark feature models support column setters/getters
### What changes were proposed in this pull request?
add column setters/getters support in Pyspark feature models

### Why are the changes needed?
keep parity between Pyspark and Scala

### Does this PR introduce any user-facing change?
Yes.
After the change, Pyspark feature models have column setters/getters support.

### How was this patch tested?
Add some doctests

Closes #25908 from huaxingao/spark-29143.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-07 10:55:48 -05:00
Huaxin Gao db9e0fda6b [SPARK-22796][PYTHON][ML] Add multiple columns support to PySpark QuantileDiscretizer
### What changes were proposed in this pull request?
Add multiple columns support to PySpark QuantileDiscretizer

### Why are the changes needed?
Multiple columns support for  QuantileDiscretizer was in scala side a while ago. We need to add multiple columns support to python too.

### Does this PR introduce any user-facing change?
Yes. New Python is added

### How was this patch tested?
Add doctest

Closes #25812 from huaxingao/spark-22796.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-09-18 12:16:06 -07:00
zhengruifeng 4d27a25908 [SPARK-22797][ML][PYTHON] Bucketizer support multi-column
### What changes were proposed in this pull request?
Bucketizer support multi-column in the python side

### Why are the changes needed?
Bucketizer should support multi-column like the scala side.

### Does this PR introduce any user-facing change?
yes, this PR add new Python API

### How was this patch tested?
added testsuites

Closes #25801 from zhengruifeng/20542_py.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-17 11:52:20 +08:00
Sean Owen eb037a8180 [SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations
### What changes were proposed in this pull request?

The Experimental and Evolving annotations are both (like Unstable) used to express that a an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.

The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)

It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.

### Why are the changes needed?

Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25558 from srowen/SPARK-28855.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-01 10:15:00 -05:00
Huaxin Gao 660423d717 [SPARK-23469][ML] HashingTF should use corrected MurmurHash3 implementation
## What changes were proposed in this pull request?

Update HashingTF to use new implementation of MurmurHash3
Make HashingTF use the old MurmurHash3 when a model from pre 3.0 is loaded

## How was this patch tested?

Change existing unit tests. Also add one unit test to make sure HashingTF use the old MurmurHash3 when a model from pre 3.0 is loaded

Closes #25303 from huaxingao/spark-23469.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-08-02 10:53:36 -05:00
zhengruifeng 44c28d7515 [SPARK-28399][ML][PYTHON] implement RobustScaler
## What changes were proposed in this pull request?
Implement `RobustScaler`
Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers.

## How was this patch tested?
existing and added tests

Closes #25160 from zhengruifeng/robust_scaler.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-30 10:24:33 -05:00
Huaxin Gao 70f82fd298 [SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF
## What changes were proposed in this pull request?

Add indexOf method for ml.feature.HashingTF.

## How was this patch tested?

Add Unit test.

Closes #25250 from huaxingao/spark-21481.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-28 08:32:43 -05:00
HyukjinKwon fe75ff8bea [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation
## What changes were proposed in this pull request?

Seems like we used to generate PySpark API documentation by Epydoc almost at the very first place (see 85b8f2c64f).

This fixes an actual issue:

Before:

![Screen Shot 2019-07-05 at 8 20 01 PM](https://user-images.githubusercontent.com/6477701/60720491-e9879180-9f65-11e9-9562-100830a456cd.png)

After:

![Screen Shot 2019-07-05 at 8 20 05 PM](https://user-images.githubusercontent.com/6477701/60720495-ec828200-9f65-11e9-8277-8f689e292cb0.png)

It seems apparently a bug within `epytext` plugin during the conversion between`param` and `:param` syntax. See also [Epydoc syntax](http://epydoc.sourceforge.net/manual-epytext.html).

Actually, Epydoc syntax violates [PEP-257](https://www.python.org/dev/peps/pep-0257/) IIRC and blocks us to enable some rules for doctest linter as well.

We should remove this legacy away and I guess Spark 3 is good timing to do it.

## How was this patch tested?

Manually built the doc and check each.

I had to manually find the Epydoc syntax by `git grep -r "{L"`, for instance.

Closes #25060 from HyukjinKwon/SPARK-28206.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-07-05 10:08:22 -07:00
ozan a38d605d0d [SPARK-18570][ML][R] RFormula support * and ^ operators
## What changes were proposed in this pull request?

Added support for `*` and `^` operators, along with expressions within parentheses. New operators just expand to already supported terms, such as;

 - y ~ a * b = y ~ a + b + a : b
 - y ~ (a+b+c)^3 = y ~ a + b + c + a : b + a : c + a :b : c

## How was this patch tested?

Added new unit tests to RFormulaParserSuite

mengxr yanboliang

Closes #24764 from ozancicek/rformula.

Authored-by: ozan <ozancancicekci@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-06-04 08:59:30 -05:00
Andrew-Crosby 5bf5d9d854 [SPARK-26970][PYTHON][ML] Add Spark ML interaction transformer to PySpark
## What changes were proposed in this pull request?

Adds the Spark ML Interaction transformer to PySpark

## How was this patch tested?

- Added Python doctest
- Ran the newly added example code
- Manually confirmed that a PipelineModel that contains an Interaction transformer can now be loaded in PySpark

Closes #24426 from Andrew-Crosby/pyspark-interaction-transformer.

Lead-authored-by: Andrew-Crosby <37139900+Andrew-Crosby@users.noreply.github.com>
Co-authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-04-23 13:53:33 -07:00
Maxim Gekk 027ed2d11b [SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter
## What changes were proposed in this pull request?

The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs.

## How was this patch tested?

By running the existing tests - XORShiftRandomSuite

Closes #20793 from MaxGekk/hash-buff-size.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-23 11:26:09 -05:00
Brooke Wenig 340c8b8387 [MINOR][DOC] Updated PySpark Binarizer docstring to match Scala's.
## What changes were proposed in this pull request?

PySpark's Binarizer docstring had two issues:
1) The values did not need to be in the range [0, 1].
2) It can be used for binary classification prediction.

This change corrects both of these issues by making it consistent with Scala's docstring for Binarizer.

## How was this patch tested?

Not applicable because I only changed the docstring. But if I need to do any testing, let me know and I'll do it.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #23934 from brookewenig/binarizer-docs-fix.

Authored-by: Brooke Wenig <brookewenig@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-06 19:42:41 -06:00
Liang-Chi Hsieh 91caf0bfce [DOCS] MINOR Complement the document of stringOrderType for StringIndexer in PySpark
## What changes were proposed in this pull request?

We revised the behavior of the param `stringOrderType` of `StringIndexer` in case of equal frequency when under frequencyDesc/Asc. This isn't reflected in PySpark's document. We should do it.

## How was this patch tested?

Only document change.

Closes #23849 from viirya/py-stringindexer-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
2019-02-21 08:36:48 -08:00
Huaxin Gao 74e9e1c192 [SPARK-22798][PYTHON][ML] Add multiple column support to PySpark StringIndexer
## What changes were proposed in this pull request?

Add multiple column support to PySpark StringIndexer

## How was this patch tested?

Add doctest

Closes #23741 from huaxingao/spark-22798.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-20 08:52:46 -06:00
Jatin Puri d2e86cb3cd [SPARK-26616][MLLIB] Expose document frequency in IDFModel
## What changes were proposed in this pull request?

This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.

* The document frequency is returned as an `Array[Long]`
* If the minimum  document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value
* Pyspark changes

## How was this patch tested?

The existing test case was edited to also check for the document frequency values.

I  am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything

Reviewer request: mengxr  zjffdu yinxusen

Closes #23549 from purijatin/master.

Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 07:41:54 -06:00
Jing Chen He 860f4497f2 [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
## What changes were proposed in this pull request?

If the input parameter 'threshold' to the function approxSimilarityJoin is not a float, we would get an exception.  The fix is to convert the 'threshold' into a float before calling the java implementation method.

## How was this patch tested?

Added a new test case.  Without this fix, the test will throw an exception as reported in the JIRA. With the fix, the test passes.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #23313 from jerryjch/SPARK-26315.

Authored-by: Jing Chen He <jinghe@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-15 08:41:16 -06:00
Liang-Chi Hsieh 169d9ad8f1 [SPARK-26133][ML][FOLLOWUP] Fix doc for OneHotEncoder
## What changes were proposed in this pull request?

This fixes doc of renamed OneHotEncoder in PySpark.

## How was this patch tested?

N/A

Closes #23230 from viirya/remove_one_hot_encoder_followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2018-12-05 19:30:25 +08:00
Liang-Chi Hsieh 8bfea86b1c
[SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
## What changes were proposed in this pull request?

We have deprecated `OneHotEncoder` at Spark 2.3.0 and introduced `OneHotEncoderEstimator`. At 3.0.0, we remove deprecated `OneHotEncoder` and rename `OneHotEncoderEstimator` to `OneHotEncoder`.

TODO: According to ML migration guide, we need to keep `OneHotEncoderEstimator` as an alias after renaming. This is not done at this patch in order to facilitate review.

## How was this patch tested?

Existing tests.

Closes #23100 from viirya/remove_one_hot_encoder.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2018-11-29 01:54:06 +00:00
Sean Owen 510ec77a60 [SPARK-19714][DOCS] Clarify Bucketizer handling of invalid input
## What changes were proposed in this pull request?

Clarify Bucketizer handleInvalid docs. Just a resubmit of https://github.com/apache/spark/pull/17169

## How was this patch tested?

N/A

Closes #23003 from srowen/SPARK-19714.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-11 09:21:40 -06:00
Sean Owen 08c76b5d39 [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)

## What changes were proposed in this pull request?

Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.

## How was this patch tested?

Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.

Closes #22400 from srowen/SPARK-25238.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-13 11:19:43 +08:00
Huaxin Gao b5e1188087 [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values
## What changes were proposed in this pull request?

In feature.py, VectorSizeHint setSize and getSize don't return value. Add return.

## How was this patch tested?

I tested the changes on my local.

Closes #22136 from huaxingao/spark-25124.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2018-08-23 16:17:27 -07:00
Sean Owen 8aceb961c3 [SPARK-24754][ML] Minhash integer overflow
## What changes were proposed in this pull request?

Use longs in calculating min hash to avoid bias due to int overflow.

## How was this patch tested?

Existing tests.

Author: Sean Owen <srowen@gmail.com>

Closes #21750 from srowen/SPARK-24754.
2018-07-14 15:59:17 -05:00
Lee Dongjin 5d6a53d983 [SPARK-15064][ML] Locale support in StopWordsRemover
## What changes were proposed in this pull request?

Add locale support for `StopWordsRemover`.

## How was this patch tested?

[Scala|Python] unit tests.

Author: Lee Dongjin <dongjin@apache.org>

Closes #21501 from dongjinleekr/feature/SPARK-15064.
2018-06-12 08:16:37 -07:00
Huaxin Gao 4f1e8b9bb7 [SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid
## What changes were proposed in this pull request?

add python api for VectorAssembler handleInvalid

## How was this patch tested?

Add doctest

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21003 from huaxingao/spark-23871.
2018-04-10 15:41:45 -07:00
Huaxin Gao e998250588 [SPARK-23828][ML][PYTHON] PySpark StringIndexerModel should have constructor from labels
## What changes were proposed in this pull request?

The Scala StringIndexerModel has an alternate constructor that will create the model from an array of label strings.  Add the corresponding Python API:

model = StringIndexerModel.from_labels(["a", "b", "c"])

## How was this patch tested?

Add doctest and unit test.

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #20968 from huaxingao/spark-23828.
2018-04-06 11:51:36 -07:00
Huaxin Gao a33655348c [SPARK-23615][ML][PYSPARK] Add maxDF Parameter to Python CountVectorizer
## What changes were proposed in this pull request?

The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.

## How was this patch tested?

add test

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #20777 from huaxingao/spark-23615.
2018-03-23 15:58:48 -07:00
Bryan Cutler 8a72734f33 [SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list
## What changes were proposed in this pull request?

Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version.  Introduced a common param base class `_CountVectorizerParams` to allow the Python model to also own the parameters.  This now matches the Scala class hierarchy.

## How was this patch tested?

Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.
2018-03-16 11:42:57 -07:00
Benjamin Peterson 7013eea11c [SPARK-23522][PYTHON] always use sys.exit over builtin exit
The exit() builtin is only for interactive use. applications should use sys.exit().

## What changes were proposed in this pull request?

All usage of the builtin `exit()` function is replaced by `sys.exit()`.

## How was this patch tested?

I ran `python/run-tests`.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Benjamin Peterson <benjamin@python.org>

Closes #20682 from benjaminp/sys-exit.
2018-03-08 20:38:34 +09:00
Shintaro Murakami d5ed2108d3 [SPARK-23381][CORE] Murmur3 hash generates a different value from other implementations
## What changes were proposed in this pull request?
Murmur3 hash generates a different value from the original and other implementations (like Scala standard library and Guava or so) when the length of a bytes array is not multiple of 4.

## How was this patch tested?
Added a unit test.

**Note: When we merge this PR, please give all the credits to Shintaro Murakami.**

Author: Shintaro Murakami <mrkm4ntrgmail.com>

Author: gatorsmile <gatorsmile@gmail.com>
Author: Shintaro Murakami <mrkm4ntr@gmail.com>

Closes #20630 from gatorsmile/pr-20568.
2018-02-16 17:17:55 -08:00
Nick Pentreath a8a3e9b7cf Revert "[SPARK-22797][PYSPARK] Bucketizer support multi-column"
This reverts commit c22eaa94e8.
2018-01-26 23:48:02 +02:00
Zheng RuiFeng c22eaa94e8 [SPARK-22797][PYSPARK] Bucketizer support multi-column
## What changes were proposed in this pull request?
Bucketizer support multi-column in the python side

## How was this patch tested?
existing tests and added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19892 from zhengruifeng/20542_py.
2018-01-26 12:28:27 +02:00
Bryan Cutler 39ee2acf96 [SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala
## What changes were proposed in this pull request?

This syncs the ML Python API with Scala for differences found after the 2.3 QA audit.

## How was this patch tested?

NA

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20354 from BryanCutler/pyspark-ml-doc-sync-23163.
2018-01-25 01:48:11 -08:00
WeichenXu a7d98d53ce [SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated
## What changes were proposed in this pull request?

mark OneHotEncoder python API deprecated

## How was this patch tested?

N/A

Author: WeichenXu <weichen.xu@databricks.com>

Closes #20241 from WeichenXu123/mark_ohe_deprecated.
2018-01-12 11:27:02 +02:00
WeichenXu b5042d75c2 [SPARK-23008][ML] OnehotEncoderEstimator python API
## What changes were proposed in this pull request?

OnehotEncoderEstimator python API.

## How was this patch tested?

doctest

Author: WeichenXu <weichen.xu@databricks.com>

Closes #20209 from WeichenXu123/ohe_py.
2018-01-11 16:20:30 -08:00
Nick Pentreath 028ee40165 [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat numeric columns as categorical
Previously, `FeatureHasher` always treats numeric type columns as numbers and never as categorical features. It is quite common to have categorical features represented as numbers or codes in data sources.

In order to hash these features as categorical, users must first explicitly convert them to strings which is cumbersome.

Add a new param `categoricalCols` which specifies the numeric columns that should be treated as categorical features.

## How was this patch tested?

New unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #19991 from MLnick/hasher-num-cat.
2017-12-31 14:51:38 +02:00
Bago Amirbekian 816963043a [SPARK-22734][ML][PYSPARK] Added Python API for VectorSizeHint.
(Please fill in changes proposed in this fix)

Python API for VectorSizeHint Transformer.

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

doc-tests.

Author: Bago Amirbekian <bago@databricks.com>

Closes #20112 from MrBago/vectorSizeHint-PythonAPI.
2017-12-29 19:45:14 -08:00
WeichenXu 2d868d9398 [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API
## What changes were proposed in this pull request?

Add python api for VectorIndexerModel support handle unseen categories via handleInvalid.

## How was this patch tested?

doctest added.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19753 from WeichenXu123/vector_indexer_invalid_py.
2017-11-21 10:53:53 -08:00
Xin Ren 31c74fec24 [SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API
https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test
`./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`

Author: Xin Ren <iamshrek@126.com>
Author: Xin Ren <renxin.ubc@gmail.com>
Author: Xin Ren <keypointt@users.noreply.github.com>

Closes #17451 from keypointt/SPARK-19866.
2017-09-08 12:09:00 -07:00
Nick Pentreath 988b84d7ed [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher
Add Python API for `FeatureHasher` transformer.

## How was this patch tested?

New doc test.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
2017-08-21 14:35:38 +02:00
Yanbo Liang 69e5282d3c [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.

## How was this patch tested?
Add test cases.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18613 from yanboliang/spark-20307.
2017-07-15 20:56:38 +08:00
Zheng RuiFeng d2d2a5de18 [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
## What changes were proposed in this pull request?
1, HasHandleInvaild support override
2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid

## How was this patch tested?
existing tests

[JIRA](https://issues.apache.org/jira/browse/SPARK-18619)

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.
2017-07-12 22:09:03 +08:00
Yanbo Liang c19680be1c [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data
## What changes were proposed in this pull request?
This PR is to maintain API parity with changes made in SPARK-17498 to support a new option
'keep' in StringIndexer to handle unseen labels or NULL values with PySpark.

Note: This is updated version of #17237 , the primary author of this PR is VinceShieh .
## How was this patch tested?
Unit tests.

Author: VinceShieh <vincent.xie@intel.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18453 from yanboliang/spark-19852.
2017-07-02 16:17:03 +08:00
actuaryzhang ff5676b01f [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula
## What changes were proposed in this pull request?

PySpark supports stringIndexerOrderType in RFormula as in #17967.

## How was this patch tested?
docstring test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18122 from actuaryzhang/PythonRFormula.
2017-05-31 01:02:19 +08:00
Wayne Zhang 0f2f56c37b [SPARK-20736][PYTHON] PySpark StringIndexer supports StringOrderType
## What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in #17879.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17978 from actuaryzhang/PythonStringIndexer.
2017-05-21 16:51:55 -07:00
Nick Pentreath d9f4ce6943 [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark
Add Python wrapper for `Imputer` feature transformer.

## How was this patch tested?

New doc tests and tweak to PySpark ML `tests.py`

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
2017-03-24 08:01:15 -07:00
Bryan Cutler 44281ca81d [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
## What changes were proposed in this pull request?
The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.

This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.

## How was this patch tested?
Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
2017-03-03 16:43:45 -08:00
Mark Grover d2a879762a [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast
## What changes were proposed in this pull request?
Updates the doc string to match up with the code
i.e. say dropLast instead of includeFirst

## How was this patch tested?
Not much, since it's a doc-like change. Will run unit tests via Jenkins job.

Author: Mark Grover <mark@apache.org>

Closes #17127 from markgrover/spark_19734.
2017-03-01 22:57:34 -08:00
Yun Ni 3bd8ddf7c3 [MINOR][ML] Fix comments in LSH Examples and Python API
## What changes were proposed in this pull request?
Remove `org.apache.spark.examples.` in
Add slash in one of the python doc.

## How was this patch tested?
Run examples using the commands in the comments.

Author: Yun Ni <yunn@uber.com>

Closes #17104 from Yunni/yunn_minor.
2017-03-01 22:55:13 -08:00
Yun Ni 08c1972a06 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <yunn@uber.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Yunni <Euler57721@gmail.com>

Closes #16715 from Yunni/spark-18080.
2017-02-15 16:26:05 -08:00