spark-instrumented-optimizer


src	[SPARK-23166][ML] Add maxDF Parameter to CountVectorizer	2018-01-28 10:27:59 -06:00
pom.xml	[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT	2018-01-13 00:37:59 +08:00

Yacine Mazari c40fda9e4c [SPARK-23166][ML] Add maxDF Parameter to CountVectorizer

## What changes were proposed in this pull request?
Currently, the CountVectorizer has a minDF parameter.

It might be useful to also have a maxDF parameter.
It will be used as a threshold for filtering all the terms that occur very frequently in a text corpus, because they are not very informative or could even be stop-words.

This is analogous to the max_df parameter of scikit-learn's CountVectorizer.
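
As a rough illustration of the idea, here is a minimal pure-Python sketch of document-frequency filtering with both a lower and an upper bound. It is not the Spark implementation: the names `resolve_threshold` and `build_vocabulary` are hypothetical, and the threshold convention (values of 1 or more are absolute document counts, values below 1 are fractions of the corpus) is an assumption for this example.

```python
from collections import Counter


def resolve_threshold(value, num_docs):
    """Turn a threshold into an absolute document count.

    Assumption for this sketch: values >= 1 are absolute counts,
    values < 1 are fractions of the number of documents.
    """
    return value if value >= 1.0 else value * num_docs


def build_vocabulary(docs, min_df=1.0, max_df=None):
    """Return the sorted terms whose document frequency lies in [min_df, max_df].

    `docs` is a list of token lists. When `max_df` is None, no upper
    bound is applied.
    """
    num_docs = len(docs)

    # Document frequency: the number of documents containing each term,
    # so every term is counted at most once per document.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    lower = resolve_threshold(min_df, num_docs)
    upper = (
        resolve_threshold(max_df, num_docs)
        if max_df is not None
        else float("inf")
    )

    # Keep only terms that are neither too rare nor too common.
    return sorted(t for t, df in doc_freq.items() if lower <= df <= upper)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]

# "the" appears in all 4 documents and exceeds the 75% upper bound;
# "dog" and "bird" appear in only 1 document and fall below min_df=2.
vocab = build_vocabulary(docs, min_df=2, max_df=0.75)
print(vocab)
```

With these toy documents, the upper bound drops the term that occurs in every document, which is the stop-word-like case the description has in mind.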

Other changes:
- Refactored code to invoke "filter()" conditioned on maxDF or minDF set.
- Refactored code to unpersist input after counting is done.

## How was this patch tested?
Unit tests.

Author: Yacine Mazari <y.mazari@gmail.com>

Closes #20367 from ymazari/SPARK-23166.

2018-01-28 10:27:59 -06:00
