spark-instrumented-optimizer/mllib
Yacine Mazari c40fda9e4c [SPARK-23166][ML] Add maxDF Parameter to CountVectorizer
## What changes were proposed in this pull request?
Currently, the CountVectorizer has a minDF parameter.

It might be useful to also have a maxDF parameter.
It will be used as a threshold for filtering all the terms that occur very frequently in a text corpus, because they are not very informative or could even be stop-words.

This is analogous to scikit-learn, CountVectorizer, max_df.

Other changes:
- Refactored code to invoke "filter()" conditioned on maxDF or minDF set.
- Refactored code to unpersist input after counting is done.

## How was this patch tested?
Unit tests.

Author: Yacine Mazari <y.mazari@gmail.com>

Closes #20367 from ymazari/SPARK-23166.
2018-01-28 10:27:59 -06:00
..
src [SPARK-23166][ML] Add maxDF Parameter to CountVectorizer 2018-01-28 10:27:59 -06:00
pom.xml [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00