Thare are some inconsistent spellings 'MLlib' and 'MLLib' in some documents and source codes.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2903 from sarutak/SPARK-4055 and squashes the following commits:
b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"
This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
This PR makes the following changes:
* Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
* Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatibility with the original IDF API.
* Sets the IDFs to 0 for terms which DFs are less than minimumOccurence
* Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
* Updated the MLLib Feature Extraction programming guide to describe the new feature
Author: RJ Nowling <rnowling@gmail.com>
Closes#2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
0aa3c63 [RJ Nowling] Fix identation
e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
bfa82ec [RJ Nowling] Add space after if
30d20b3 [RJ Nowling] Add spaces around equals signs
9013447 [RJ Nowling] Add space before division operator
79978fc [RJ Nowling] Remove unnecessary semi-colon
40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
1801fd2 [RJ Nowling] Fix style errors in IDF.scala
6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
a200bab [RJ Nowling] Remove unnecessary else statement
4b974f5 [RJ Nowling] Remove accidentally-added import from testing
c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF
Author: RJ Nowling <rnowling@gmail.com>
Closes#2459 from rnowling/tfidf-fix and squashes the following commits:
b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#2068 from dbtsai/transformer-documentation and squashes the following commits:
109f324 [DB Tsai] address feedback
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes#2061 from mengxr/tfidf-doc and squashes the following commits:
ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
Author: Ameet Talwalkar <atalwalkar@gmail.com>
Closes#1908 from atalwalkar/master and squashes the following commits:
fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation