spark-instrumented-optimizer/examples/src/main
Maxime Rihouey e3bf37fa3a
Fix example of tf_idf with minDocFreq
## What changes were proposed in this pull request?

The Python example for tf_idf does not set up the "minDocFreq" case correctly: the same model variable is used to transform the documents both with and without the "minDocFreq" parameter.
The model built with IDF(minDocFreq=2) is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore".
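
To illustrate the mix-up without a Spark installation, here is a minimal pure-Python sketch. It is not the example file itself. The helper names `fit_idf` and `transform` are hypothetical, and the input (six documents, each holding one distinct term three times) is an assumption chosen because it is consistent with the values reported in this description. The sketch uses the formula documented for MLlib's IDF, log((m + 1) / (df + 1)), with terms below the minimum document frequency getting an IDF of 0.

```python
import math
from collections import Counter


def fit_idf(docs, min_doc_freq=0):
    """Return {term: idf} with idf = log((m + 1) / (df + 1)).

    Terms that occur in fewer than min_doc_freq documents get idf 0.0.
    """
    m = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((m + 1) / (n + 1)) if n >= min_doc_freq else 0.0
        for term, n in df.items()
    }


def transform(idf, doc):
    """Scale raw term counts of one document by the fitted idf weights."""
    tf = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


# Assumed toy corpus: six documents, one distinct term repeated three times each.
docs = [["t%d" % i] * 3 for i in range(6)]

idf = fit_idf(docs)
idf_ignore = fit_idf(docs, min_doc_freq=2)

tfidf = [transform(idf, d) for d in docs]
# The bug: the model without minDocFreq is reused, so the results are identical.
buggy_ignore = [transform(idf, d) for d in docs]
# The fix: transform with the model that was fitted with min_doc_freq=2.
tfidf_ignore = [transform(idf_ignore, d) for d in docs]

print(tfidf[0])
print(buggy_ignore[0])
print(tfidf_ignore[0])
```

Because every term in this toy corpus occurs in only one document, the corrected run zeroes every weight, while the buggy run silently reproduces the unfiltered weights.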

## How was this patch tested?

Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:
tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])
tfidfIgnore:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])

After the fix, the results differ as they should:
tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])
tfidfIgnore:
(1048576,[1046921],[0.0])
(1048576,[1046920],[0.0])
(1048576,[1046923],[0.0])
(1048576,[892732],[0.0])
(1048576,[892733],[0.0])
(1048576,[892734],[0.0])

Author: Maxime Rihouey <maxime.rihouey@gmail.com>

Closes #15503 from maximerihouey/patch-1.
2016-10-17 10:56:22 +01:00
java/org/apache/spark/examples [SPARK-17338][SQL] add global temp view 2016-10-10 15:48:57 +08:00
python Fix example of tf_idf with minDocFreq 2016-10-17 10:56:22 +01:00
r [SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc 2016-09-26 09:54:22 +01:00
resources [SPARK-3389] Add Converter for ease of Parquet reading in PySpark 2014-09-27 21:48:05 -07:00
scala/org/apache/spark/examples [SPARK-17338][SQL] add global temp view 2016-10-10 15:48:57 +08:00