spark-instrumented-optimizer/project
Jatin Puri d2e86cb3cd [SPARK-26616][MLLIB] Expose document frequency in IDFModel
## What changes were proposed in this pull request?

This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.

* The document frequency is returned as an `Array[Long]`
* If the minimum  document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value
* Pyspark changes

## How was this patch tested?

The existing test case was edited to also check for the document frequency values.

I  am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything

Reviewer request: mengxr  zjffdu yinxusen

Closes #23549 from purijatin/master.

Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 07:41:54 -06:00
..
build.properties [SPARK-26317][BUILD] Upgrade SBT to 0.13.18 2018-12-10 12:04:44 -08:00
MimaBuild.scala [SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 2018-11-14 16:22:23 -08:00
MimaExcludes.scala [SPARK-26616][MLLIB] Expose document frequency in IDFModel 2019-01-22 07:41:54 -06:00
plugins.sbt [SPARK-26124][BUILD] Update plugins to latest versions 2018-11-20 18:05:39 -06:00
SparkBuild.scala [SPARK-26306][TEST][BUILD] More memory to de-flake SorterSuite 2019-01-04 15:35:23 -06:00