spark-instrumented-optimizer

Jatin Puri d2e86cb3cd [SPARK-26616][MLLIB] Expose document frequency in IDFModel	2019-01-22 07:41:54 -06:00

## What changes were proposed in this pull request?

This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.

* The document frequency is returned as an `Array[Long]`
* If the minimum document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value
* Pyspark changes

## How was this patch tested?

The existing test case was edited to also check for the document frequency values.

I am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything

Reviewer request: mengxr zjffdu yinxusen

Closes #23549 from purijatin/master.

Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
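
The behaviour described in the commit message above can be illustrated with a small, dependency-free sketch. This is not the Spark implementation: the function name `doc_freq_and_idf`, its parameters, and the dense-list input format are hypothetical and chosen only for illustration. It assumes the default formula quoted in the message, log((1+m)/(1+df)), and the stated rule that terms whose count is below `minDocFreq` get a df of 0 (with their idf also set to 0 here, which is an assumption).

```python
import math
from typing import List, Sequence, Tuple


def doc_freq_and_idf(
    docs: Sequence[Sequence[float]], min_doc_freq: int = 0
) -> Tuple[List[int], int, List[float]]:
    """Hypothetical helper: return (df, m, idf) for dense term-count rows.

    df[j]  = number of documents in which term j occurs (value > 0)
    m      = number of documents
    idf[j] = log((1 + m) / (1 + df[j])), the default formula from the commit
    """
    m = len(docs)
    num_terms = len(docs[0]) if docs else 0

    # Count, per term, how many documents contain it.
    df = [0] * num_terms
    for doc in docs:
        for j, value in enumerate(doc):
            if value > 0:
                df[j] += 1

    idf = [0.0] * num_terms
    for j in range(num_terms):
        if df[j] >= min_doc_freq:
            idf[j] = math.log((1.0 + m) / (1.0 + df[j]))
        else:
            # Assumption based on the commit text: terms below the
            # threshold report df = 0; idf is left at 0.0 for them.
            df[j] = 0

    return df, m, idf


if __name__ == "__main__":
    rows = [[1, 0, 2], [0, 0, 3], [4, 0, 5]]
    print(doc_freq_and_idf(rows, min_doc_freq=2))
```

Returning `m` and `df` alongside the idf values mirrors the motivation given in the message: a caller could compute a custom weighting from the two counts instead of relying on the default formula.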
build.properties	[SPARK-26317][BUILD] Upgrade SBT to 0.13.18	2018-12-10 12:04:44 -08:00
MimaBuild.scala	[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0	2018-11-14 16:22:23 -08:00
MimaExcludes.scala	[SPARK-26616][MLLIB] Expose document frequency in IDFModel	2019-01-22 07:41:54 -06:00
plugins.sbt	[SPARK-26124][BUILD] Update plugins to latest versions	2018-11-20 18:05:39 -06:00
SparkBuild.scala	[SPARK-26306][TEST][BUILD] More memory to de-flake SorterSuite	2019-01-04 15:35:23 -06:00