spark-instrumented-optimizer

History

Jatin Puri d2e86cb3cd [SPARK-26616][MLLIB] Expose document frequency in IDFModel ## What changes were proposed in this pull request? This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model. * The document frequency is returned as an `Array[Long]` * If the minimum document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms * numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value * Pyspark changes ## How was this patch tested? The existing test case was edited to also check for the document frequency values. I am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything Reviewer request: mengxr zjffdu yinxusen Closes #23549 from purijatin/master. Authored-by: Jatin Puri <purijatin@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>		2019-01-22 07:41:54 -06:00
..
linalg	[SPARK-26638][PYSPARK][ML] Pyspark vector classes always return error for unary negation	2019-01-17 14:24:21 -06:00
stat	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
tests	[SPARK-26646][TEST][PYSPARK] Fix flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction	2019-01-18 23:53:11 +08:00
__init__.py	[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide	2016-07-15 13:38:23 -07:00
classification.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
clustering.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
common.py	[SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch	2016-10-03 14:12:03 -07:00
evaluation.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
feature.py	[SPARK-26616][MLLIB] Expose document frequency in IDFModel	2019-01-22 07:41:54 -06:00
fpm.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
random.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
recommendation.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
regression.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
tree.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
util.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00