spark-instrumented-optimizer/python/pyspark/ml
Peng c8b612decb
[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
## What changes were proposed in this pull request?

The feature selection method ChiSquareSelector ranks features by ChiSquareTestResult.statistic (the chi-squared value) and selects the features with the largest values. However, the degrees of freedom (df) of the chi-squared values returned by Statistics.chiSqTest(RDD) can differ between features, and chi-squared values with different df are not directly comparable, so the statistic cannot be used to rank features.

So we change the selection criterion from statistic to pValue for SelectKBest and SelectPercentile.
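
To illustrate why the raw statistic can mislead when df differs, here is a minimal standalone sketch (not Spark code and not part of this patch). The helper `chi2_sf_even_df` and the two example features are hypothetical; the helper uses the closed-form chi-squared survival function, which holds only for even df, so the example needs nothing beyond the standard library.

```python
import math


def chi2_sf_even_df(statistic, df):
    """P(X >= statistic) for X ~ chi-squared(df); hypothetical helper, even df only.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Made-up (statistic, df) pairs for two features.
features = {"a": (10.0, 2), "b": (12.0, 8)}

# Ranking by the largest statistic picks "b" ...
by_statistic = max(features, key=lambda name: features[name][0])
# ... while ranking by the smallest p-value picks "a".
by_p_value = min(features, key=lambda name: chi2_sf_even_df(*features[name]))

print(by_statistic, by_p_value)
```

Feature "b" has the larger statistic, but with 8 degrees of freedom that value is much less surprising than a statistic of 10 at 2 degrees of freedom, so the two criteria disagree.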

## How was this patch tested?
Changed the existing test.

Author: Peng <peng.meng@intel.com>

Closes #15444 from mpjlu/chisqure-bug.
2016-10-14 12:48:57 +01:00
..
linalg [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract 2016-10-03 17:57:54 -07:00
param [SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at most one 0 2016-09-24 08:15:55 +01:00
__init__.py [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide 2016-07-15 13:38:23 -07:00
base.py [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python 2016-06-13 19:59:53 -07:00
classification.py [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter 2016-10-12 19:52:57 -07:00
clustering.py [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2. 2016-09-11 13:47:13 +01:00
common.py [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch 2016-10-03 14:12:03 -07:00
evaluation.py [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load 2016-10-14 04:17:03 -07:00
feature.py [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference 2016-10-14 12:48:57 +01:00
pipeline.py [SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used without stages 2016-08-19 23:46:36 -07:00
recommendation.py [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None 2016-06-21 11:43:25 -07:00
regression.py [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression 2016-09-22 04:35:54 -07:00
tests.py [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel. 2016-10-13 19:44:24 -07:00
tuning.py [SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics 2016-08-03 04:18:28 -07:00
util.py [SPARK-15113][PYSPARK][ML] Add missing num features num classes 2016-08-22 12:21:22 +02:00
wrapper.py [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python 2016-06-13 19:59:53 -07:00