spark-instrumented-optimizer/python/pyspark
Peng c8b612decb
[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
## What changes were proposed in this pull request?

For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.

So we change statistic to pValue for SelectKBest and SelectPercentile

## How was this patch tested?
change existing test

Author: Peng <peng.meng@intel.com>

Closes #15444 from mpjlu/chisqure-bug.
2016-10-14 12:48:57 +01:00
..
ml [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference 2016-10-14 12:48:57 +01:00
mllib [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference 2016-10-14 12:48:57 +01:00
sql [SPARK-17731][SQL][STREAMING] Metrics for structured streaming 2016-10-13 13:36:26 -07:00
streaming [SPARK-16950] [PYSPARK] fromOffsets parameter support in KafkaUtils.createDirectStream for python3 2016-08-09 09:44:43 -07:00
__init__.py [SPARK-14555] First cut of Python API for Structured Streaming 2016-04-20 10:32:01 -07:00
accumulators.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
broadcast.py [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python 2016-09-14 13:37:35 -07:00
cloudpickle.py [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python 2016-09-14 13:37:35 -07:00
conf.py [SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf 2016-10-11 14:56:26 -07:00
context.py [SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf 2016-10-11 14:56:26 -07:00
daemon.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
heapq3.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
java_gateway.py [SPARK-17387][PYSPARK] Creating SparkContext() from python without spark-submit ignores user conf 2016-10-11 14:56:26 -07:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
rdd.py [SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes 2016-10-11 11:43:24 -07:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-10542] [PYSPARK] fix serialize namedtuple 2015-09-14 19:46:34 -07:00
shell.py [SPARK-16536][SQL][PYSPARK][MINOR] Expose sql in PySpark Shell 2016-07-13 22:24:26 -07:00
shuffle.py [SPARK-10710] Remove ability to disable spilling in core and SQL 2015-09-19 21:40:21 -07:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api 2016-04-12 23:06:55 -07:00
tests.py [SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes 2016-10-11 11:43:24 -07:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
worker.py [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch 2016-03-31 16:40:20 -07:00