7aa94ca9cb
### What changes were proposed in this pull request?
This is the first PR toward supporting feature selectors for continuously distributed features. It adds the algorithm that computes the F-value for continuous features and continuous labels. This algorithm will be used by FValueRegressionSelector.

### Why are the changes needed?
Spark currently supports selection of categorical features only, while many use cases require selection of continuous features.

I will add two new selectors:
1. FValueRegressionSelector for continuous features and continuous labels.
2. ANOVAFValueClassificationSelector for continuous features and categorical labels.

I will use subtasks to add these two selectors:
- add FValueRegressionSelector on the Scala side
  - add FValueRegressionTest, which contains the algorithm to compute the F-value
  - add FValueRegressionSelector using the above algorithm
  - add a common Selector, and make FValueRegressionSelector and ChiSqSelector extend it
- add FValueRegressionSelector on the Python side
- add samples and doc
- do the same for ANOVAFValueClassificationSelector

### Does this PR introduce any user-facing change?
Yes.
```
/**
 * @param dataset DataFrame of continuous labels and continuous features.
 * @param featuresCol Name of features column in dataset, of type `Vector` (`VectorUDT`)
 * @param labelCol Name of label column in dataset, of any numerical type
 * @return Array containing the SelectionTestResult for every feature against the label.
 */
SelectionTest.fValueRegressionTest(dataset: Dataset[_], featuresCol: String, labelCol: String)
```

### How was this patch tested?
Added unit tests.

Closes #27623 from huaxingao/spark-30867.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

| | |
---|---|---|
.. | ||
benchmarks | ||
src | ||
pom.xml |
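
The commit message above refers to an algorithm that computes the F-value of a continuous feature against a continuous label, but the listing does not show it. The following is a minimal sketch of the standard univariate linear-regression F-test, which is assumed here to be the statistic meant: take the Pearson correlation r between feature and label, then F = r² / (1 − r²) × (n − 2). The object name `FValueSketch` and its methods are hypothetical and are not part of Spark; the sketch works on plain arrays rather than a `Dataset` and does not compute a p-value.

```scala
// Hypothetical, Spark-free sketch of a univariate regression F-test.
// Assumption: the F-value is derived from the Pearson correlation
// between one feature column and the label column.
object FValueSketch {

  /** Pearson correlation between two equally sized samples. */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 2,
      "need equal-length samples with more than two observations")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var sxy = 0.0 // sum of cross products of deviations
    var sxx = 0.0 // sum of squared deviations of x
    var syy = 0.0 // sum of squared deviations of y
    var i = 0
    while (i < n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      sxy += dx * dy
      sxx += dx * dx
      syy += dy * dy
      i += 1
    }
    sxy / math.sqrt(sxx * syy)
  }

  /** F statistic with (1, n - 2) degrees of freedom. */
  def fValue(feature: Array[Double], label: Array[Double]): Double = {
    val r = pearson(feature, label)
    val degreesOfFreedom = feature.length - 2
    r * r / (1.0 - r * r) * degreesOfFreedom
  }

  def main(args: Array[String]): Unit = {
    val label = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val informative = Array(1.1, 1.9, 3.2, 3.9, 5.1) // tracks the label closely
    val noisy = Array(3.0, 1.0, 4.0, 1.0, 5.0)       // weakly related to the label
    println(f"informative feature: F = ${fValue(informative, label)}%.2f")
    println(f"noisy feature:       F = ${fValue(noisy, label)}%.2f")
  }
}
```

A selector built on such a test would presumably rank features by F-value (or by the corresponding p-value) and keep the highest-scoring ones; a feature that is perfectly correlated with the label makes the denominator zero, so a production implementation would need to handle that case explicitly.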