194ac3be8b

### What changes were proposed in this pull request?
Add docs and examples for ANOVASelector and FValueSelector

### Why are the changes needed?
Complete the implementation of ANOVASelector and FValueSelector

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Manually build and check

Closes #28524 from huaxingao/examples.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

---
layout: global
title: Basic Statistics
displayTitle: Basic Statistics
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## Correlation

Calculating the correlation between two series of data is a common operation in statistics. In `spark.ml`
we provide the flexibility to calculate pairwise correlations among many series. The supported
correlation methods are currently Pearson's and Spearman's correlation.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`Correlation`](api/scala/org/apache/spark/ml/stat/Correlation$.html)
computes the correlation matrix for the input Dataset of Vectors using the specified method.
The output will be a DataFrame that contains the correlation matrix of the column of vectors.

{% include_example scala/org/apache/spark/examples/ml/CorrelationExample.scala %}
</div>

<div data-lang="java" markdown="1">
[`Correlation`](api/java/org/apache/spark/ml/stat/Correlation.html)
computes the correlation matrix for the input Dataset of Vectors using the specified method.
The output will be a DataFrame that contains the correlation matrix of the column of vectors.

{% include_example java/org/apache/spark/examples/ml/JavaCorrelationExample.java %}
</div>

<div data-lang="python" markdown="1">
[`Correlation`](api/python/pyspark.ml.html#pyspark.ml.stat.Correlation$)
computes the correlation matrix for the input Dataset of Vectors using the specified method.
The output will be a DataFrame that contains the correlation matrix of the column of vectors.

{% include_example python/ml/correlation_example.py %}
</div>

</div>
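
To make the two methods concrete, the following is a minimal pure-Python sketch of what a pairwise
correlation matrix contains. It does not use Spark and is not how `Correlation` is implemented; the
helper names `pearson`, `ranks`, `spearman` and `correlation_matrix` are illustrative assumptions.
Spearman's correlation is computed here as Pearson's correlation of the average ranks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant series has zero variance and raises ZeroDivisionError here.
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranked series."""
    return pearson(ranks(xs), ranks(ys))


def correlation_matrix(rows, corr=pearson):
    """Pairwise correlations between the columns of a list of equal-length rows."""
    columns = list(zip(*rows))
    return [[corr(a, b) for b in columns] for a in columns]


# Hypothetical data: each row plays the role of one feature vector.
rows = [(1.0, 2.0, 10.0), (2.0, 4.0, 8.0), (3.0, 6.0, 7.0), (4.0, 8.0, 1.0)]
pearson_matrix = correlation_matrix(rows, pearson)
spearman_matrix = correlation_matrix(rows, spearman)
```

In this data the third column decreases monotonically but not linearly with the first, so the
Spearman entry for that pair is -1 while the Pearson entry is only close to -1.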

## Hypothesis testing

Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically
significant, that is, whether it is unlikely to have occurred by chance. `spark.ml` currently supports Pearson's
Chi-squared ($\chi^2$) tests for independence, as well as the ANOVA test for classification tasks and
the F-value test for regression tasks.

### ANOVATest

`ANOVATest` computes ANOVA F-values between labels and features for classification tasks. The labels should be categorical
and the features should be continuous.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`ANOVATest` Scala docs](api/scala/org/apache/spark/ml/stat/ANOVATest$.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/ANOVATestExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [`ANOVATest` Java docs](api/java/org/apache/spark/ml/stat/ANOVATest.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaANOVATestExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`ANOVATest` Python docs](api/python/index.html#pyspark.ml.stat.ANOVATest$) for details on the API.

{% include_example python/ml/anova_test_example.py %}
</div>
</div>
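
As a rough illustration of the statistic itself, the sketch below computes the textbook one-way ANOVA
F-value for a single continuous feature grouped by a categorical label. It is plain Python, not the
Spark implementation, and the function name `anova_f_value` is an assumption made for this example.
The p-value, which would require the F distribution's CDF, is left out.

```python
from collections import defaultdict


def anova_f_value(feature, labels):
    """One-way ANOVA F-value for one continuous feature and categorical labels.

    Returns (f_value, df_between, df_within).
    """
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[y].append(x)

    n = len(feature)
    k = len(groups)
    grand_mean = sum(feature) / n
    group_means = {y: sum(xs) / len(xs) for y, xs in groups.items()}

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(xs) * (group_means[y] - grand_mean) ** 2 for y, xs in groups.items()
    )
    # Variation of the samples around their own group mean.
    ss_within = sum(
        (x - group_means[y]) ** 2 for y, xs in groups.items() for x in xs
    )

    df_between, df_within = k - 1, n - k
    # Note: identical values within every group give ss_within == 0,
    # which raises ZeroDivisionError in this sketch.
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical data: two classes whose feature values are clearly separated.
f_value, df_between, df_within = anova_f_value(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 0, 1, 1, 1]
)
```

For this data the between-group sum of squares is 13.5 and the within-group sum of squares is 4, with
1 and 4 degrees of freedom, so the F-value is 13.5. A larger F-value indicates that the feature
separates the classes more strongly.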

### ChiSquareTest

`ChiSquareTest` conducts Pearson's independence test for every feature against the label.
For each feature, the (feature, label) pairs are converted into a contingency matrix for which
the Chi-squared statistic is computed. All label and feature values must be categorical.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`ChiSquareTest` Scala docs](api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [`ChiSquareTest` Java docs](api/java/org/apache/spark/ml/stat/ChiSquareTest.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaChiSquareTestExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat.ChiSquareTest$) for details on the API.

{% include_example python/ml/chi_square_test_example.py %}
</div>

</div>
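
The contingency-matrix step described above can be sketched in plain Python as follows. This is an
illustration of Pearson's statistic for a single feature, not Spark's code; the function name
`chi_square_statistic` is an assumption, and the p-value computation is omitted.

```python
from collections import Counter


def chi_square_statistic(feature, labels):
    """Pearson's Chi-squared statistic for one categorical feature vs. the label.

    Returns (statistic, degrees_of_freedom).
    """
    n = len(feature)
    observed = Counter(zip(feature, labels))  # contingency matrix cells
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals

    statistic = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected cell count if feature and label were independent.
            expected = f_count * l_count / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Hypothetical data: feature value "a" mostly goes with label 0, "b" with label 1.
feature = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
statistic, dof = chi_square_statistic(feature, labels)
```

Here every cell has an expected count of 2 and an observed count of 3 or 1, so the statistic is
4 × (1² / 2) = 2 with 1 degree of freedom.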

### FValueTest

`FValueTest` computes F-values between labels and features for regression tasks. Both the labels
and features should be continuous.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [`FValueTest` Scala docs](api/scala/org/apache/spark/ml/stat/FValueTest$.html) for details on the API.

{% include_example scala/org/apache/spark/examples/ml/FValueTestExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [`FValueTest` Java docs](api/java/org/apache/spark/ml/stat/FValueTest.html) for details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaFValueTestExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`FValueTest` Python docs](api/python/index.html#pyspark.ml.stat.FValueTest$) for details on the API.

{% include_example python/ml/fvalue_test_example.py %}
</div>

</div>
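
One common way to define an F-value between a continuous feature and a continuous label is the
F-statistic of a univariate linear regression, $F = \frac{r^2}{1 - r^2}(n - 2)$, where $r$ is the
Pearson correlation between the two. The sketch below assumes that definition; it is plain Python
for illustration, the name `f_value_regression` is hypothetical, and it should not be read as the
exact procedure used by `FValueTest`.

```python
def f_value_regression(feature, labels):
    """F-statistic of a simple linear regression of the label on one feature.

    Returns (f_value, degrees_of_freedom) with degrees_of_freedom = n - 2.
    """
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(labels) / n

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, labels))
    sxx = sum((x - mean_x) ** 2 for x in feature)
    syy = sum((y - mean_y) ** 2 for y in labels)

    # Squared Pearson correlation: share of label variance explained by the feature.
    r_squared = sxy ** 2 / (sxx * syy)
    dof = n - 2
    # Note: a perfect linear fit gives r_squared == 1 and raises
    # ZeroDivisionError in this sketch.
    f_value = r_squared / (1.0 - r_squared) * dof
    return f_value, dof


# Hypothetical data: the label loosely increases with the feature.
f_value, dof = f_value_regression(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0]
)
```

For this data $r^2 = 0.6$ and $n - 2 = 3$, so the F-value is about 4.5.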

## Summarizer

We provide vector column summary statistics for `DataFrame` through `Summarizer`.
Available metrics are the column-wise max, min, mean, sum, variance, std, and number of nonzeros,
as well as the total count.

<div class="codetabs">
<div data-lang="scala" markdown="1">
The following example demonstrates using [`Summarizer`](api/scala/org/apache/spark/ml/stat/Summarizer$.html)
to compute the mean and variance for a vector column of the input DataFrame, with and without a weight column.

{% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %}
</div>

<div data-lang="java" markdown="1">
The following example demonstrates using [`Summarizer`](api/java/org/apache/spark/ml/stat/Summarizer.html)
to compute the mean and variance for a vector column of the input DataFrame, with and without a weight column.

{% include_example java/org/apache/spark/examples/ml/JavaSummarizerExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [`Summarizer` Python docs](api/python/index.html#pyspark.ml.stat.Summarizer$) for details on the API.

{% include_example python/ml/summarizer_example.py %}
</div>

</div>
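
To show what column-wise statistics with and without a weight column mean, here is a small
pure-Python sketch. It is not the Spark implementation: the function name `summarize` is
hypothetical, and the weighted variance uses the denominator $\sum w - \sum w^2 / \sum w$, one
common convention that reduces to the usual $n - 1$ when all weights are 1. Whether it matches
`Summarizer` exactly is an assumption.

```python
def summarize(rows, weights=None):
    """Column-wise mean, variance, max and min, plus the row count.

    rows:    list of equal-length numeric tuples (one per feature vector)
    weights: optional list of positive row weights; defaults to 1.0 per row
    """
    if weights is None:
        weights = [1.0] * len(rows)

    dim = len(rows[0])
    weight_sum = sum(weights)
    weight_sq_sum = sum(w * w for w in weights)

    mean = [
        sum(w * row[j] for row, w in zip(rows, weights)) / weight_sum
        for j in range(dim)
    ]
    # Assumed convention for the weighted, bias-corrected variance.
    # A single row gives a zero denominator and raises ZeroDivisionError.
    denominator = weight_sum - weight_sq_sum / weight_sum
    variance = [
        sum(w * (row[j] - mean[j]) ** 2 for row, w in zip(rows, weights))
        / denominator
        for j in range(dim)
    ]
    return {
        "mean": mean,
        "variance": variance,
        "max": [max(row[j] for row in rows) for j in range(dim)],
        "min": [min(row[j] for row in rows) for j in range(dim)],
        "count": len(rows),
    }


# Hypothetical data: two feature vectors with row weights 1.0 and 2.0.
rows = [(2.0, 3.0, 5.0), (4.0, 6.0, 7.0)]
weighted = summarize(rows, weights=[1.0, 2.0])
unweighted = summarize(rows)
```

With weights, the second row counts twice as much, so the mean of the first column moves from 3 to
10/3, while under this convention the variance stays at 2 in both cases.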