spark-instrumented-optimizer/docs/mllib-stats.md
Ameet Talwalkar c235b83e27 SPARK-2830 [MLlib]: re-organize mllib documentation
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

Author: Ameet Talwalkar <atalwalkar@gmail.com>

Closes #1908 from atalwalkar/master and squashes the following commits:

fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
2014-08-12 17:15:21 -07:00


---
layout: global
title: Statistics Functionality - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
---

* Table of contents
{:toc}

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]

Data Generators

Stratified Sampling

Summary Statistics

Multivariate summary statistics

We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`). If the number of columns is not large, e.g., on the order of thousands, then the covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows, and the computation is faster if the rows are sparse.
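The costs quoted above can be read off the standard sample covariance formula. As a sketch, assuming the usual unbiased normalization by $m - 1$ (an assumption; the text above does not state the normalization) and writing $\mathbf{x}_i \in \mathbb{R}^n$ for the $i$-th row (notation introduced here for illustration):

\[
\boldsymbol{\mu} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_i,
\qquad
\Sigma = \frac{1}{m-1} \sum_{i=1}^{m} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\top}
       = \frac{1}{m-1} \left( \sum_{i=1}^{m} \mathbf{x}_i \mathbf{x}_i^{\top} - m \, \boldsymbol{\mu} \boldsymbol{\mu}^{\top} \right)
\]

The result is an $n \times n$ matrix, hence the $\mathcal{O}(n^2)$ storage. Each of the $m$ outer products touches at most $n^2$ entries, giving $\mathcal{O}(m n^2)$ time. For a row with only $k_i$ nonzeros, the uncentered form on the right touches $k_i^2$ entries, which is one way to see why sparse rows are cheaper.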

`computeColumnSummaryStatistics()` returns an instance of `MultivariateStatisticalSummary`, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

val mat: RowMatrix = ... // a RowMatrix

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

// Compute the covariance matrix.
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}

In Java, `RowMatrix#computeColumnSummaryStatistics` returns an instance of `MultivariateStatisticalSummary`, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;

RowMatrix mat = ... // a RowMatrix

// Compute column summary statistics.
MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column

// Compute the covariance matrix.
Matrix cov = mat.computeCovariance();
{% endhighlight %}
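To make the contents of the summary concrete, the following is a minimal plain-Java sketch (no Spark dependency) of the quantities listed above: column-wise max, min, mean, variance, and number of nonzeros, plus the total count. It is an illustration only, not MLlib's implementation. The class `ColumnSummarySketch` and its method `summarize` are hypothetical names, and the unbiased variance (dividing by count minus one) is an assumption made for this example.

```java
import java.util.Arrays;

/** Hypothetical illustration of column summary statistics on a small dense matrix. */
final class ColumnSummarySketch {
    final long count;            // total number of rows
    final double[] mean;         // column-wise mean
    final double[] variance;     // column-wise variance (assumed unbiased: divides by count - 1)
    final double[] numNonzeros;  // number of nonzero entries in each column
    final double[] max;          // column-wise maximum
    final double[] min;          // column-wise minimum

    private ColumnSummarySketch(long count, double[] mean, double[] variance,
                                double[] numNonzeros, double[] max, double[] min) {
        this.count = count;
        this.mean = mean;
        this.variance = variance;
        this.numNonzeros = numNonzeros;
        this.max = max;
        this.min = min;
    }

    /** Computes all statistics in a single pass; every row must have the same length. */
    static ColumnSummarySketch summarize(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("at least one row is required");
        }
        int n = rows[0].length;
        double[] mean = new double[n];
        double[] m2 = new double[n];   // running sum of squared deviations from the mean
        double[] nnz = new double[n];
        double[] max = new double[n];
        double[] min = new double[n];
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        Arrays.fill(min, Double.POSITIVE_INFINITY);

        long count = 0;
        for (double[] row : rows) {
            if (row.length != n) {
                throw new IllegalArgumentException("all rows must have the same length");
            }
            count++;
            for (int j = 0; j < n; j++) {
                double v = row[j];
                // Welford's online update for the mean and squared deviations.
                double delta = v - mean[j];
                mean[j] += delta / count;
                m2[j] += delta * (v - mean[j]);
                if (v != 0.0) {
                    nnz[j]++;
                }
                max[j] = Math.max(max[j], v);
                min[j] = Math.min(min[j], v);
            }
        }

        double[] variance = new double[n];
        for (int j = 0; j < n; j++) {
            variance[j] = count > 1 ? m2[j] / (count - 1) : 0.0;
        }
        return new ColumnSummarySketch(count, mean, variance, nnz, max, min);
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, 0.0}, {3.0, 4.0}, {5.0, 0.0}};
        ColumnSummarySketch s = summarize(rows);
        System.out.println("count       = " + s.count);
        System.out.println("mean        = " + Arrays.toString(s.mean));
        System.out.println("variance    = " + Arrays.toString(s.variance));
        System.out.println("numNonzeros = " + Arrays.toString(s.numNonzeros));
        System.out.println("max         = " + Arrays.toString(s.max));
        System.out.println("min         = " + Arrays.toString(s.min));
    }
}
```

The single-pass update keeps only a few length-$n$ arrays in memory, regardless of the number of rows. That is why column summaries remain cheap even when the full $n \times n$ covariance matrix would be too large to hold locally.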

Hypothesis Testing