df0aa8353a
Some improvements to MLlib guide: 1. [SPARK-1872] Update API links for unidoc. 2. [SPARK-1783] Added `page.displayTitle` to the global layout. If it is defined, use it instead of `page.title` for title display. 3. Add more Java/Python examples. Author: Xiangrui Meng <meng@databricks.com> Closes #816 from mengxr/mllib-doc and squashes the following commits: ec2e407 [Xiangrui Meng] format scala example for ALS cd9f40b [Xiangrui Meng] add a paragraph to summarize distributed matrix types 4617f04 [Xiangrui Meng] add python example to loadLibSVMFile and fix Java example d6509c2 [Xiangrui Meng] [SPARK-1783] update mllib titles 561fdc0 [Xiangrui Meng] add a displayTitle option to global layout 195d06f [Xiangrui Meng] add Java example for summary stats and minor fix 9f1ff89 [Xiangrui Meng] update java api links in mllib-basics 7dad18e [Xiangrui Meng] update java api links in NB 3a0f4a6 [Xiangrui Meng] api/pyspark -> api/python 35bdeb9 [Xiangrui Meng] api/mllib -> api/scala e4afaa8 [Xiangrui Meng] explicity state what might change
528 lines
20 KiB
Markdown
528 lines
20 KiB
Markdown
---
|
|
layout: global
|
|
title: Basics - MLlib
|
|
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
|
|
---
|
|
|
|
* Table of contents
|
|
{:toc}
|
|
|
|
MLlib supports local vectors and matrices stored on a single machine,
|
|
as well as distributed matrices backed by one or more RDDs.
|
|
In the current implementation, local vectors and matrices are simple data models
|
|
to serve public interfaces. The underlying linear algebra operations are provided by
|
|
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
|
|
A training example used in supervised learning is called "labeled point" in MLlib.
|
|
|
|
## Local vector
|
|
|
|
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
|
|
machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by
|
|
a double array representing its entry values, while a sparse vector is backed by two parallel
|
|
arrays: indices and values. For example, a vector $(1.0, 0.0, 3.0)$ can be represented in dense
|
|
format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
|
|
of the vector.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
The base class of local vectors is
|
|
[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
|
|
implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
|
|
[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
|
|
using the factory methods implemented in
|
|
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.{Vector, Vectors}
|
|
|
|
// Create a dense vector (1.0, 0.0, 3.0).
|
|
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
|
|
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
|
|
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
|
|
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
|
|
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
|
|
{% endhighlight %}
|
|
|
|
***Note***
|
|
|
|
Scala imports `scala.collection.immutable.Vector` by default, so you have to import
|
|
`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
|
|
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
The base class of local vectors is
|
|
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
|
|
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
|
|
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
|
|
using the factory methods implemented in
|
|
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create local vectors.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.mllib.linalg.Vector;
|
|
import org.apache.spark.mllib.linalg.Vectors;
|
|
|
|
// Create a dense vector (1.0, 0.0, 3.0).
|
|
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
|
|
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
|
|
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
MLlib recognizes the following types as dense vectors:
|
|
|
|
* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
|
|
* Python's list, e.g., `[1, 2, 3]`
|
|
|
|
and the following as sparse vectors:
|
|
|
|
* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
|
|
* SciPy's
|
|
[`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
|
|
with a single column
|
|
|
|
We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
|
|
in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
|
|
|
|
{% highlight python %}
|
|
import numpy as np
|
|
import scipy.sparse as sps
|
|
from pyspark.mllib.linalg import Vectors
|
|
|
|
# Use a NumPy array as a dense vector.
|
|
dv1 = np.array([1.0, 0.0, 3.0])
|
|
# Use a Python list as a dense vector.
|
|
dv2 = [1.0, 0.0, 3.0]
|
|
# Create a SparseVector.
|
|
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
|
|
# Use a single-column SciPy csc_matrix as a sparse vector.
|
|
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))
|
|
{% endhighlight %}
|
|
|
|
</div>
|
|
</div>
|
|
|
|
## Labeled point
|
|
|
|
A labeled point is a local vector, either dense or sparse, associated with a label/response.
|
|
In MLlib, labeled points are used in supervised learning algorithms.
|
|
We use a double to store a label, so we can use labeled points in both regression and classification.
|
|
For binary classification, label should be either $0$ (negative) or $1$ (positive).
|
|
For multiclass classification, labels should be class indices staring from zero: $0, 1, 2, \ldots$.
|
|
|
|
<div class="codetabs">
|
|
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
A labeled point is represented by the case class
|
|
[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.Vectors
|
|
import org.apache.spark.mllib.regression.LabeledPoint
|
|
|
|
// Create a labeled point with a positive label and a dense feature vector.
|
|
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
|
|
|
|
// Create a labeled point with a negative label and a sparse feature vector.
|
|
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
A labeled point is represented by
|
|
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.mllib.linalg.Vectors;
|
|
import org.apache.spark.mllib.regression.LabeledPoint;
|
|
|
|
// Create a labeled point with a positive label and a dense feature vector.
|
|
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
|
|
|
|
// Create a labeled point with a negative label and a sparse feature vector.
|
|
LabeledPoint neg = new LabeledPoint(1.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
|
|
A labeled point is represented by
|
|
[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
|
|
|
|
{% highlight python %}
|
|
from pyspark.mllib.linalg import SparseVector
|
|
from pyspark.mllib.regression import LabeledPoint
|
|
|
|
# Create a labeled point with a positive label and a dense feature vector.
|
|
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
|
|
|
|
# Create a labeled point with a negative label and a sparse feature vector.
|
|
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
|
|
{% endhighlight %}
|
|
</div>
|
|
</div>
|
|
|
|
***Sparse data***
|
|
|
|
It is very common in practice to have sparse training data. MLlib supports reading training
|
|
examples stored in `LIBSVM` format, which is the default format used by
|
|
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
|
|
[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format. Each line
|
|
represents a labeled sparse feature vector using the following format:
|
|
|
|
~~~
|
|
label index1:value1 index2:value2 ...
|
|
~~~
|
|
|
|
where the indices are one-based and in ascending order.
|
|
After loading, the feature indices are converted to zero-based.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
|
|
examples stored in LIBSVM format.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.regression.LabeledPoint
|
|
import org.apache.spark.mllib.util.MLUtils
|
|
import org.apache.spark.rdd.RDD
|
|
|
|
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
|
|
examples stored in LIBSVM format.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.mllib.regression.LabeledPoint;
|
|
import org.apache.spark.mllib.util.MLUtils;
|
|
import org.apache.spark.api.java.JavaRDD;
|
|
|
|
JavaRDD<LabeledPoint> examples =
|
|
MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
|
|
examples stored in LIBSVM format.
|
|
|
|
{% highlight python %}
|
|
from pyspark.mllib.util import MLUtils
|
|
|
|
examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
|
|
{% endhighlight %}
|
|
</div>
|
|
</div>
|
|
|
|
## Local matrix
|
|
|
|
A local matrix has integer-typed row and column indices and double-typed values, stored on a single
|
|
machine. MLlib supports dense matrix, whose entry values are stored in a single double array in
|
|
column major. For example, the following matrix `\[ \begin{pmatrix}
|
|
1.0 & 2.0 \\
|
|
3.0 & 4.0 \\
|
|
5.0 & 6.0
|
|
\end{pmatrix}
|
|
\]`
|
|
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
|
|
We are going to add sparse matrix in the next release.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
The base class of local matrices is
|
|
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
|
|
implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
|
|
Sparse matrix will be added in the next release. We recommend using the factory methods implemented
|
|
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
|
|
matrices.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
|
|
|
|
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
|
|
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
The base class of local matrices is
|
|
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
|
|
implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
|
|
Sparse matrix will be added in the next release. We recommend using the factory methods implemented
|
|
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
|
|
matrices.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.mllib.linalg.Matrix;
|
|
import org.apache.spark.mllib.linalg.Matrices;
|
|
|
|
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
|
|
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
</div>
|
|
|
|
## Distributed matrix
|
|
|
|
A distributed matrix has long-typed row and column indices and double-typed values, stored
|
|
distributively in one or more RDDs. It is very important to choose the right format to store large
|
|
and distributed matrices. Converting a distributed matrix to a different format may require a
|
|
global shuffle, which is quite expensive. We implemented three types of distributed matrices in
|
|
this release and will add more types in the future.
|
|
|
|
The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
|
|
matrix without meaningful row indices, e.g., a collection of feature vectors.
|
|
It is backed by an RDD of its rows, where each row is a local vector.
|
|
We assume that the number of columns is not huge for a `RowMatrix`.
|
|
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
|
|
which can be used for identifying rows and joins.
|
|
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
|
|
backed by an RDD of its entries.
|
|
|
|
***Note***
|
|
|
|
The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
|
|
It is always error-prone to have non-deterministic RDDs.
|
|
|
|
### RowMatrix
|
|
|
|
A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
|
|
of its rows, where each row is a local vector. This is similar to `data matrix` in the context of
|
|
multivariate statistics. Since each row is represented by a local vector, the number of columns is
|
|
limited by the integer range but it should be much smaller in practice.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
|
|
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.Vector
|
|
import org.apache.spark.mllib.linalg.distributed.RowMatrix
|
|
|
|
val rows: RDD[Vector] = ... // an RDD of local vectors
|
|
// Create a RowMatrix from an RDD[Vector].
|
|
val mat: RowMatrix = new RowMatrix(rows)
|
|
|
|
// Get its size.
|
|
val m = mat.numRows()
|
|
val n = mat.numCols()
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
|
|
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.api.java.JavaRDD;
|
|
import org.apache.spark.mllib.linalg.Vector;
|
|
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
|
|
|
|
JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
|
|
// Create a RowMatrix from an JavaRDD<Vector>.
|
|
RowMatrix mat = new RowMatrix(rows.rdd());
|
|
|
|
// Get its size.
|
|
long m = mat.numRows();
|
|
long n = mat.numCols();
|
|
{% endhighlight %}
|
|
</div>
|
|
</div>
|
|
|
|
#### Multivariate summary statistics
|
|
|
|
We provide column summary statistics for `RowMatrix`.
|
|
If the number of columns is not large, say, smaller than 3000, you can also compute
|
|
the covariance matrix as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
|
|
number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
|
|
which could be faster if the rows are sparse.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
|
|
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
|
|
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
|
|
total count.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.Matrix
|
|
import org.apache.spark.mllib.linalg.distributed.RowMatrix
|
|
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
|
|
|
|
val mat: RowMatrix = ... // a RowMatrix
|
|
|
|
// Compute column summary statistics.
|
|
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
|
|
println(summary.mean) // a dense vector containing the mean value for each column
|
|
println(summary.variance) // column-wise variance
|
|
println(summary.numNonzeros) // number of nonzeros in each column
|
|
|
|
// Compute the covariance matrix.
|
|
val cov: Matrix = mat.computeCovariance()
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
|
|
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
|
|
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
|
|
total count.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.mllib.linalg.Matrix;
|
|
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
|
|
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
|
|
|
|
RowMatrix mat = ... // a RowMatrix
|
|
|
|
// Compute column summary statistics.
|
|
MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
|
|
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
|
|
System.out.println(summary.variance()); // column-wise variance
|
|
System.out.println(summary.numNonzeros()); // number of nonzeros in each column
|
|
|
|
// Compute the covariance matrix.
|
|
Matrix cov = mat.computeCovariance();
|
|
{% endhighlight %}
|
|
</div>
|
|
</div>
|
|
|
|
### IndexedRowMatrix
|
|
|
|
An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices. It is backed by
|
|
an RDD of indexed rows, which each row is represented by its index (long-typed) and a local vector.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
An
|
|
[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
|
|
can be created from an `RDD[IndexedRow]` instance, where
|
|
[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
|
|
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
|
|
its row indices.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
|
|
|
|
val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
|
|
// Create an IndexedRowMatrix from an RDD[IndexedRow].
|
|
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
|
|
|
|
// Get its size.
|
|
val m = mat.numRows()
|
|
val n = mat.numCols()
|
|
|
|
// Drop its row indices.
|
|
val rowMat: RowMatrix = mat.toRowMatrix()
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
An
|
|
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
|
|
can be created from an `JavaRDD<IndexedRow>` instance, where
|
|
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
|
|
wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
|
|
its row indices.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.api.java.JavaRDD;
|
|
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
|
|
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
|
|
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
|
|
|
|
JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
|
|
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
|
|
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
|
|
|
|
// Get its size.
|
|
long m = mat.numRows();
|
|
long n = mat.numCols();
|
|
|
|
// Drop its row indices.
|
|
RowMatrix rowMat = mat.toRowMatrix();
|
|
{% endhighlight %}
|
|
</div></div>
|
|
|
|
### CoordinateMatrix
|
|
|
|
A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries. Each entry is a tuple
|
|
of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
|
|
`value` is the entry value. A `CoordinateMatrix` should be used only in the case when both
|
|
dimensions of the matrix are huge and the matrix is very sparse.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
A
|
|
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
|
|
can be created from an `RDD[MatrixEntry]` instance, where
|
|
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
|
|
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
|
|
with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
|
|
computation for `CoordinateMatrix`.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
|
|
|
|
val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
|
|
// Create a CoordinateMatrix from an RDD[MatrixEntry].
|
|
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
|
|
|
|
// Get its size.
|
|
val m = mat.numRows()
|
|
val n = mat.numCols()
|
|
|
|
// Convert it to an IndexRowMatrix whose rows are sparse vectors.
|
|
val indexedRowMatrix = mat.toIndexedRowMatrix()
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
A
|
|
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
|
|
can be created from a `JavaRDD<MatrixEntry>` instance, where
|
|
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
|
|
wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
|
|
with sparse rows by calling `toIndexedRowMatrix`.
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.api.java.JavaRDD;
|
|
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
|
|
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
|
|
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
|
|
|
|
JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
|
|
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
|
|
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
|
|
|
|
// Get its size.
|
|
long m = mat.numRows();
|
|
long n = mat.numCols();
|
|
|
|
// Convert it to an IndexRowMatrix whose rows are sparse vectors.
|
|
IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
|
|
{% endhighlight %}
|
|
</div>
|
|
</div>
|