---
layout: global
title: Basics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
---
* Table of contents
{:toc}
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called a "labeled point" in MLlib.
## Local vector
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by
a double array representing its entry values, while a sparse vector is backed by two parallel
arrays: indices and values. For example, a vector $(1.0, 0.0, 3.0)$ can be represented in dense
format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
of the vector.
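To make the correspondence between the two representations concrete, here is a minimal Python sketch (an illustration only; `to_dense` is a hypothetical helper, not part of MLlib):

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list.
    Entries not listed in `indices` are implicitly 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The sparse form (3, [0, 2], [1.0, 3.0]) denotes the vector (1.0, 0.0, 3.0).
print(to_dense(3, [0, 2], [1.0, 3.0]))  # [1.0, 0.0, 3.0]
```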
<div class="codetabs">
<div data-lang="scala" markdown="1">
The base class of local vectors is
[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
using the factory methods implemented in
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
{% endhighlight %}
***Note***
Scala imports `scala.collection.immutable.Vector` by default, so you have to import
`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
</div>
<div data-lang="java" markdown="1">
The base class of local vectors is
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
using the factory methods implemented in
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vectors.html) to create local vectors.
{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
MLlib recognizes the following types as dense vectors:
* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
* Python's list, e.g., `[1, 2, 3]`
and the following as sparse vectors:
* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
* SciPy's
  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
  with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
{% highlight python %}
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors
# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
{% endhighlight %}
</div>
</div>
## Labeled point
A labeled point is a local vector, either dense or sparse, associated with a label/response.
In MLlib, labeled points are used in supervised learning algorithms.
We use a double to store a label, so we can use labeled points in both regression and classification.
For binary classification, a label should be either $0$ (negative) or $1$ (positive).
For multiclass classification, labels should be class indices starting from zero: $0, 1, 2, \ldots$.
<div class="codetabs">
<div data-lang="scala" markdown="1">
A labeled point is represented by the case class
[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
A labeled point is represented by
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
// Create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
// Create a labeled point with a negative label and a sparse feature vector.
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
A labeled point is represented by
[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
{% highlight python %}
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
{% endhighlight %}
</div>
</div>
***Sparse data***
It is very common in practice to have sparse training data. MLlib supports reading training
examples stored in `LIBSVM` format, which is the default format used by
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format. Each line
represents a labeled sparse feature vector using the following format:
~~~
label index1:value1 index2:value2 ...
~~~
where the indices are one-based and in ascending order.
After loading, the feature indices are converted to zero-based.
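To make the one-based-to-zero-based conversion concrete, here is a minimal Python sketch of parsing one such line (a hypothetical helper for illustration, independent of MLlib's actual loader):

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, indices, values).
    Indices in the file are one-based; the returned indices are zero-based."""
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # convert to zero-based
        values.append(float(val))
    return label, indices, values

# "1 1:1.0 3:3.0" corresponds to label 1.0 and the sparse vector (3, [0, 2], [1.0, 3.0]).
print(parse_libsvm_line("1 1:1.0 3:3.0"))
```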
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.
{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.
{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.api.java.JavaRDD;
JavaRDD<LabeledPoint> examples =
MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
examples stored in LIBSVM format.
{% highlight python %}
from pyspark.mllib.util import MLUtils
examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>
## Local matrix
A local matrix has integer-typed row and column indices and double-typed values, stored on a single
machine. MLlib supports dense matrices, whose entry values are stored in a single double array in
column-major order. For example, the following matrix `\[ \begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
5.0 & 6.0
\end{pmatrix}
\]`
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
We are going to add sparse matrices in the next release.
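In column-major storage, the entry at row `i`, column `j` of an `m x n` matrix sits at position `i + j * m` in the flat array. A quick sanity check in plain Python (an illustration only, not MLlib code):

```python
num_rows, num_cols = 3, 2
values = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]  # column-major storage of the 3x2 matrix above

def entry(i, j):
    """Entry (i, j) of a column-major matrix stored in a flat array."""
    return values[i + j * num_rows]

# entry(2, 1) is the value in row 2, column 1 of the matrix, i.e., 6.0.
print(entry(2, 1))
```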
<div class="codetabs">
<div data-lang="scala" markdown="1">
The base class of local matrices is
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
matrices.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
The base class of local matrices is
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices.
{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
{% endhighlight %}
</div>
</div>
## Distributed matrix
A distributed matrix has long-typed row and column indices and double-typed values, stored
distributively in one or more RDDs. It is very important to choose the right format to store large
and distributed matrices. Converting a distributed matrix to a different format may require a
global shuffle, which is quite expensive. We implemented three types of distributed matrices in
this release and will add more types in the future.
The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
We assume that the number of columns is not huge for a `RowMatrix` .
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
backed by an RDD of its entries.
***Note***
The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
In general, using non-deterministic RDDs is error-prone.
### RowMatrix
A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
of its rows, where each row is a local vector. This is similar to a data matrix in the context of
multivariate statistics. Since each row is represented by a local vector, the number of columns is
limited by the integer range, but it should be much smaller in practice.
<div class="codetabs">
<div data-lang="scala" markdown="1">
A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();
{% endhighlight %}
</div>
</div>
#### Multivariate summary statistics
We provide column summary statistics for `RowMatrix`.
If the number of columns is not large, say, smaller than 3000, you can also compute
the covariance matrix as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
which could be faster if the rows are sparse.
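To see where the $\mathcal{O}(n^2)$ storage and $\mathcal{O}(m n^2)$ time come from, here is a naive pure-Python sample covariance (denominator $m - 1$; an illustration only, so MLlib's implementation details and exact conventions may differ):

```python
def column_covariance(rows):
    """Sample covariance of the columns, treating each row as an observation.
    The result is n x n (the O(n^2) storage); filling it visits each of the
    m rows once per (j, k) pair, hence the O(m n^2) CPU time."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for r in rows:
        for j in range(n):
            for k in range(n):
                cov[j][k] += (r[j] - means[j]) * (r[k] - means[k])
    return [[c / (m - 1) for c in row] for row in cov]

print(column_covariance([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

When the rows are sparse, most deviation products involve stored zeros, which is why a sparse-aware implementation can be faster.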
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
val mat: RowMatrix = ... // a RowMatrix
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
// Compute the covariance matrix.
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.
{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
RowMatrix mat = ... // a RowMatrix
// Compute column summary statistics.
MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column
// Compute the covariance matrix.
Matrix cov = mat.computeCovariance();
{% endhighlight %}
</div>
</div>
### IndexedRowMatrix
An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices. It is backed by
an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector.
<div class="codetabs">
<div data-lang="scala" markdown="1">
An
[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.
{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
An
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();
// Drop its row indices.
RowMatrix rowMat = mat.toRowMatrix();
{% endhighlight %}
</div>
</div>
### CoordinateMatrix
A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries. Each entry is a tuple
of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
`value` is the entry value. A `CoordinateMatrix` should be used only when both
dimensions of the matrix are huge and the matrix is very sparse.
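As a plain-Python illustration of the COO layout (hypothetical, not MLlib code), only the nonzero entries are stored as `(row, col, value)` triples, and the matrix dimensions can be inferred from the largest indices present:

```python
# COO format: each entry is (row, col, value); only nonzeros are stored.
entries = [(0, 0, 1.0), (1, 0, 3.0), (2, 1, 6.0)]

# Infer the size from the largest row and column indices (zero-based).
num_rows = 1 + max(i for i, _, _ in entries)
num_cols = 1 + max(j for _, j, _ in entries)
print(num_rows, num_cols)  # 3 2
```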
<div class="codetabs">
<div data-lang="scala" markdown="1">
A
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
computation for `CoordinateMatrix`.
{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
A
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
// Get its size.
long m = mat.numRows();
long n = mat.numCols();
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
{% endhighlight %}
</div>
</div>