spark-instrumented-optimizer/python/pyspark/mllib
Liang-Chi Hsieh 12206058e8 [SPARK-20214][ML] Make sure converted csc matrix has sorted indices
## What changes were proposed in this pull request?

`_convert_to_vector` converts a scipy sparse matrix to a csc matrix in order to initialize a `SparseVector`. However, the conversion does not guarantee that the resulting csc matrix has sorted indices, so a call like the following fails:

    from scipy.sparse import lil_matrix
    from pyspark.mllib.linalg import _convert_to_vector
    lil = lil_matrix((4, 1))
    lil[1, 0] = 1
    lil[3, 0] = 2
    _convert_to_vector(lil.todok())

    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
      return SparseVector(l.shape[0], csc.indices, csc.data)
    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
      % (self.indices[i], self.indices[i + 1]))
    TypeError: Indices 3 and 1 are not strictly increasing

A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:

    >>> from scipy.sparse import lil_matrix
    >>> lil = lil_matrix((4, 1))
    >>> lil[1, 0] = 1
    >>> lil[3, 0] = 2
    >>> dok = lil.todok()
    >>> csc = dok.tocsc()
    >>> csc.has_sorted_indices
    0
    >>> csc.indices
    array([3, 1], dtype=int32)

I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
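
As an illustration of the idea, here is a minimal sketch of one way to enforce sorted indices after conversion. It assumes scipy's `has_sorted_indices` attribute and `sorted_indices()` method on csc matrices; `to_sorted_csc` is a hypothetical helper name used only for this example and is not necessarily how the patch itself is written.

```python
import numpy as np
from scipy.sparse import csc_matrix


def to_sorted_csc(m):
    """Hypothetical helper: convert a scipy sparse matrix to CSC with sorted indices."""
    csc = m.tocsc()
    if not csc.has_sorted_indices:
        # sorted_indices() returns a sorted copy, so the input is left untouched.
        csc = csc.sorted_indices()
    return csc


# Build a 4x1 CSC matrix whose row indices are deliberately out of order.
unsorted = csc_matrix(
    (np.array([2.0, 1.0]), np.array([3, 1]), np.array([0, 2])), shape=(4, 1)
)
fixed = to_sorted_csc(unsorted)
print(fixed.indices, fixed.data)
```

The explicit construction from `(data, indices, indptr)` is used here because whether `dok_matrix.tocsc()` yields unsorted indices may depend on the scipy version.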

## How was this patch tested?

Existing tests.


Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17532 from viirya/make-sure-sorted-indices.
2017-04-05 17:46:44 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| `linalg` | [SPARK-20214][ML] Make sure converted csc matrix has sorted indices | 2017-04-05 17:46:44 -07:00 |
| `stat` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `__init__.py` | [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide | 2016-07-15 13:38:23 -07:00 |
| `classification.py` | [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML | 2016-07-13 12:33:39 -07:00 |
| `clustering.py` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `common.py` | [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch | 2016-10-03 14:12:03 -07:00 |
| `evaluation.py` | [SPARK-15823][PYSPARK][ML] Add @property for 'accuracy' in MulticlassMetrics | 2016-06-10 10:09:19 +01:00 |
| `feature.py` | [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change | 2017-01-10 13:09:58 +00:00 |
| `fpm.py` | [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML | 2016-07-13 12:33:39 -07:00 |
| `random.py` | [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code | 2016-05-23 18:14:48 -07:00 |
| `recommendation.py` | [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter | 2017-03-21 13:23:59 +00:00 |
| `regression.py` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `tests.py` | [SPARK-20214][ML] Make sure converted csc matrix has sorted indices | 2017-04-05 17:46:44 -07:00 |
| `tree.py` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `util.py` | [SPARK-18445][BUILD][DOCS] Fix the markdown for Note:/NOTE:/Note that/'''Note:''' across Scala/Java API documentation | 2016-11-19 11:24:15 +00:00 |