Vectors.dense() should accept numbers directly, like the one in Scala. We already use it in doctests, it worked by luck.
cc mengxr jkbradley
Author: Davies Liu <davies@databricks.com>
Closes#7476 from davies/fix_vectors_dense and squashes the following commits:
e0fd292 [Davies Liu] fix Vectors.dense
Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005).
Author: Feynman Liang <fliang@databricks.com>
Closes#7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits:
f1112fc [Feynman Liang] Add explainedVariance formula
1a3d098 [Feynman Liang] SROwen code review comments
08a0e1b [Feynman Liang] Fix pyspark tests
db8605a [Feynman Liang] Style fix
bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased
c235de0 [Feynman Liang] Fix RegressionMetrics tests
4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
I implemented the Python API for LDA. But I didn't implemented a method for `LDAModel.describeTopics()`, beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue.
TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#6791 from yu-iskw/SPARK-6259 and squashes the following commits:
6855f59 [Yu ISHIKAWA] LDA inherits object
28bd165 [Yu ISHIKAWA] Change the place of testing code
d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
9f8bed8 [Yu ISHIKAWA] Simplify casting
faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
73412c3 [Yu ISHIKAWA] Fix the typo
2278829 [Yu ISHIKAWA] Fix the indentation
39514ec [Yu ISHIKAWA] Modify how to cast the input data
8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7286 from yanboliang/spark-8068 and squashes the following commits:
6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
Adding __str__ and __repr__ to DenseMatrix and SparseMatrix
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6342 from MechCoder/spark-7785 and squashes the following commits:
7b9a82c [MechCoder] Add tests for greater than 16 elements
b88e9dd [MechCoder] Increment limit to 16
1425a01 [MechCoder] Change tests
36bd166 [MechCoder] Change str and repr representation
97f0da9 [MechCoder] zip is same as izip in python3
94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
b26fa89 [MechCoder] minor
394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
Follow up for https://github.com/apache/spark/pull/5946
Currently we iterate over indices and values in SparseVector and can be vectorized.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#7222 from MechCoder/sparse_optim and squashes the following commits:
dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
PySpark PowerIterationClustering test failure due to bad demo data.
If the data is small, PowerIterationClustering will behavior indeterministic.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7177 from yanboliang/spark-8765 and squashes the following commits:
392ae54 [Yanbo Liang] fix model.assignments output
5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
This reverts commit 25f574eb9a. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#7240 from mengxr/SPARK-7212.revert and squashes the following commits:
2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
Currently we iterate over indices which can be vectorized.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5946 from MechCoder/spark-7203 and squashes the following commits:
034d086 [MechCoder] Vectorize dot calculation for numpy arrays for ndim=2
bce2b07 [MechCoder] fix doctest
fcad0a3 [MechCoder] Remove type checks for list, pyarray etc
0ee5dd4 [MechCoder] Add tests and other isinstance changes
e5f1de0 [MechCoder] [SPARK-7401] Vectorize dot product and sq_dist
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#6821 from yu-iskw/SPARK-7104 and squashes the following commits:
975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
Python bindings for StreamingLinearRegressionWithSGD
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6744 from MechCoder/spark-4127 and squashes the following commits:
d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
1b4ddd6 [MechCoder] minor
4de6c68 [MechCoder] Minor refactor
5e85a3b [MechCoder] Add tests for simultaneous training and prediction
fb27889 [MechCoder] Add example and docs
505380b [MechCoder] Add tests
d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
Python support for Power Iteration Clustering
https://issues.apache.org/jira/browse/SPARK-5962
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6992 from yanboliang/pyspark-pic and squashes the following commits:
6b03d82 [Yanbo Liang] address comments
4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
Support mining of ordered frequent item sequences.
Author: Feynman Liang <fliang@databricks.com>
Closes#6997 from feynmanliang/fp-sequence and squashes the following commits:
7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
0d3e4b6 [Feynman Liang] Fix python test
ce987cb [Feynman Liang] Backwards compatibility aux constructor
34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
648d4d4 [Feynman Liang] Test case for frequent item sequences
252a36a [Feynman Liang] Add sequence learning flag
Keep the same naming conventions for PythonMLLibAPI.
Only the following three functions is different from others
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So change them to
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
It does not affect any users and public APIs, only to make better understand for developer and code hacker.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
Add Python bindings to StreamingLogisticRegressionwithSGD.
No Java wrappers are needed as models are updated directly using train.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6849 from MechCoder/spark-3258 and squashes the following commits:
b4376a5 [MechCoder] minor
d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6715 from MechCoder/generate_linear_input and squashes the following commits:
6182884 [MechCoder] Minor changes
8bda047 [MechCoder] Minor style fixes
0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
Author: Holden Karau <holden@pigscanfly.ca>
Closes#6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:
2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#6926 from yu-iskw/SPARK-8511 and squashes the following commits:
7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
Python API for PCA and PCAModel
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6315 from yanboliang/spark-7604 and squashes the following commits:
1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
Python bindings for StreamingKMeans
Will change status to MRG once docs, tests and examples are updated.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6499 from MechCoder/spark-4118 and squashes the following commits:
7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
Python API for org.apache.spark.mllib.feature.ElementwiseProduct
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6346 from MechCoder/spark-7605 and squashes the following commits:
79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
MatrixUDT was recently coded in scala. This has been ported to PySpark
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6354 from MechCoder/spark-6390 and squashes the following commits:
fc4dc1e [MechCoder] Better error message
c940a44 [MechCoder] Added test
aa9c391 [MechCoder] Add pyUDT to MatrixUDT
62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
Check then make the MLlib Python classification and regression doc to be as complete as the Scala doc.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6460 from yanboliang/spark-7916 and squashes the following commits:
f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
Python API for KernelDensity
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6387 from MechCoder/spark-7639 and squashes the following commits:
17abc62 [MechCoder] add tests
2de6540 [MechCoder] style tests
bf4acc0 [MechCoder] Added doctests
84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
The current checking does version `1.x' is less than `1.4' this will fail if x has greater than 1 digit, since x > 4, however `1.x` < `1.4`
It fails in my system since I have version `1.10` :P
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6579 from MechCoder/np_ver and squashes the following commits:
15430f8 [MechCoder] fix syntax error
893fb7e [MechCoder] remove equal to
e35f0d4 [MechCoder] minor
e89376c [MechCoder] Better checking
22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
Check then make the MLlib Python evaluation and feature doc to be as complete as the Scala doc.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6461 from yanboliang/spark-7918 and squashes the following commits:
940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse
a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`.
Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python.
Author: Michael Nazario <mnazario@palantir.com>
Closes#6439 from mnazario/feature/SPARK-7899 and squashes the following commits:
366ef30 [Michael Nazario] Remove hack on random.py
bb8b04d [Michael Nazario] Make doctests consistent with other tests
6ee4f75 [Michael Nazario] Change test scripts to use "-m"
673528f [Michael Nazario] Move _types back to types
Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`.
coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#6468 from mengxr/SPARK-7922 and squashes the following commits:
7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark
1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
Add MultilabelMetrics in PySpark/MLlib
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6276 from yanboliang/spark-6094 and squashes the following commits:
b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
Fixed the following warnings in `make clean html` under `python/docs`:
~~~
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
~~~
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#6221 from mengxr/SPARK-6657 and squashes the following commits:
e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
2b4371e [Xiangrui Meng] fix mllib python doc warnings
In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD.
Author: FlytxtRnD <meethu.mathew@flytxt.com>
Closes#6180 from FlytxtRnD/GmmPredictException and squashes the following commits:
4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python
```scala
GaussianMixture
setInitialModel
GaussianMixtureModel
k
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6087 from yanboliang/spark-6258 and squashes the following commits:
b3af21c [Yanbo Liang] fix typo
2b645c1 [Yanbo Liang] fix doc
638b4b7 [Yanbo Liang] address comments
b5bcade [Yanbo Liang] GaussianMixture Python API parity check
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6044 from yanboliang/spark-6092 and squashes the following commits:
726a9b1 [Yanbo Liang] add newRankingMetrics
33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5939 from yanboliang/spark-5913 and squashes the following commits:
cdaac99 [Yanbo Liang] Python API for ChiSqSelector
Add
1. Class methods squared_dist
3. parse
4. norm
5. numNonzeros
6. copy
I made a few vectorizations wrt squared_dist and dot as well. I have added support for SparseMatrix serialization in a separate PR (https://github.com/apache/spark/pull/5775) and plan to complete support for Matrices in another PR.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5872 from MechCoder/local_linalg_api and squashes the following commits:
a8ff1e0 [MechCoder] minor
ce3e53e [MechCoder] Add error message for parser
1bd3c04 [MechCoder] Robust parser and removed unnecessary methods
f779561 [MechCoder] [SPARK-7328] Pyspark.mllib.linalg.Vectors: Missing items
https://issues.apache.org/jira/browse/SPARK-6093
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5941 from yanboliang/spark-6093 and squashes the following commits:
6934af3 [Yanbo Liang] change to @property
aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
Closes#5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
Utilities for pickling and unpickling SparseMatrices using SerDe
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5775 from MechCoder/spark-7202 and squashes the following commits:
7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
Adds
rank, recommendUsers and RecommendProducts to MatrixFactorizationModel in PySpark.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5807 from MechCoder/spark-6257 and squashes the following commits:
09629c6 [MechCoder] doc
953b326 [MechCoder] [SPARK-6257] MLlib API missing items in Recommendation
Added Matrix, SparseMatrix to __all__ list in linalg.py
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5759 from jkbradley/SPARK-7208 and squashes the following commits:
deb51a2 [Joseph K. Bradley] Added Matrix, SparseMatrix to __all__ list in linalg.py
Make PySpark ```FPGrowthModel.freqItemsets``` consistent with Java/Scala API like ```MatrixFactorizationModel.userFeatures```
It return a RDD with each tuple is composed of an array and a long value.
I think it's difficult to implement namedtuples to wrap the output because items of freqItemsets can be any type with arbitrary length which is tedious to impelement corresponding SerDe function.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5614 from yanboliang/spark-6827 and squashes the following commits:
da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
This PR try to speed up some python tests:
```
tests.py 144s -> 103s -41s
mllib/classification.py 24s -> 17s -7s
mllib/regression.py 27s -> 15s -12s
mllib/tree.py 27s -> 13s -14s
mllib/tests.py 64s -> 31s -33s
streaming/tests.py 185s -> 84s -101s
```
Considering python3, the total saving will be 558s (almost 10 minutes) (core, and streaming run three times, mllib runs twice).
During testing, it will show used time for each test file:
```
Run core tests ...
Running test: pyspark/rdd.py ... ok (22s)
Running test: pyspark/context.py ... ok (16s)
Running test: pyspark/conf.py ... ok (4s)
Running test: pyspark/broadcast.py ... ok (4s)
Running test: pyspark/accumulators.py ... ok (4s)
Running test: pyspark/serializers.py ... ok (6s)
Running test: pyspark/profiler.py ... ok (5s)
Running test: pyspark/shuffle.py ... ok (1s)
Running test: pyspark/tests.py ... ok (103s) 144s
```
Author: Reynold Xin <rxin@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5605 from rxin/python-tests-speed and squashes the following commits:
d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
89321ee [Xiangrui Meng] fix seed in tests
3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
SchemaRDD works with ALS.train in 1.2, so we should continue support DataFrames for compatibility. coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5619 from mengxr/SPARK-7036 and squashes the following commits:
dfcaf5a [Xiangrui Meng] ALS.train should support DataFrames in PySpark
Since sparse matrices now support a isTransposed flag for row major data, DenseMatrices should do the same.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5455 from MechCoder/spark-6845 and squashes the following commits:
525c370 [MechCoder] minor
004a37f [MechCoder] Cast boolean to int
151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix