Commit graph

1151 commits

Author SHA1 Message Date
Xiangrui Meng 9bb695b7a8 [SPARK-12000] do not specify arg types when reference a method in ScalaDoc
This fixes SPARK-12000, verified locally with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations.

cc: JoshRosen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #10114 from mengxr/SPARK-12000.2.
2015-12-02 17:19:31 -08:00
Yu ISHIKAWA de07d06abe [SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tunning
cc mengxr noel-smith

I worked on this issue based on https://github.com/apache/spark/pull/8729.
ehsanmok, thank you for your contribution!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>

Closes #9338 from yu-iskw/JIRA-10266.
2015-12-02 14:15:54 -08:00
Cheng Lian 69dbe6b40d [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues
This PR backports PR #10039 to master

Author: Cheng Lian <lian@databricks.com>

Closes #10063 from liancheng/spark-12046.doc-fix.master.
2015-12-01 10:21:31 -08:00
Yuhao Yang a0af0e351e [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization.

Using broadcast can greatly improve the performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:

1. decrease the worker memory consumption by 45%.
2. decrease running time by 40%.

This will also help extend the upper limit for Word2Vec.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9878 from hhbyyh/w2vBC.
2015-12-01 09:26:58 +00:00
Yuhao Yang 52bc25c8e2 [SPARK-11847][ML] Model export/import for spark.ml: LDA
Add read/write support to LDA, similar to ALS.

save/load for ml.LocalLDAModel is done.
For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9894 from hhbyyh/ldaMLsave.
2015-11-24 09:56:17 -08:00
Joseph K. Bradley 9e24ba667e [SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col
Doc for 1.6 that the summaries mostly ignore the weight column.
To be corrected for 1.7

CC: mengxr thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9927 from jkbradley/linregsummary-doc.
2015-11-24 09:54:55 -08:00
BenFradet 4be360d4ee [SPARK-11902][ML] Unhandled case in VectorAssembler#transform
There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType or VectorUDT.

So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType".

This PR aims to fix this, throwing a SparkException when dealing with an unknown column type.
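
The failure mode can be illustrated without Spark. This is a minimal sketch, not the code from this PR: `ColType` and its cases are hypothetical stand-ins for Spark SQL data types, and `IllegalArgumentException` stands in for `SparkException`. A match without a fallback case throws `scala.MatchError`; adding a fallback turns that into a descriptive error.

```scala
object AssemblerSketch {
  // Hypothetical stand-ins for Spark SQL data types (not Spark's classes).
  trait ColType
  case object DoubleCol extends ColType
  case object VectorCol extends ColType
  case object StringCol extends ColType

  // No fallback case: an unsupported type surfaces as scala.MatchError.
  def widthUnchecked(t: ColType): Int = t match {
    case DoubleCol => 1
    case VectorCol => 3
  }

  // Fallback case: an unsupported type produces a descriptive error instead.
  def widthChecked(name: String, t: ColType): Int = t match {
    case DoubleCol => 1
    case VectorCol => 3
    case other =>
      throw new IllegalArgumentException(s"Column $name has unsupported type $other")
  }
}
```

Calling `widthChecked("s", StringCol)` fails with a message that names both the column and the type, which is the kind of error a user can act on.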

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #9885 from BenFradet/SPARK-11902.
2015-11-22 22:05:01 -08:00
Yanbo Liang d9cf9c21fc [SPARK-11912][ML] ml.feature.PCA minor refactor
Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is a param and we should save it only under ```metadata/``` rather than under both ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` and construct ```mllib.feature.PCAModel``` inside ```transform```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9897 from yanboliang/spark-11912.
2015-11-22 21:56:07 -08:00
Joseph K. Bradley a6fda0bfc1 [SPARK-6791][ML] Add read/write for CrossValidator and Evaluators
I believe this works for general estimators within CrossValidator, including compound estimators.  (See the complex unit test.)

Added read/write for all 3 Evaluators as well.

CC: mengxr yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9848 from jkbradley/cv-io.
2015-11-22 21:48:48 -08:00
Yanbo Liang 9ace2e5c8d [SPARK-11852][ML] StandardScaler minor refactor
```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9839 from yanboliang/standardScaler-refactor.
2015-11-20 09:55:53 -08:00
Xusen Yin 3e1d120ced [SPARK-11867] Add save/load for kmeans and naive bayes
https://issues.apache.org/jira/browse/SPARK-11867

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9849 from yinxusen/SPARK-11867.
2015-11-19 23:43:18 -08:00
Joseph K. Bradley 0fff8eb3e4 [SPARK-11869][ML] Clean up TempDirectory properly in ML tests
Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```)

I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem.

CC: mengxr  Can you confirm this is fine?  I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.
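
The nesting problem can be reproduced with plain `java.io`. This is a sketch under assumptions: `createNested` and `deleteRecursively` are hypothetical helpers, not the test utilities touched by this PR. Deleting only `className/random_name` leaves `className` behind; deleting the parent removes both levels.

```scala
import java.io.File
import java.nio.file.Files
import java.util.UUID

object TempDirSketch {
  // Deletes a file or a directory tree.
  def deleteRecursively(f: File): Unit = {
    // listFiles returns null for non-directories, hence the Option wrapper.
    Option(f.listFiles).foreach(_.foreach(deleteRecursively))
    f.delete()
    ()
  }

  // Creates <tmp root>/<className>/<random_name> and returns (parent, tempDir).
  def createNested(className: String): (File, File) = {
    val root = Files.createTempDirectory("ml-test").toFile
    val parent = new File(root, className)
    val tempDir = new File(parent, UUID.randomUUID.toString)
    tempDir.mkdirs()
    (parent, tempDir)
  }
}
```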

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9851 from jkbradley/tempdir-cleanup.
2015-11-19 23:42:24 -08:00
Yanbo Liang 3b7f056da8 [SPARK-11829][ML] Add read/write to estimators under ml.feature (II)
Add read/write support to the following estimators under spark.ml:
* ChiSqSelector
* PCA
* VectorIndexer
* Word2Vec

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9838 from yanboliang/spark-11829.
2015-11-19 22:02:17 -08:00
Xusen Yin 4114ce20fb [SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression
https://issues.apache.org/jira/browse/SPARK-11846

mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9836 from yinxusen/SPARK-11846.
2015-11-19 22:01:02 -08:00
Joseph K. Bradley d02d5b9295 [SPARK-11842][ML] Small cleanups to existing Readers and Writers
Updates:
* Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression
* Switch from hand-written class names in Readers to using getClass

CC: mengxr

CC: yanboliang Would you mind taking a look at this PR?  mengxr might not be able to soon.  Thank you!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9829 from jkbradley/ml-io-cleanups.
2015-11-18 21:44:01 -08:00
Xiangrui Meng e99d339206 [SPARK-11839][ML] refactor save/write traits
* add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.*
* define `DefaultParamsReadable/Writable` and use them to save some code
* use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9827 from mengxr/SPARK-11839.
2015-11-18 18:34:01 -08:00
Xiangrui Meng 7e987de177 [SPARK-6787][ML] add read/write to estimators under ml.feature (1)
Add read/write support to the following estimators under spark.ml:

* CountVectorizer
* IDF
* MinMaxScaler
* StandardScaler (a little awkward because we store some params in spark.mllib model)
* StringIndexer

Added some necessary methods for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9798 from mengxr/SPARK-6787.
2015-11-18 15:47:49 -08:00
Yanbo Liang e222d75849 [SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example code to show:
  * supporting feature interaction in R formula.
  * summary for gaussian GLM model.
  * coefficients for binomial GLM model.

mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9727 from yanboliang/spark-11684.
2015-11-18 13:30:29 -08:00
Yuhao Yang e391abdf2c [SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.

Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables.
* The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes.
* The 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, and thus allows a larger vocabulary.

Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
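
To make the arithmetic concrete, here is a back-of-the-envelope sketch that takes the per-word estimates from this message at face value (320 bytes of vocab and vectorSize * 8 bytes of global tables per word). `ClosureSizeSketch` and `maxVocab` are hypothetical names, not part of Word2Vec.

```scala
object ClosureSizeSketch {
  // Per-word byte estimates taken from the commit message.
  val vocabBytesPerWord: Long = 40L * 2 * 4 // 320
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose serialized closure stays within Int.MaxValue bytes.
  def maxVocab(vectorSize: Int, serializeVocab: Boolean): Long = {
    val vocabPart = if (serializeVocab) vocabBytesPerWord else 0L
    Int.MaxValue.toLong / (tableBytesPerWord(vectorSize) + vocabPart)
  }
}
```

Under these estimates, with vectorSize = 20 the limit moves from `maxVocab(20, serializeVocab = true)` = 4473924 words to `maxVocab(20, serializeVocab = false)` = 13421772 words, about a threefold increase.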

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9803 from hhbyyh/w2vVocab.
2015-11-18 13:25:15 -08:00
Joseph K. Bradley 2acdf10b1f [SPARK-6789][ML] Add Readable, Writable support for spark.ml ALS, ALSModel
Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata.

CC: mengxr yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9786 from jkbradley/als-io.
2015-11-18 13:16:31 -08:00
Wenjian Huang 045a4f0458 [SPARK-6790][ML] Add spark.ml LinearRegression import/export
This replaces [https://github.com/apache/spark/pull/9656] with updates.

fayeshine should be the main author when this PR is committed.

CC: mengxr fayeshine

Author: Wenjian Huang <nextrush@163.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9814 from jkbradley/fayeshine-patch-6790.
2015-11-18 13:06:25 -08:00
RoyGaoVLIS 67a5132c21 [SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler
I have added a unit test for ML's StandardScaler by comparing with R's output. Please review.
Thanks.

Author: RoyGaoVLIS <roygao@zju.edu.cn>

Closes #6665 from RoyGao/7013.
2015-11-17 23:00:49 -08:00
Xiangrui Meng 3e9e638023 [SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector
This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9776 from mengxr/SPARK-11764.
2015-11-17 14:04:49 -08:00
Joseph K. Bradley 6eb7008b7f [SPARK-11763][ML] Add save,load to LogisticRegression Estimator
Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.

Moved LogisticRegressionReader/Writer to within LogisticRegressionModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9749 from jkbradley/lr-io-2.
2015-11-17 14:03:49 -08:00
Joseph K. Bradley d98d1cb000 [SPARK-11769][ML] Add save, load to all basic Transformers
This excludes Estimators and ones which include Vector and other non-basic types for Params or data.  This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9755 from jkbradley/transformer-io.
2015-11-17 12:43:56 -08:00
Xiangrui Meng 21fac54341 [SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
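
A rough sketch of what such a JSON encoding could look like, written in plain Scala rather than against the real `Vector` classes. The field names (`type`, `size`, `indices`, `values`) and the type codes (0 for sparse, 1 for dense) are assumed from the `VectorUDT` schema; `Vec`, `Dense` and `Sparse` are hypothetical stand-ins.

```scala
object VectorJsonSketch {
  // Hypothetical stand-ins for dense and sparse vectors.
  sealed trait Vec
  final case class Dense(values: Array[Double]) extends Vec
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

  private def arr[A](xs: Array[A]): String = xs.mkString("[", ",", "]")

  // Encodes a vector with the assumed VectorUDT-like field layout.
  def toJson(v: Vec): String = v match {
    case Dense(values) =>
      s"""{"type":1,"values":${arr(values)}}"""
    case Sparse(size, indices, values) =>
      s"""{"type":0,"size":$size,"indices":${arr(indices)},"values":${arr(values)}}"""
  }
}
```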

Author: Xiangrui Meng <meng@databricks.com>

Closes #9751 from mengxr/SPARK-11766.
2015-11-17 10:17:16 -08:00
Joseph K. Bradley 1c5475f140 [SPARK-11612][ML] Pipeline and PipelineModel persistence
Pipeline and PipelineModel extend Readable and Writable.  Persistence succeeds only when all stages are Writable.
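
The "all stages must be Writable" rule can be sketched as an up-front check. The traits below are simplified, hypothetical stand-ins rather than the spark.ml interfaces, and the exception type is an assumption.

```scala
object PipelineSaveSketch {
  // Simplified stand-ins for pipeline stages and the Writable trait.
  trait Stage { def uid: String }
  trait Writable { def save(path: String): Unit }

  // Uids of stages that cannot be saved; empty means the pipeline can be persisted.
  def nonWritable(stages: Seq[Stage]): Seq[String] =
    stages.collect { case s if !s.isInstanceOf[Writable] => s.uid }

  // Fails before writing anything if any stage is not writable.
  def validate(stages: Seq[Stage]): Unit = {
    val bad = nonWritable(stages)
    if (bad.nonEmpty) {
      val names = bad.mkString(", ")
      throw new UnsupportedOperationException(s"Stages not writable: $names")
    }
  }
}
```

Checking every stage before writing avoids leaving a half-saved pipeline on disk when a later stage turns out not to support persistence.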

Note: This PR reinstates tests for other read/write functionality.  It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9674 from jkbradley/pipeline-io.
2015-11-16 17:12:39 -08:00
Xiangrui Meng 64e5551103 [SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite
The same as #9694, but for Java test suite. yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #9719 from mengxr/SPARK-11672.4.
2015-11-15 13:23:05 -08:00
Xiangrui Meng bdfbc1dcaf [MINOR][ML] remove MLlibTestsSparkContext from ImpuritySuite
ImpuritySuite doesn't need SparkContext.

Author: Xiangrui Meng <meng@databricks.com>

Closes #9698 from mengxr/remove-mllib-test-context-in-impurity-suite.
2015-11-13 13:19:04 -08:00
Xiangrui Meng 2d2411faa2 [SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll
Still saw some error messages caused by `SQLContext.getOrCreate`:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/

This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext.

cc: yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #9694 from mengxr/SPARK-11672.3.
2015-11-13 13:09:28 -08:00
Yanbo Liang 99693fef0a [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include:
* Use libSVM data source for all example code under examples/ml, and remove unused imports.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides mistakenly use ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.

mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9690 from yanboliang/spark-11723.
2015-11-13 08:43:05 -08:00
Xiangrui Meng e71c07557c [SPARK-11672][ML] flaky spark.ml read/write tests
We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext`, so `SQLContext.getOrCreate` might use the `SparkContext` from a previous test suite, which causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites.
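
A toy model of the leak, for illustration only. `Ctx` and this `getOrCreate` are hypothetical and far simpler than `SQLContext`; the point is just that a global "active" slot outlives the suite that set it unless it is cleared.

```scala
object ActiveContextSketch {
  // Toy stand-in for a context owned by one test suite (not Spark's SQLContext).
  final class Ctx(val suite: String)

  private var active: Option[Ctx] = None

  def setActive(c: Ctx): Unit = { active = Some(c) }
  def clearActive(): Unit = { active = None }

  // Prefers the active context; otherwise creates one and makes it active.
  def getOrCreate(create: => Ctx): Ctx = active.getOrElse {
    val c = create
    active = Some(c)
    c
  }
}
```

If suite A sets the active context and never clears it, suite B's `getOrCreate` returns A's stale context; calling `clearActive()` between suites makes B get its own.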

cc: yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #9677 from mengxr/SPARK-11672.2.
2015-11-12 20:01:13 -08:00
Joseph K. Bradley dcb896fd8c [SPARK-11712][ML] Make spark.ml LDAModel be abstract
Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.

CC feynmanliang mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9678 from jkbradley/lda-pipelines-2.
2015-11-12 17:03:19 -08:00
Xiangrui Meng e2957bc085 [SPARK-11674][ML] add private val after @transient in Word2VecModel
This causes compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested compile locally.) JoshRosen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9644 from mengxr/SPARK-11674.
2015-11-11 21:01:14 -08:00
Xiangrui Meng 1a21be15f6 [SPARK-11672][ML] disable spark.ml read/write tests
Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/

I cannot reproduce it locally, so this PR temporarily disables the tests and I will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.

Author: Xiangrui Meng <meng@databricks.com>

Closes #9641 from mengxr/SPARK-11672.
2015-11-11 15:41:36 -08:00
Yuming Wang 27524a3a9c [SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow
org.apache.spark.ml.feature.Word2Vec.transform() is very slow. We should not read the broadcast for every sentence.

Author: Yuming Wang <q79969786@gmail.com>
Author: yuming.wang <q79969786@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #9592 from 979969786/master.
2015-11-11 09:43:26 -08:00
Joseph K. Bradley 6e101d2e9d [SPARK-6726][ML] Import/export for spark.ml LogisticRegressionModel
This PR adds model save/load for spark.ml's LogisticRegressionModel.  It also does minor refactoring of the default save/load classes to reuse code.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9606 from jkbradley/logreg-io2.
2015-11-10 18:45:48 -08:00
Yu ISHIKAWA c0e48dfa61 [SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in Python
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9534 from yu-iskw/SPARK-11566.
2015-11-10 16:42:28 -08:00
Joseph K. Bradley e281b87398 [SPARK-5565][ML] LDA wrapper for Pipelines API
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.
2015-11-10 16:20:10 -08:00
unknown dba1a62cf1 [SPARK-7316][MLLIB] RDD sliding window with step
Implementation of step capability for sliding window function in MLlib's RDD.

Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:

Window | Step | Time (s) | Windows produced
------------ | ------------- | ---------- | ----------
128 | 1 |  6.38 | 9999873
128 | 10 | 0.9 | 999988
128 | 100 | 0.41 | 99999
1024 | 1 | 44.67 | 9998977
1024 | 10 | 4.74 | 999898
1024 | 100 | 0.78 | 99990
```
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 10000000, 10)
rdd.count // run one job before timing
val window = 1024
val step = 1
val t = System.nanoTime()
val windows = rdd.sliding(window, step)
println(windows.count) // number of windows produced
println((System.nanoTime() - t) / 1e9) // elapsed time in seconds
```
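
As a cross-check of the "Windows produced" column, the same semantics can be sketched on local collections. This assumes only full windows are kept, which is what the counts in the table imply; `SlidingSketch` is a hypothetical helper, not part of `RDDFunctions`.

```scala
object SlidingSketch {
  // Full windows only: Scala's sliding may emit a shorter trailing window, so filter it out.
  def slidingFull[A](xs: Seq[A], window: Int, step: Int): List[Seq[A]] =
    xs.sliding(window, step).filter(_.size == window).toList

  // Number of full windows over n elements.
  def windowCount(n: Long, window: Long, step: Long): Long =
    if (n < window) 0L else (n - window) / step + 1
}
```

For example, `windowCount(10000000L, 128L, 10L)` gives 999988 and `windowCount(10000000L, 1024L, 100L)` gives 99990, matching the table.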

Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
Author: Alexander Ulanov <nashb@yandex.ru>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5855 from avulanov/SPARK-7316-sliding.
2015-11-10 14:25:06 -08:00
Joseph K. Bradley 18350a5700 [SPARK-11618][ML] Minor refactoring of basic ML import/export
Refactoring
* separated overwrite and param save logic in DefaultParamsWriter
* added sparkVersion to DefaultParamsWriter

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9587 from jkbradley/logreg-io.
2015-11-10 11:36:43 -08:00
Yuhao Yang 61f9c8711c [SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase
jira: https://issues.apache.org/jira/browse/SPARK-11069
quotes from jira:
Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
call the Boolean Param "toLowercase"
set default to false (so behavior does not change)

Actually sklearn converts to lowercase before tokenizing too
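
The proposed option amounts to an optional lowercasing step before splitting. A minimal sketch in plain Scala, following the proposal quoted above (Boolean flag, default false); `tokenize` is a hypothetical function and does not model RegexTokenizer's other params.

```scala
object TokenizerSketch {
  // Splits on `pattern`; lowercases first only when `toLowercase` is true.
  def tokenize(text: String, pattern: String = "\\s+", toLowercase: Boolean = false): Seq[String] = {
    val input = if (toLowercase) text.toLowerCase else text
    input.split(pattern).filter(_.nonEmpty).toSeq
  }
}
```

With the default, `tokenize("Hello Spark ML")` keeps the original case; passing `toLowercase = true` yields `hello`, `spark`, `ml`.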

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9092 from hhbyyh/tokenLower.
2015-11-09 16:55:23 -08:00
Yu ISHIKAWA 8a2336893a [SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation or spark.ml APIs; I will send further PRs for those later.
https://issues.apache.org/jira/browse/SPARK-6517

- This implementation is based on bisecting k-means clustering.
    - It derives from freeman-lab's implementation.
- The basic idea is unchanged from the previous version. (#2906)
    - However, it is 1000x faster than the previous version through parallel processing.

Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>

Closes #5267 from yu-iskw/new-hierarchical-clustering.
2015-11-09 14:56:36 -08:00
fazlan-nazeem 9b88e1dcad [SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model
The pmml models currently generated do not specify the pmml version in their root node. This is a problem when using these pmml models in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2.

Author: fazlan-nazeem <fazlann@wso2.com>

Closes #9558 from fazlan-nazeem/master.
2015-11-09 08:58:55 -08:00
Yanbo Liang 8c0e1b50e9 [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression
Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:
```Java
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9561 from yanboliang/spark-11494.
2015-11-09 08:56:22 -08:00
Yu ISHIKAWA 2ff0e79a86 [SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python
Could jkbradley and davies review it?

- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly: `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8643 from yu-iskw/SPARK-8467-2.
2015-11-06 22:56:29 -08:00
Xiangrui Meng c447c9d546 [SPARK-11217][ML] save/load for non-meta estimators and transformers
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:

* class name
* uid
* timestamp
* paramMap

The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.

~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")

Instance.load("path")
~~~

The param handling is different from the design doc. We didn't save default and user-set params separately, so when we load an instance back, all parameters are user-set. This does cause issues. But it also causes other issues if we modify the default params.
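
The consequence of flattening defaults and user-set values can be sketched with a small model. `Params`, `save` and `load` here are hypothetical and use plain string maps; they are not the spark.ml param classes.

```scala
object ParamSaveSketch {
  // Hypothetical model of an instance's params: defaults plus user-set overrides.
  final case class Params(defaults: Map[String, String], userSet: Map[String, String]) {
    def effective: Map[String, String] = defaults ++ userSet
  }

  // Saving flattens both maps into a single paramMap.
  def save(p: Params): Map[String, String] = p.effective

  // Loading cannot tell which entries were defaults, so every saved entry becomes user-set.
  def load(saved: Map[String, String], currentDefaults: Map[String, String]): Params =
    Params(currentDefaults, saved)
}
```

In this model, a value that was only a default at save time comes back as user-set, so a later change to the class defaults does not affect the loaded instance.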

TODOs:

* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers

cc jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9454 from mengxr/SPARK-11217.
2015-11-06 14:51:03 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
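
The issue can be shown with the Scala standard library alone. This is a sketch of the general idea, not necessarily the exact change in this PR: widening a 32-bit hash to `Long` only sign-extends it, so the high 32 bits are all zeros or all ones. One way to fill all 64 bits is to combine two 32-bit hashes; `HashSeedSketch` and its method names are hypothetical.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

object HashSeedSketch {
  private def seedBytes(seed: Long): Array[Byte] =
    ByteBuffer.allocate(8).putLong(seed).array()

  // A 32-bit hash widened to Long: the high 32 bits are only sign extension.
  def naive(seed: Long): Long = MurmurHash3.bytesHash(seedBytes(seed)).toLong

  // Two 32-bit hashes combined so that all 64 bits depend on the hash.
  def combined(seed: Long): Long = {
    val bytes = seedBytes(seed)
    val low = MurmurHash3.bytesHash(bytes)
    val high = MurmurHash3.bytesHash(bytes, low)
    (high.toLong << 32) | (low.toLong & 0xFFFFFFFFL)
  }
}
```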

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Yu ISHIKAWA 8fa8c8375d [SPARK-11514][ML] Pass random seed to spark.ml DecisionTree*
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9486 from yu-iskw/SPARK-11514.
2015-11-05 17:59:01 -08:00
Ehsan M.Kermani f80f7b69a3 [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression
Here is my first commit.

Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>

Closes #8728 from ehsanmok/SinceAnn.
2015-11-05 12:11:57 -08:00