Commit graph

1409 commits

Author SHA1 Message Date
wm624@hotmail.com b6fa7e5934 [SPARK-14571][ML] Log instrumentation in ALS
## What changes were proposed in this pull request?

Add log instrumentation for parameters:
rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
userCol, itemCol, ratingCol, predictionCol, maxIter,
regParam, nonnegative, checkpointInterval, seed

Add log instrumentation for numUserFeatures and numItemFeatures

## How was this patch tested?

Manual test: Set breakpoint in intellij and run def testALS(). Single step debugging and check the log method is called.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12560 from wangmiao1981/log.
2016-04-29 16:18:25 +02:00
dding3 6d5aeaae26 [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient
## What changes were proposed in this pull request?

This PR removes duplicate implementation of compute in LogisticGradient class

## How was this patch tested?

unit tests

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12747 from dding3/master.
2016-04-29 10:19:51 +01:00
Sean Owen d1cf320105 [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
## What changes were proposed in this pull request?

Handle case where number of predictions is less than label set, k in nDCG computation

## How was this patch tested?

New unit test; existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12756 from srowen/SPARK-14886.
2016-04-29 09:21:27 +02:00
Zheng RuiFeng cabd54d931 [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
## What changes were proposed in this pull request?
According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12596 from zhengruifeng/deprecate_sgd.
2016-04-28 22:44:14 -07:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
Joseph K. Bradley 4f4721a21c [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata
## What changes were proposed in this pull request?

Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
* They first check for metadata.
* If numClasses is not specified in metadata, they identify the largest label value (up to a limit).

This functionality is implemented in a new Classifier.getNumClasses method.

Also
* Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.

## How was this patch tested?

* Unit tests in ClassifierSuite for helper methods
* Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12663 from jkbradley/trees-no-metadata.
2016-04-28 16:20:00 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Yuhao Yang d5ab42ceb9 [SPARK-14916][MLLIB] A more friendly tostring for FreqItemset in mllib.fpm
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14916
FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers.
sample:
{a, b}: 5
{x, y, z}: 4

## How was this patch tested?

existing unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12698 from hhbyyh/freqtos.
2016-04-28 19:52:09 +01:00
Joseph K. Bradley 5ee72454df [SPARK-14852][ML] refactored GLM summary into training, non-training summaries
## What changes were proposed in this pull request?

This splits GeneralizedLinearRegressionSummary into 2 summary types:
* GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
* GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting

This also add a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset.

The summary no longer provides the model itself as a public val.

Also:
* Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
* Adds hasSummary method.
* Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
* In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
* Pardon the style fixes; that is IntelliJ being obnoxious.

## How was this patch tested?

Existing unit tests + updated test for evaluate and hasSummary

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12624 from jkbradley/model-summary-api.
2016-04-28 11:22:13 -07:00
Liang-Chi Hsieh 7c6937a885 [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
## What changes were proposed in this pull request?

Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.

For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.

We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.

## How was this patch tested?

`UserDefinedTypeSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12259 from viirya/improve-sql-usertype.
2016-04-28 01:14:49 -07:00
Joseph K. Bradley f5ebb18c45 [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage
## What changes were proposed in this pull request?

Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6.  This tends to occur when using a mix of transformers from ml.feature. It is because Java Arrays are non-covariant and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage].  This PR modifies the following to accept subclasses of PipelineStage:
* Pipeline.setStages()
* Params.w()

## How was this patch tested?

Unit test which fails to compile before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12430 from jkbradley/pipeline-setstages.
2016-04-27 16:11:12 -07:00
Yanbo Liang 4672e9838b [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12702 from yanboliang/spark-14899.
2016-04-27 14:08:26 -07:00
Mike Dusenberry 607f50341c [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 19:48:05 +02:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joseph K. Bradley 6c5a837c50 [SPARK-12301][ML] Made all tree and ensemble classes not final
## What changes were proposed in this pull request?

There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms.

This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final.  This matches most other spark.ml algorithms.

Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12711 from jkbradley/final-trees.
2016-04-26 14:44:39 -07:00
Dongjoon Hyun e4f3eec5b7 [SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.save
## What changes were proposed in this pull request?

This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write.
```
- val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
- // TODO: repartition with 1 partition after SPARK-5532 gets fixed
- dataRDD.write.parquet(Loader.dataPath(path))
+ sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12676 from dongjoon-hyun/SPARK-14907.
2016-04-26 13:58:29 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated  ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Andrew Or 2a3d39f48b [MINOR] Follow-up to #12625
## What changes were proposed in this pull request?

That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.

Author: Andrew Or <andrew@databricks.com>

Closes #12686 from andrewor14/visibility.
2016-04-26 11:08:08 -07:00
Reynold Xin 5cb03220a0 [SPARK-14912][SQL] Propagate data source options to Hadoop configuration
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.

## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.

Author: Reynold Xin <rxin@databricks.com>

Closes #12688 from rxin/SPARK-14912.
2016-04-26 10:58:56 -07:00
Yanbo Liang 92f66331b4 [SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12685 from yanboliang/spark-14313.
2016-04-26 10:30:24 -07:00
BenFradet 2a5c930790 [SPARK-13962][ML] spark.ml Evaluators should support other numeric types for label
## What changes were proposed in this pull request?

Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12500 from BenFradet/SPARK-13962.
2016-04-26 08:55:50 +02:00
Andrew Or 18c2c92580 [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession
## What changes were proposed in this pull request?

In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.

In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.

**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.

## How was this patch tested?

No change in functionality intended.

Author: Andrew Or <andrew@databricks.com>

Closes #12625 from andrewor14/spark-session-refactor.
2016-04-25 20:54:31 -07:00
Yanbo Liang 9cb3ba1013 [SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```

## How was this patch tested?
Add unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12573 from yanboliang/spark-14312.
2016-04-25 14:08:41 -07:00
Yanbo Liang 425f691646 [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3
## What changes were proposed in this pull request?
As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method.

Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.

## How was this patch tested?
unit tests.

cc jkbradley MLnick

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12498 from yanboliang/spark-10574.
2016-04-25 12:08:43 -07:00
wm624@hotmail.com b50e2eca93 [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture
## What changes were proposed in this pull request?

Add Python API in ML for GaussianMixture

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Add doctest and test cases are the same as mllib Python tests
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12402 from wangmiao1981/gmm.
2016-04-25 10:48:15 -07:00
Zheng RuiFeng e6f954a579 [SPARK-14758][ML] Add checking for StepSize and Tol
## What changes were proposed in this pull request?
add the checking for StepSize and Tol in sharedParams

## How was this patch tested?
Unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12530 from zhengruifeng/ml_args_checking.
2016-04-25 10:30:55 +02:00
Dongjoon Hyun d34d650378 [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
## What changes were proposed in this pull request?

Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.

- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?

After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12632 from dongjoon-hyun/SPARK-14868.
2016-04-24 20:40:03 -07:00
Zheng RuiFeng 86ca8fefc8 [MINOR][ML][MLLIB] Remove unused imports
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12497 from zhengruifeng/del_unused_imports.
2016-04-22 23:20:10 -07:00
Liang-Chi Hsieh 8098f15857 [SPARK-14843][ML] Fix encoding error in LibSVMRelation
## What changes were proposed in this pull request?

We use `RowEncoder` in libsvm data source to serialize the label and features read from libsvm files. However, the schema passed in this encoder is not correct. As the result, we can't correctly select `features` column from the DataFrame. We should use full data schema instead of `requiredSchema` to serialize the data read in. Then do projection to select required columns later.

## How was this patch tested?
`LibSVMRelationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12611 from viirya/fix-libsvm.
2016-04-23 01:11:36 +08:00
Zheng RuiFeng 92675471b7 [MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifier
## What changes were proposed in this pull request?
1, fix the indentation
2, add a missing param desc

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12499 from zhengruifeng/fix_doc.
2016-04-22 14:52:37 +01:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
Yanbo Liang 4e726227a3 [SPARK-14479][ML] GLM supports output link prediction
## What changes were proposed in this pull request?
GLM supports output link prediction.
## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12287 from yanboliang/spark-14479.
2016-04-21 17:31:33 -07:00
Joseph K. Bradley f25a3ea8d3 [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib Vector, Matrix types
## What changes were proposed in this pull request?

For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another.
This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types.

## How was this patch tested?

Unit tests for all conversions

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12504 from jkbradley/linalg-conversions.
2016-04-21 16:50:09 -07:00
Xin Ren 6d1e4c4a65 [SPARK-14569][ML] Log instrumentation in KMeans
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14569

Log instrumentation in KMeans:

- featuresCol
- predictionCol
- k
- initMode
- initSteps
- maxIter
- seed
- tol
- summary

## How was this patch tested?

Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample

Author: Xin Ren <iamshrek@126.com>

Closes #12432 from keypointt/SPARK-14569.
2016-04-21 16:29:39 -07:00
Joseph K. Bradley acc7e592c4 [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
## What changes were proposed in this pull request?

Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does.

This PR documents this fact.

## How was this patch tested?

doc only

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12519 from jkbradley/scaler-variance-doc.
2016-04-20 11:48:30 -07:00
Liwei Lin 17db4bfeaa [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
## What changes were proposed in this pull request?

- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12450 from lw-lin/fix-fs-get.
2016-04-20 11:28:51 +01:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Xusen Yin b64482f49f [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14306

Add PySpark OneVsRest save/load supports.

## How was this patch tested?

Test with Python unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12439 from yinxusen/SPARK-14306-0415.
2016-04-18 11:52:29 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Yanbo Liang 83af297ac4 [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12393 from yanboliang/spark-13925.
2016-04-15 08:23:51 -07:00
Pravin Gadakh e24923267f [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizer
## What changes were proposed in this pull request?

Removed duplicated generation of `ids` in OnlineLDAOptimizer.

## How was this patch tested?

tested with existing unit tests.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12176 from pravingadakh/SPARK-14370.
2016-04-15 13:08:30 +01:00
DB Tsai 96534aa47c [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local
## What changes were proposed in this pull request?

This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /*  since 1.2.0 */

The BLAS implementation will be copied, and some of the test utilities will be copies as well.

Summary of changes:

1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
  - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR.
  - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - `Since` was removed.
  - `UDT` related code was removed.
  - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
  - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12317 from dbtsai/dbtsai-ml-vector.
2016-04-15 01:17:03 -07:00
Fokko Driesprong c80586d9e8 [SPARK-12869] Implemented an improved version of the toIndexedRowMatrix
Hi guys,

I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

Author: Fokko Driesprong <f.driesprong@catawiki.nl>
Author: Fokko Driesprong <fokko@driesprongen.nl>

Closes #10839 from Fokko/master.
2016-04-14 17:32:20 -07:00
Yong Tang 01dd1f5c07 [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
## What changes were proposed in this pull request?

This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and
parseDouble, for the purpose of robustness and maintainability.

## How was this patch tested?

Existing tests passed.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12360 from yongtang/SPARK-14565.
2016-04-14 17:23:16 -07:00
Joseph K. Bradley bf65c87f70 [SPARK-14618][ML][DOC] Updated RegressionEvaluator.metricName param doc
## What changes were proposed in this pull request?

In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.

## How was this patch tested?

no tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12377 from jkbradley/regeval-doc.
2016-04-14 12:44:59 -07:00
Sean Owen 9fa43a33b9 [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place
## What changes were proposed in this pull request?

Move json4s, breeze dependency declaration into parent

## How was this patch tested?

Should be no functional change, but Jenkins tests will test that.

Author: Sean Owen <sowen@cloudera.com>

Closes #12390 from srowen/SPARK-14612.
2016-04-14 10:48:17 -07:00
Yanbo Liang a91aaf5a8c [SPARK-14375][ML] Unit test for spark.ml KMeansSummary
## What changes were proposed in this pull request?
* Modify ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters.
* Add unit test for spark.ml ```KMeansSummary```.
* Add Since tag.

## How was this patch tested?
unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12254 from yanboliang/spark-14375.
2016-04-13 13:23:10 -07:00
Yanbo Liang 0d17593b32 [SPARK-14461][ML] GLM training summaries should provide solver
## What changes were proposed in this pull request?
GLM training summaries should provide solver.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12253 from yanboliang/spark-14461.
2016-04-13 13:20:29 -07:00
Yanbo Liang b0adb9f543 [SPARK-10386][MLLIB] PrefixSpanModel supports save/load
```PrefixSpanModel``` supports ```save/load```. It's similar with #9267.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10664 from yanboliang/spark-10386.
2016-04-13 13:18:02 -07:00