## What changes were proposed in this pull request?
Remove last usage of jblas, in tests
## How was this patch tested?
Jenkins tests -- the same ones that are being modified.
Author: Sean Owen <sowen@cloudera.com>
Closes#11560 from srowen/SPARK-13715.
`HadoopFsRelation` is used for reading most files into Spark SQL. However, today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, libsvm) have been ported to the new API `FileFormat`. External libraries, such as spark-avro, will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This is an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
    sqlContext: SQLContext,
    location: FileCatalog,
    partitionSchema: StructType,
    dataSchema: StructType,
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String]) extends BaseRelation
```
### FileFormat
The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
  def inferSchema(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  def prepareWrite(
      sqlContext: SQLContext,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory

  def buildInternalScan(
      sqlContext: SQLContext,
      dataSchema: StructType,
      requiredColumns: Array[String],
      filters: Array[Filter],
      bucketSet: Option[BitSet],
      inputFiles: Array[FileStatus],
      broadcastedConf: Broadcast[SerializableConfiguration],
      options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
  def paths: Seq[Path]
  def partitionSpec(schema: Option[StructType]): PartitionSpec
  def allFiles(): Seq[FileStatus]
  def getStatus(path: Path): Array[FileStatus]
  def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
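For illustration, here is a minimal sketch (the format, path, and schema are made up) of a `DataFrameReader` call whose pieces map onto the description fields above: the format string is the `provider`, the `.schema(...)` call supplies `userSpecifiedSchema`, `.option(...)` entries become `options`, and the load path becomes `paths`.
```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object ResolvedDataSourceSketch {
  def read(sqlContext: SQLContext) = {
    // userSpecifiedSchema
    val schema = new StructType()
      .add("id", LongType)
      .add("name", StringType)

    sqlContext.read
      .format("json")                 // provider
      .schema(schema)                 // userSpecifiedSchema
      .option("mode", "PERMISSIVE")   // options
      .load("/data/events")           // paths
  }
}
```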
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters.
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
*In the future we should do the following:
- Break out file handling into its own Strategy as it's sufficiently complex / isolated.
- Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11509 from marmbrus/fileDataSource.
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` on the Scala side and fix a bug of a missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11203 from yinxusen/SPARK-13036.
## What changes were proposed in this pull request?
It avoids counting the dataframe twice.
Author: Abou Haydar Elias <abouhaydar.elias@gmail.com>
Author: Elie A <abouhaydar.elias@gmail.com>
Closes#11491 from eliasah/quantile-discretizer-patch.
## What changes were proposed in this pull request?
This PR fixes typos in comments and test case names in the code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
Remove duplicated periods at the end of some sharedParams in ScalaDoc, such as [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367)
cc mengxr srowen
## How was this patch tested?
Documents change, no test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11344 from yanboliang/shared-cleanup.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims to remove unused imports from Java/Scala code and to add the `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains
- find + nonEmpty -> exists
- filter + size -> count
- reduce(_+_) -> sum
- map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
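To make a few of these rewrites concrete, here is a small self-contained sketch (toy data, not excerpts from the patch) showing representative before/after pairs:
```scala
object StaticAnalysisExamples {
  def main(args: Array[String]): Unit = {
    val m = Map("a" -> 1)
    val xs = Seq(1, 2, 3)
    val arr = Array("x", "y")

    // get(a) + getOrElse(b) -> getOrElse(a, b)
    val before1 = m.get("b").getOrElse(0)
    val after1  = m.getOrElse("b", 0)

    // exists(_ == x) -> contains(x)
    val before2 = xs.exists(_ == 2)
    val after2  = xs.contains(2)

    // filter + size -> count
    val before3 = xs.filter(_ > 1).size
    val after3  = xs.count(_ > 1)

    // Array/String .size -> .length (avoids an implicit conversion)
    val before4 = arr.size
    val after4  = arr.length

    // reduce(_ + _) -> sum
    val before5 = xs.reduce(_ + _)
    val after5  = xs.sum

    assert(before1 == after1 && before2 == after2 && before3 == after3 &&
      before4 == after4 && before5 == after5)
  }
}
```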
## How was this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
Estimator for Generalized Linear Models (GLMs), which will be solved by IRLS.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11136 from yanboliang/spark-12811.
JIRA: https://issues.apache.org/jira/browse/SPARK-13506
## What changes were proposed in this pull request?
Just change the R snippet comment in AssociationRulesSuite.
## How was this patch tested?
unit test passed
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11387 from zhengruifeng/ars.
## What changes were proposed in this pull request?
* The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
* BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route.
* Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly.
cc mengxr dbtsai
## How was this patch tested?
No new tests, it should pass all current tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11424 from yanboliang/spark-13545.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the tree module.
closes#10601
Author: Bryan Cutler <cutlerb@gmail.com>
Author: vijaykiran <mail@vijaykiran.com>
Closes#11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
## What changes were proposed in this pull request?
This is another try of PR #11323.
This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.
## How was this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11388 from liancheng/remove-df-rdd-ops.
jira: https://issues.apache.org/jira/browse/SPARK-13028
MaxAbsScaler works in a very similar way to MinMaxScaler, but scales the data so that the training data lies within the range [-1, 1], by dividing by the largest maximum absolute value in each feature. The motivation for this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.
Unlike StandardScaler and MinMaxScaler, MaxAbsScaler does not shift/center the data, and thus does not destroy any sparsity.
Something similar from sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler
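A minimal usage sketch of the new estimator (column names and values are made up, and the import paths assume the package layout at the time of this PR):
```scala
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

object MaxAbsScalerExample {
  def run(sqlContext: SQLContext): Unit = {
    val data = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.0, -8.0)),
      (1, Vectors.dense(2.0, 1.0, 4.0)),
      (2, Vectors.dense(4.0, 10.0, 2.0))
    )).toDF("id", "features")

    val scaler = new MaxAbsScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")

    val model = scaler.fit(data)
    model.transform(data).show()   // each feature is divided by its max absolute value, so values lie in [-1, 1]
  }
}
```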
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10939 from hhbyyh/maxabs and squashes the following commits:
fd8bdcd [Yuhao Yang] add tag and some optimization on fit
648fced [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
75bebc2 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
cb10bb6 [Yuhao Yang] remove minmax
91ef8f3 [Yuhao Yang] ut added
8ab0747 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into maxabs
a9215b5 [Yuhao Yang] max abs scaler
## What changes were proposed in this pull request?
ML StringIndexer does not protect itself from column name duplication.
We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`; however, it would be better to address that in another issue.
## How was this patch tested?
unit test
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11370 from yu-iskw/SPARK-12874.
## What changes were proposed in this pull request?
This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
## How was this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11323 from liancheng/remove-df-rdd-ops.
## What changes were proposed in this pull request?
Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` param rather than re-fetching it from the ```ParamMap``` for every OWLQN iteration.
cc srowen
## How was this patch tested?
No extra tests are added. It should pass all existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11367 from yanboliang/spark-13490.
## What changes were proposed in this pull request?
Change line 113 of QuantileDiscretizer.scala to
`val requiredSamples = math.max(numBins * numBins, 10000.0)`
so that `requiredSamples` is a `Double`. This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
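A self-contained sketch of the arithmetic behind the fix (variable names mirror the description above; the count is made up): with an integer numerator the sampling fraction truncates to zero whenever `requiredSamples < dataset.count`, while the `Double` version keeps a usable fraction.
```scala
object RequiredSamplesExample {
  def main(args: Array[String]): Unit = {
    val numBins = 10
    val count = 1000000L

    val intSamples = math.max(numBins * numBins, 10000)        // Int
    val buggyFraction = intSamples / count                      // 0 (integer division)

    val requiredSamples = math.max(numBins * numBins, 10000.0)  // Double, as in the fix
    val fraction = requiredSamples / count                      // 0.01

    println(s"buggy fraction = $buggyFraction, fixed fraction = $fraction")
  }
}
```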
## How was this patch tested?
Manual tests. I was having problems using QuantileDiscretizer with a dataset, and after making this change QuantileDiscretizer behaves as expected.
Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>
Closes#11319 from oliverpierson/SPARK-13444.
`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave
Author: Xiangrui Meng <meng@databricks.com>
Closes#11226 from mengxr/SPARK-13355.
## What changes were proposed in this pull request?
In order to provide better and consistent result, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6``` which will be equal to ML ```LogisticRegression```.
cc dbtsai
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11299 from yanboliang/spark-13429.
As also marked by a TODO in the AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated
with another array which contains the betas; the resulting array is converted into a dense vector, which in turn is converted into a Breeze vector.
This is expensive and not particularly elegant.
I've tried to solve the above-mentioned problem by a simple algebraic decomposition - keeping and treating the intercept independently.
Please let me know what you think and if you have any questions.
Thanks,
Narine
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes#11179 from NarineK/survivaloptim.
ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11214 from yanboliang/spark-13334.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules.
Closes#10602, Closes#10897
Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes#11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
Add new similarity functions and a normalization option for distances in synonym finding.
Add a new accessor for internal structure (the vocabulary and word index) for convenience.
Need instructions on how to set the value for the Since annotation for newly added public functions. 1.5.3?
jira link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes#10152 from ygcao/improvementForSentenceBoundary.
## What changes were proposed in this pull request?
Fix the MLlib LogisticRegressionWithLBFGS regularization mapping (a sketch follows the list) as:
```SquaredL2Updater``` -> ```elasticNetParam = 0.0```
```L1Updater``` -> ```elasticNetParam = 1.0```
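A minimal sketch of the corrected mapping (the helper name is made up; the real change lives inside `LogisticRegressionWithLBFGS`):
```scala
import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater, Updater}

object UpdaterToElasticNet {
  def elasticNetParamFor(updater: Updater): Double = updater match {
    case _: SquaredL2Updater => 0.0 // pure L2 / ridge
    case _: L1Updater        => 1.0 // pure L1 / lasso
    case other =>
      throw new IllegalArgumentException(s"Unsupported updater: ${other.getClass.getName}")
  }
}
```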
cc dbtsai
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11258 from yanboliang/spark-13379.
This PR fixes some warnings found by `build/sbt mllib/test:compile`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#11227 from mengxr/fix-mllib-warnings-201602.
This documents the implementation of ALS in `spark.ml` with example code in scala, java and python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10411 from BenFradet/SPARK-12247.
This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
A use case for this enhancement is for when a user wants to Binarize many similar feature columns at once using the same threshold value (for example a binary threshold applied to many pixels in an image).
This contribution is my original work and I license the work to the project under the project's open source license.
viirya mengxr
Author: seddonm1 <seddonm1@gmail.com>
Closes#10976 from seddonm1/master.
JIRA: https://issues.apache.org/jira/browse/SPARK-12363
This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#10539 from viirya/fix-poweriter.
jkbradley I tried to improve the function that exports a model. Under Spark 1.6 we couldn't export a model to S3, so it should support S3 besides HDFS. Can you review it when you have time? Thanks!
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11151 from yu-iskw/SPARK-13265.
In spark-env.sh.template there are multi-byte characters; this PR removes them.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#11149 from sasakitoa/remove_multibyte_in_sparkenv.
JIRA: https://issues.apache.org/jira/browse/SPARK-10524
Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8734 from viirya/dt-soft-centroids.
KMeans:
Make a private, non-deprecated version of the setRuns API so that we can call it from the Python API without deprecation warnings in our own build. Also use it internally when called from train. Add a logWarning for non-1 values.
MFDataGenerator:
Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.
I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
Author: Holden Karau <holden@us.ibm.com>
Closes#11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
Cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer.
Also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
This change improves training times for one of my test sets from ~7m30s to ~4m30s.
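An illustrative sketch of the pattern being applied, not the patch itself: the param is resolved once into a local value before the loops, instead of being looked up on every index and every optimizer step.
```scala
object CacheParamSketch {
  def train(paramMap: Map[String, Any], numIterations: Int, numFeatures: Int): Unit = {
    // before: paramMap("standardization") was looked up numIterations * numFeatures times;
    // after: one lookup, captured by the loops below
    val standardization = paramMap("standardization").asInstanceOf[Boolean]

    for (_ <- 0 until numIterations; _ <- 0 until numFeatures) {
      if (standardization) {
        // apply the standardized gradient update
      } else {
        // apply the unstandardized update
      }
    }
  }
}
```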
Author: Gary King <gary@idibon.com>
Closes#11027 from idigary/spark-13132-optimize-logistic-regression.
Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently.
Author: Imran Younus <iyounus@us.ibm.com>
Closes#10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
Fixes the problem and verifies the fix with a test suite.
Also adds an optional parameter nullable (Boolean) to SchemaUtils.appendColumn
and deduplicates the SchemaUtils.appendColumn functions.
Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
Closes#10741 from grzegorz-chilkiewicz/master.
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10608 from JoshRosen/SPARK-6363.
Implement the ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```.
There are two limitations in the current implementation compared with R:
* It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code:
```
glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial)
```
* It does not support ```offset```.
Because ```RFormula``` does not support ```Tuple``` as a label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS.
The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10639 from yanboliang/spark-9835.
The intercept in Logistic Regression represents a prior on categories and should not be regularized. In MLlib, regularization is handled through the Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API.
Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.
Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
… Add LibSVMOutputWriter
The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter:
* Partitioning is still not supported
* Multiple input paths are not supported
Author: Jeff Zhang <zjffdu@apache.org>
Closes#9595 from zjffdu/SPARK-11622.
https://issues.apache.org/jira/browse/SPARK-12834
We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with Ser/De of Scala Array, as I said in https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes#10772 from yinxusen/SPARK-12834.
```PCAModel``` can output ```explainedVariance``` on the Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10830 from yanboliang/spark-12905.
Update the user guide for RFormula feature interactions. Meanwhile we also document other new features in Spark 1.6, such as support for string labels.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10222 from yanboliang/spark-11965.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
When all labels are the same, it's a dangerous ground for LogisticRegression without intercept to converge. GLMNET doesn't support this case, and will just exit. GLM can train, but will have a warning message saying the algorithm doesn't converge.
Author: DB Tsai <dbt@netflix.com>
Closes#10862 from dbtsai/add-tests.
Add Since annotations to ml.param and ml.*
Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>
Closes#8935 from taishi-oss/issue10263.
This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train.
Author: Imran Younus <iyounus@us.ibm.com>
Closes#10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10472 from BenFradet/SPARK-9716.
From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans.
Author: Holden Karau <holden@us.ibm.com>
Closes#10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, whereas in fact it was the lapack.dppsv method.
Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
Closes#10818 from wjur/wjur/rename_error_message.
Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.
cc mengxr
Author: Eric Liang <ekl@databricks.com>
Closes#10323 from ericl/spark-12346.
I created a new PR since the original PR has not been updated for a long time.
Please help to review.
srowen
Author: Tommy YU <tummyyu@163.com>
Closes#10756 from Wenpei/add_since_to_recomm.
jira: https://issues.apache.org/jira/browse/SPARK-12026
The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.
I tested locally; the change improves the performance and the running time was stable.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10146 from hhbyyh/chiSq.
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.
add some missing assert too.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9484 from hhbyyh/ldaTopicPre.
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset.
Update the priority as it will affect the computation process.
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
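A minimal sketch of the overflow (the word counts are made up): accumulating the corpus word count in an `Int` wraps around to a negative value, which then corrupts the learning-rate formula above; widening the accumulator to `Long` avoids it.
```scala
object TrainWordsCountOverflow {
  def main(args: Array[String]): Unit = {
    val wordCounts = Seq.fill(5)(500000000) // 5 * 5e8 words = 2.5e9 total, above Int.MaxValue

    var asInt = 0
    wordCounts.foreach(asInt += _)          // overflows to a negative value

    var asLong = 0L
    wordCounts.foreach(asLong += _)         // 2500000000

    println(s"Int accumulator: $asInt, Long accumulator: $asLong")
  }
}
```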
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10627 from hhbyyh/w2voverflow.
PySpark MLlib ```GaussianMixtureModel``` should support single-instance ```predict/predictSoft``` just like Scala does.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10552 from yanboliang/spark-12603.
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.
Plus a few fixes to imports that cropped in since my recent cleanups.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10612 from vanzin/SPARK-3873-enable.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10684 from sarutak/SPARK-12692-followup-mllib.
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
Author: Sean Owen <sowen@cloudera.com>
Closes#10570 from srowen/SPARK-12618.
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.
This PR aims to fix both issues.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#10328 from BenFradet/SPARK-12368.
Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.
Author: Imran Younus <iyounus@us.ibm.com>
Author: Imran Younus <imranyounus@gmail.com>
Closes#10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
DecisionTreeRegressor will provide variance of prediction as a Double column.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8866 from yanboliang/spark-9622.
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces the invocations of the deprecated callUDF with it.
Author: Reynold Xin <rxin@databricks.com>
Closes#10547 from rxin/SPARK-12599.
A slight adjustment to the checker configuration was needed; there is
a handful of warnings still left, but those are because of a bug in
the checker that I'll fix separately (before enabling errors for the
checker, of course).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10535 from vanzin/SPARK-3873-mllib.
Include the following changes:
1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10440 from zsxwing/findbugs.
ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`.
Also, the value returned by Map#filterKeys is not Serializable. This is a Scala issue (https://issues.scala-lang.org/browse/SI-6654).
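A standalone sketch of the underlying Scala behaviour and one way around it (with made-up data, outside of ParamMap): filtering the entries eagerly and rebuilding a `mutable.Map` avoids both the `ClassCastException` and the serialization problem.
```scala
import scala.collection.mutable

object FilterKeysSketch {
  def main(args: Array[String]): Unit = {
    val map: mutable.Map[String, Int] = mutable.Map("a" -> 1, "b" -> 2)
    val keep = Set("a")

    // filterKeys returns a collection.Map view, so this cast throws ClassCastException:
    // val bad = map.filterKeys(keep).asInstanceOf[mutable.Map[String, Int]]

    // filter the entries eagerly and rebuild a mutable.Map instead
    val good: mutable.Map[String, Int] =
      mutable.Map(map.toSeq.filter { case (k, _) => keep(k) }: _*)
    println(good)
  }
}
```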
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10381 from sarutak/SPARK-12424.
Restore the original value of the os.arch property after each test.
Since some tests force a specific value of the os.arch property, we need to restore the original value afterwards.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#10289 from kiszk/SPARK-12311.
Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x
jkbradley is this kind of what you had in mind?
Author: Sean Owen <sowen@cloudera.com>
Closes#10327 from srowen/SPARK-12349.
Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes#9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for the spark.ml test suites. I have checked thoroughly and found four test cases that need to be updated.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10279 from yanboliang/spark-12309.
JIRA: https://issues.apache.org/jira/browse/SPARK-12016
We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#10100 from viirya/fix-load-py-wordvecmodel.
As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.
This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased.
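A minimal sketch of the retagging fix described above; it assumes placement inside a Spark-internal package because `RDD.retag` is `private[spark]`, and the object and method names are made up.
```scala
// declared inside a Spark-internal package so that the private[spark] retag is visible
package org.apache.spark.mllib.api.python

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

object RowMatrixRetagSketch {
  // The RDD arriving from Py4J has its element type erased to Object, so it is
  // re-tagged with the Vector class before handing it to RowMatrix.
  def createRowMatrix(rows: RDD[Vector], numRows: Long, numCols: Int): RowMatrix =
    new RowMatrix(rows.retag(classOf[Vector]), numRows, numCols)
}
```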
cc holdenk
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes#9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718)
Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>
Closes#9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance.
CC mengxr
Author: Sean Owen <sowen@cloudera.com>
Closes#9736 from srowen/SPARK-11530.
Currently word2vec has the window size hard-coded at 5; some users may want different sizes (for example when using it on n-gram input or similar). The user request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .
Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.
felixcheung , mengxr
Just added a message to require()
Author: Dominik Dahlem <dominik.dahlem@gmail.combination>
Closes#9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.
fix:
1. `StreamingTest.registerStream` needs a Java-friendly interface.
2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have a Java compatibility issue. Mark them as `DeveloperApi`.
TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has the wrong return type for Java invocation.
`SVMModel` has a similar issue.
Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature, and adding a new function with a different name seems unnecessary.
cc jkbradley feynmanliang
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10102 from hhbyyh/javaAPI.
Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate step.
Author: Nakul Jindal <njindal@us.ibm.com>
Closes#9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10006 from yanboliang/spark-11958.
Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.
This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml.
CC: mengxr yhuai
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#10161 from jkbradley/mllib-sqlcontext-fix.
jira: https://issues.apache.org/jira/browse/SPARK-12096
word2vec can now handle a much bigger vocabulary.
The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be removed.
The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue).
I tested with vocabsize over 18M and vectorsize = 100.
srowen jkbradley Sorry to miss this in last PR. I was reminded today.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10103 from hhbyyh/w2vCapacity.
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10112 from JoshRosen/upgrade-to-sbt-0.13.9.
This replaces https://github.com/apache/spark/pull/9696
Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.
Suggest fixing those TODOs in a separate PR(s).
More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: JoshRosen, pwendell
> I state that the contribution is my original work, and I license the work to the project under the project's open source license.
Author: Dmitry Erastov <derastov@gmail.com>
Closes#9867 from dskrvk/master.
This fixes SPARK-12000, verified on my local machine with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations.
cc: JoshRosen jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#10114 from mengxr/SPARK-12000.2.
cc mengxr noel-smith
I worked on this issue based on https://github.com/apache/spark/pull/8729.
ehsanmok, thank you for your contribution!
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
Closes#9338 from yu-iskw/JIRA-10266.
jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization.
Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:
1. decrease the worker memory consumption by 45%.
2. decrease running time by 40%.
This will also help extend the upper limit for Word2Vec.
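An illustrative sketch of the change (the object, method, and array sizes are made up for the example): the two large float arrays are shipped once as broadcast variables instead of being captured in every task closure.
```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Word2VecBroadcastSketch {
  def run(sc: SparkContext, corpus: RDD[Array[Int]], vocabSize: Int, vectorSize: Int): Unit = {
    val syn0Global = new Array[Float](vocabSize * vectorSize)
    val syn1Global = new Array[Float](vocabSize * vectorSize)

    // broadcast the big arrays instead of letting them ride along in each task closure
    val bcSyn0 = sc.broadcast(syn0Global)
    val bcSyn1 = sc.broadcast(syn1Global)

    corpus.foreachPartition { sentences =>
      val syn0 = bcSyn0.value   // fetched once per executor
      val syn1 = bcSyn1.value
      // ... run the skip-gram updates against syn0 / syn1 here ...
      sentences.foreach(_ => ())
    }

    bcSyn0.unpersist()
    bcSyn1.unpersist()
  }
}
```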
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9878 from hhbyyh/w2vBC.
Add read/write support to LDA, similar to ALS.
save/load for ml.LocalLDAModel is done.
For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send update after some test.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9894 from hhbyyh/ldaMLsave.
Doc for 1.6 that the summaries mostly ignore the weight column.
To be corrected for 1.7
CC: mengxr thunterdb
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9927 from jkbradley/linregsummary-doc.
There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported type DoubleType, NumericType, BooleanType or VectorUDT.
So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType".
This PR aims to fix this, throwing a SparkException when dealing with an unknown column type.
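A minimal sketch of the guard being added (the helper and the message wording are assumptions): unsupported input column types surface as a `SparkException` naming the column instead of a bare `scala.MatchError`.
```scala
import org.apache.spark.SparkException
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types._

object AssemblerTypeCheck {
  def checkColumnType(name: String, dataType: DataType): Unit = dataType match {
    case DoubleType | BooleanType => // supported as-is
    case _: NumericType           => // supported, cast to double downstream
    case _: VectorUDT             => // supported
    case other =>
      throw new SparkException(s"VectorAssembler does not support the $other type of column $name")
  }
}
```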
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#9885 from BenFradet/SPARK-11902.
Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is params and we should save it under ```metadata/``` rather than both under ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` but construct ```mllib.feature.PCAModel``` inside ```transform```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9897 from yanboliang/spark-11912.
I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.)
Added read/write for all 3 Evaluators as well.
CC: mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9848 from jkbradley/cv-io.
```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9839 from yanboliang/standardScaler-refactor.
Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```)
I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem.
CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9851 from jkbradley/tempdir-cleanup.
Add read/write support to the following estimators under spark.ml:
* ChiSqSelector
* PCA
* VectorIndexer
* Word2Vec
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9838 from yanboliang/spark-11829.
Updates:
* Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression
* Switch from hand-written class names in Readers to using getClass
CC: mengxr
CC: yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9829 from jkbradley/ml-io-cleanups.
* add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.*
* define `DefaultParamsReadable/Writable` and use them to save some code
* use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9827 from mengxr/SPARK-11839.
Add read/write support to the following estimators under spark.ml:
* CountVectorizer
* IDF
* MinMaxScaler
* StandardScaler (a little awkward because we store some params in spark.mllib model)
* StringIndexer
Added some necessary method for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9798 from mengxr/SPARK-6787.
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example codes to show:
* supporting feature interaction in R formula.
* summary for gaussian GLM model.
* coefficients for binomial GLM model.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9727 from yanboliang/spark-11684.
jira: https://issues.apache.org/jira/browse/SPARK-11813
I found the problem while training on a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits.
1. Performance improvement for less serialization.
2. Increase the capacity of Word2Vec a lot.
Currently, in the fit of word2vec, the closure mainly includes the serialization of Word2Vec and 2 global tables.
The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 * vocab bytes.
Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of the vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary.
Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9803 from hhbyyh/w2vVocab.
Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata.
CC: mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9786 from jkbradley/als-io.
This replaces [https://github.com/apache/spark/pull/9656] with updates.
fayeshine should be the main author when this PR is committed.
CC: mengxr fayeshine
Author: Wenjian Huang <nextrush@163.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9814 from jkbradley/fayeshine-patch-6790.
I have added a unit test for ML's StandardScaler by comparing with R's output. Please review it for me.
Thanks.
Author: RoyGaoVLIS <roygao@zju.edu.cn>
Closes#6665 from RoyGao/7013.
This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9776 from mengxr/SPARK-11764.
Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.
Moved LogisticRegressionReader/Writer to within LogisticRegressionModel
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9749 from jkbradley/lr-io-2.
This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9755 from jkbradley/transformer-io.
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9751 from mengxr/SPARK-11766.
Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable.
Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9674 from jkbradley/pipeline-io.
Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames, including:
* Use libSVM data source for all example codes under examples/ml, and remove unused import.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides misuse it as ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9690 from yanboliang/spark-11723.
We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext` and then `SQLContext.getOrCreate` might use the `SparkContext` from previous test suite and hence causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites.
cc: yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes#9677 from mengxr/SPARK-11672.2.
Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.
CC feynmanliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9678 from jkbradley/lda-pipelines-2.
This causes compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested compile locally.) JoshRosen
Author: Xiangrui Meng <meng@databricks.com>
Closes#9644 from mengxr/SPARK-11674.
org.apache.spark.ml.feature.Word2Vec.transform() is very slow. We should not read the broadcast variable for every sentence.
Author: Yuming Wang <q79969786@gmail.com>
Author: yuming.wang <q79969786@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#9592 from 979969786/master.
This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9606 from jkbradley/logreg-io2.
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR.
CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9513 from jkbradley/lda-pipelines.
Implementation of step capability for sliding window function in MLlib's RDD.
Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:
Window | Step | Time (s) | Windows produced
------------ | ------------- | ---------- | ----------
128 | 1 | 6.38 | 9999873
128 | 10 | 0.9 | 999988
128 | 100 | 0.41 | 99999
1024 | 1 | 44.67 | 9998977
1024 | 10 | 4.74 | 999898
1024 | 100 | 0.78 | 99990
```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(1 to 10000000, 10)
rdd.count
val window = 1024
val step = 1
val t = System.nanoTime(); val windows = rdd.sliding(window, step); println(windows.count); println((System.nanoTime() - t) / 1e9)
```
Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
Author: Alexander Ulanov <nashb@yandex.ru>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5855 from avulanov/SPARK-7316-sliding.
Refactoring
* separated overwrite and param save logic in DefaultParamsWriter
* added sparkVersion to DefaultParamsWriter
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9587 from jkbradley/logreg-io.
jira: https://issues.apache.org/jira/browse/SPARK-11069
quotes from jira:
Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
call the Boolean Param "toLowercase"
set default to false (so behavior does not change)
Actually sklearn converts to lowercase before tokenizing too
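A minimal sketch of the proposed param, following the usual spark.ml `Params` pattern (the trait is made up for illustration; the real change lives inside `RegexTokenizer`):
```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

trait HasToLowercase extends Params {
  final val toLowercase: BooleanParam =
    new BooleanParam(this, "toLowercase", "whether to convert all characters to lowercase before tokenizing")

  setDefault(toLowercase -> false) // default false, so existing behavior does not change

  def getToLowercase: Boolean = $(toLowercase)

  protected def preprocess(text: String): String =
    if ($(toLowercase)) text.toLowerCase else text
}
```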
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9092 from hhbyyh/tokenLower.
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs. I am going to send other PRs later.
https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting K-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is not changed from the previous version. (#2906)
- However, it is 1000x faster than the previous version through parallel processing.
Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes#5267 from yu-iskw/new-hierarchical-clustering.
The pmml models currently generated do not specify the pmml version in their root node. This is a problem when using these pmml models in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and sets its value to 4.2.
Author: fazlan-nazeem <fazlann@wso2.com>
Closes#9558 from fazlan-nazeem/master.
Expose R-like summary statistics in SparkR::glm for linear regression; the output of ```summary``` looks like:
```
$DevianceResiduals
Min Max
-0.9509607 0.7291832
$Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.2353597 7.123139 4.456124e-11
Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica -1.00751 0.09330565 -10.79796 0
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9561 from yanboliang/spark-11494.
Could jkbradley and davies review it?
- Create a wrapper class: `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#8643 from yu-iskw/SPARK-8467-2.
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap
The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~
The param handling is different from the design doc: we didn't save default and user-set params separately, so when we load a model back, all params are treated as user-set. This does cause issues, but modifying the default params would cause other issues as well.
TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9454 from mengxr/SPARK-11217.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial; I just happened to notice it. If `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something about the conversion to `long`.
mengxr mkolod
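For illustration, a hedged sketch of the problem and one possible fix (not necessarily the exact code in this PR): a single 32-bit MurmurHash widened to `long` leaves the upper 32 bits non-random, so hash twice and pack both results.
```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(java.lang.Long.SIZE / 8).putLong(seed).array()
  val low = MurmurHash3.bytesHash(bytes)            // 32 random-ish bits
  val high = MurmurHash3.bytesHash(bytes, low)      // second hash with a different seed
  (high.toLong << 32) | (low & 0xFFFFFFFFL)         // random bits in both halves
}
```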
Author: Imran Rashid <irashid@cloudera.com>
Closes#8314 from squito/SPARK-10116.
Following up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9485 from yanboliang/spark-11473.
In file LDAOptimizer.scala:
line 441: since `idx` was never used, replaced the unnecessary `zipWithIndex.foreach` with `foreach`.
- nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
+ nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
Author: a1singh <a1singh@ucsd.edu>
Closes#9456 from a1singh/master.
Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9303 from yanboliang/spark-9492.
Currently ```RFormula``` can only handle labels of ```NumericType``` or ```BinaryType``` (casting them to ```DoubleType``` as the label for Linear Regression training); we should also support labels of ```StringType```, which is needed for Logistic Regression (glm with family = "binomial").
For labels of ```StringType```, we should use ```StringIndexer``` to transform them to a 0-based index.
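A hedged sketch of the proposed handling (column names are hypothetical): index the string label to a 0-based numeric label before fitting a binomial model.
```scala
import org.apache.spark.ml.feature.StringIndexer

// df is a DataFrame with a StringType "Species" column used as the label.
val indexer = new StringIndexer()
  .setInputCol("Species")
  .setOutputCol("label")     // 0-based index, usable by LogisticRegression
val indexed = indexer.fit(df).transform(df)
```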
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9302 from yanboliang/spark-11349.
Removed the old `getModelWeights` function, which was private, and renamed it to `getModelCoefficients`.
Author: DB Tsai <dbt@netflix.com>
Closes#9426 from dbtsai/feature-minor.
mengxr, felixcheung
This pull request just relaxes the type of the prediction/label columns to accept both float and double. Internally, these columns are cast to double. The other evaluators might need to be changed as well.
Author: Dominik Dahlem <dominik.dahlem@gmail.combination>
Closes#9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate it in 1.6 and remove its effect in 1.7. It helps us simplify the implementation.
cc: srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes#9322 from mengxr/SPARK-11358.
Made foreachActive public in MLlib's vector API.
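A small usage sketch of what becomes possible once the method is public:
```scala
import org.apache.spark.mllib.linalg.Vectors

// Iterate over only the stored (active) entries of a sparse vector.
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
v.foreachActive { (index, value) =>
  println(s"$index -> $value")   // prints only indices 1 and 3
}
```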
Author: Nakul Jindal <njindal@us.ibm.com>
Closes#9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.
…sion as followup. This is follow-up work for SPARK-10668.
* Fix minor style issues.
* Add test case for checking whether solver is selected properly.
Author: Lewuathe <lewuathe@me.com>
Author: lewuathe <lewuathe@me.com>
Closes#9180 from Lewuathe/SPARK-11207.
SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting ```standardize```, which has been defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9331 from yanboliang/spark-11369.
WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.
Author: Nakul Jindal <njindal@us.ibm.com>
Closes#9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
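For context, a hedged sketch (not the PR's code) of what "root sigma inverse" means; the real implementation also has to handle near-singular covariance matrices via a pseudo-inverse, which this sketch ignores.
```scala
import breeze.linalg.{diag, eigSym, DenseMatrix => BDM}

// With the eigendecomposition sigma = u * diag(d) * u.t, take
// rootSigmaInv = diag(d^(-1/2)) * u.t, so that rootSigmaInv.t * rootSigmaInv = sigma^(-1).
def rootSigmaInv(sigma: BDM[Double]): BDM[Double] = {
  val eigSym.EigSym(d, u) = eigSym(sigma)
  diag(d.map(v => 1.0 / math.sqrt(v))) * u.t
}
```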
Author: Sean Owen <sowen@cloudera.com>
Closes#9309 from srowen/SPARK-11302.2.
Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.
With a test.
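A hedged sketch of the delegation idea (not necessarily the PR's exact code): column similarities do not depend on row ordering, so an `IndexedRowMatrix` can drop its row indices and reuse `RowMatrix`.
```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix}

def columnSimilarities(matrix: IndexedRowMatrix): CoordinateMatrix =
  matrix.toRowMatrix().columnSimilarities()
```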
Author: Reza Zadeh <reza@databricks.com>
Closes#8792 from rezazadeh/colsims.
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier
Author: Sean Owen <sowen@cloudera.com>
Closes#9169 from srowen/SPARK-11184.
This is a PR for Parquet-based model import/export.
* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite
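A hedged usage sketch following the usual spark.mllib Saveable/Loader pattern (not copied from the PR; `sc` is an existing SparkContext):
```scala
import org.apache.spark.mllib.feature.ChiSqSelectorModel

val model = new ChiSqSelectorModel(Array(1, 3, 5))      // indices of selected features
model.save(sc, "/tmp/chisq-selector-model")             // Parquet-backed export
val restored = ChiSqSelectorModel.load(sc, "/tmp/chisq-selector-model")
```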
Author: Jayant Shekar <jayant@user-MBPMBA-3.local>
Closes#6785 from jayantshekhar/SPARK-6723.
The given `row_ind` should be less than the number of rows, and the given `col_ind` should be less than the number of columns. The current code in master gives unpredictable behavior for such out-of-range indices.
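A hedged sketch of the kind of bounds check implied above (not the PR's exact code), for a column-major dense matrix:
```scala
def checkedApply(values: Array[Double], numRows: Int, numCols: Int, i: Int, j: Int): Double = {
  require(i >= 0 && i < numRows, s"row index $i is out of range [0, $numRows)")
  require(j >= 0 && j < numCols, s"column index $j is out of range [0, $numCols)")
  values(i + j * numRows)   // column-major storage, as in DenseMatrix
}
```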
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8271 from MechCoder/hash_code_matrices.
…2 regularization if the number of features is small
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>
Closes#8884 from Lewuathe/SPARK-10668.
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl
Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>
Closes#8609 from lkhamsurenl/SPARK-9963.
jira: https://issues.apache.org/jira/browse/SPARK-11029
We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel.
This will be a temp fix until we have proper evaluators defined for clustering.
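A hedged usage sketch of the added method (signature assumed to mirror spark.mllib's `computeCost`):
```scala
import org.apache.spark.ml.clustering.KMeans

// dataset is a DataFrame with a Vector "features" column.
val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val model = kmeans.fit(dataset)
val wssse = model.computeCost(dataset)   // within-set sum of squared errors
```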
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
Closes#9073 from hhbyyh/computeCost.
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
- Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
- Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
**NOTE**: Right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add back, but it is a little more expensive than before.
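A hedged sketch (not the PR's code) of the driver-side simulation described above: a block (i, p) of A only needs to be sent to the partitions that will compute some output block (i, j) for which B actually has a block (p, j).
```scala
def simulateDestinations(
    aBlocks: Seq[(Int, Int)],               // (rowBlockIndex, colBlockIndex) present in A
    bBlocks: Seq[(Int, Int)],               // (rowBlockIndex, colBlockIndex) present in B
    partitionOf: (Int, Int) => Int): Map[(Int, Int), Set[Int]] = {
  // For each row-block index of B, the column-block indices that actually exist.
  val bColsByRow = bBlocks.groupBy(_._1).mapValues(_.map(_._2)).toMap
  aBlocks.map { case (i, p) =>
    val destinations = bColsByRow.getOrElse(p, Seq.empty).map(j => partitionOf(i, j)).toSet
    ((i, p), destinations)
  }.toMap
}
```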
Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?); the old implementation didn't even run, but the new implementation completed in 268s on a 120 GB / 16 core cluster.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#8757 from brkyvz/opt-bmm.
The values of the quantile probabilities array should be in the range (0, 1) instead of [0, 1] in `AFTSurvivalRegression.scala`, according to [this discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
Author: vectorijk <jiangkai@gmail.com>
Closes#9083 from vectorijk/spark-11059.
This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf` values. This will be used in pipeline persistence. jkbradley
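A hedged sketch (not the PR's code) of the specialization: JSON has no literals for NaN or infinities, so encode them as strings and use a numeric JSON value otherwise.
```scala
import org.json4s._

def doubleToJson(v: Double): JValue = {
  if (v.isNaN) JString("NaN")
  else if (v.isPosInfinity) JString("Inf")
  else if (v.isNegInfinity) JString("-Inf")
  else JDouble(v)
}
```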
Author: Xiangrui Meng <meng@databricks.com>
Closes#9090 from mengxr/SPARK-7402.
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
Closes#8700 from smartkiwi/SPARK-10535_.
Compute the upper triangular values of the covariance matrix, then copy them to the lower triangular values.
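A hedged sketch of the symmetry trick (not the PR's code): evaluate only the upper triangle and mirror it into the lower triangle.
```scala
def symmetrize(n: Int, upper: (Int, Int) => Double): Array[Array[Double]] = {
  val m = Array.ofDim[Double](n, n)
  for (i <- 0 until n; j <- i until n) {
    val v = upper(i, j)   // computed once per (i, j) pair with i <= j
    m(i)(j) = v
    m(j)(i) = v
  }
  m
}
```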
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#8940 from pnpritchard/SPARK-10875.
GBT compares the validation error against a tolerance, switching between relative and absolute tolerances, where the relative one is measured against the current loss on the training set.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8549 from yanboliang/spark-7770.
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
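A hedged sketch of the idea (not the PR's code): distribute only the continuous features, and compute candidate split thresholds for each feature in parallel.
```scala
import org.apache.spark.rdd.RDD

// sampled: one record per continuous feature, carrying sampled values for that feature.
def findContinuousSplits(
    sampled: RDD[(Int, Array[Double])],
    numBins: Int): Map[Int, Array[Double]] = {
  sampled.map { case (featureIndex, values) =>
    val sorted = values.sorted
    val stride = math.max(1, sorted.length / numBins)
    val thresholds = (stride until sorted.length by stride).map(sorted(_)).distinct.toArray
    (featureIndex, thresholds)
  }.collectAsMap().toMap
}
```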
Author: Nathan Howell <nhowell@godaddy.com>
Closes#8246 from NathanHowell/SPARK-10064.
Refactoring the `Instance` case class out of LOR and LIR, and also cleaning up some code.
Author: DB Tsai <dbt@netflix.com>
Closes#8853 from dbtsai/refactoring.
Provide initialModel param for pyspark.mllib.clustering.KMeans
Author: Evan Chen <chene@us.ibm.com>
Closes#8967 from evanyc15/SPARK-10779-pyspark-mllib.
It is currently impossible to clear Param values once they are set. It would be helpful to be able to do so.
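A hedged sketch of the requested behavior, assuming `clear` becomes public on `Params` (and assuming LogisticRegression's default `maxIter` of 100):
```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(50)
lr.clear(lr.maxIter)            // un-set the user-supplied value
assert(lr.getMaxIter == 100)    // falls back to the default
```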
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.