ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Zheng RuiFeng	6555ef0ccb	[TRIVIAL] Typo Fix ## What changes were proposed in this pull request? Fix a typo ## How was this patch tested? no tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14772 from zhengruifeng/minor_numClasses.	2016-08-23 21:25:04 +01:00
Jagadeesan	97d461b75b	[SPARK-17095] [Documentation] [Latex and Scala doc do not play nicely] ## What changes were proposed in this pull request? In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output. Author: Jagadeesan <as2@us.ibm.com> Closes #14688 from jagadeesanas2/SPARK-17095.	2016-08-23 12:23:30 +01:00
hqzizania	37f0ab70d2	[SPARK-17090][FOLLOW-UP][ML] Add expert param support to SharedParamsCodeGen ## What changes were proposed in this pull request? Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added. Author: hqzizania <hqzizania@gmail.com> Closes #14738 from hqzizania/SPARK-17090-minor.	2016-08-22 17:09:08 -07:00
Holden Karau	b264cbb16f	[SPARK-15113][PYSPARK][ML] Add missing num features num classes ## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.	2016-08-22 12:21:22 +02:00
Wenchen Fan	b2074b664a	[SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog ## What changes were proposed in this pull request? Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties. This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place. changes overview: 1. before this PR: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`) after this PR: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore. 2. before this PR: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties. after this PR: The table metadata read from external catalog is exactly the same with what we saved to it. bonus: now we can create data source table using `SessionCatalog`, if schema is specified. breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14155 from cloud-fan/catalog-table.	2016-08-21 22:23:14 -07:00
hqzizania	61ef74f227	[SPARK-17090][ML] Make tree aggregation level in linear/logistic regression configurable ## What changes were proposed in this pull request? Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml Author: hqzizania <hqzizania@gmail.com> Closes #14717 from hqzizania/SPARK-17090.	2016-08-20 18:52:44 -07:00
Junyang Qian	acac7a508a	[SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper ## What changes were proposed in this pull request? Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated. ## How was this patch tested? SparkR unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png) ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png) Author: Junyang Qian <junyangq@databricks.com> Closes #14384 from junyangq/SPARK-16443.	2016-08-19 14:24:09 -07:00
Yanbo Liang	864be9359a	[SPARK-17141][ML] MinMaxScaler should remain NaN value. ## What changes were proposed in this pull request? In the existing code, ```MinMaxScaler``` handle ```NaN``` value indeterminately. * If a column has identity value, that is ```max == min```, ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even the original value is ```NaN```. * Otherwise, it will remain ```NaN``` after transformation. I think we should unify the behavior by remaining ```NaN``` value at any condition, since we don't know how to transform a ```NaN``` value. In Python sklearn, it will throw exception when there is ```NaN``` in the dataset. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14716 from yanboliang/spark-17141.	2016-08-19 03:23:16 -07:00
sethah	287bea1305	[SPARK-7159][ML] Add multiclass logistic regression to Spark ML ## What changes were proposed in this pull request? This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML. JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159) ## How was this patch tested? Added new test suite `MultinomialLogisticRegressionSuite`. ## Approach ### Do not use a "pivot" class in the algorithm formulation Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilites using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai. ### Separate multinomial logistic regression and binary logistic regression The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one. Arguments for: * The multinomial case without pivot is distinctly different than the current binary case since the binary case uses a pivot class. * The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant. Arguments against: * It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two class outcomes). * Some portions of the code are repeated. This is a major design point and warrants more discussion. ### Mean centering When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well. ### Feature scaling In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimizaton. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this has significant (4x in some cases) slow downs in each iteration. This can be avoided by transforming the input to the scaled space ala MLlib once, before iteration begins. This does add some overhead initially, but can make significant time savings in some cases. One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion. ### Specifying and inferring the number of outcome classes The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm. ## Performance Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary. Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each Notes: * Time in units of seconds * Metric is classification accuracy \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.746415 \| 30 \| 100000 \| 3 \| 100000 \| 327.923 \| true \| 0 \| \| mllib \| 0 \| true \| 0.743785 \| 30 \| 100000 \| 3 \| 100000 \| 390.217 \| true \| 0 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.973238 \| 30 \| 2000000 \| 3 \| 10000 \| 385.476 \| true \| 0 \| \| mllib \| 0 \| true \| 0.949828 \| 30 \| 2000000 \| 3 \| 10000 \| 550.403 \| true \| 0 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| mllib \| 0 \| true \| 0.864358 \| 30 \| 2000000 \| 3 \| 10000 \| 543.359 \| true \| 0.1 \| \| ml \| 0 \| true \| 0.867418 \| 30 \| 2000000 \| 3 \| 10000 \| 401.955 \| true \| 0.1 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 1 \| true \| 0.807449 \| 30 \| 2000000 \| 3 \| 10000 \| 334.892 \| true \| 0.05 \| \| algo \| elasticNetParam \| fitIntercept \| metric \| maxIter \| numPoints \| numClasses \| numFeatures \| time \| standardization \| regParam \| \|--------\|-------------------\|----------------\|----------\|-----------\|-------------\|--------------\|---------------\|---------\|-------------------\|------------\| \| ml \| 0 \| true \| 0.602006 \| 30 \| 2000000 \| 500 \| 100 \| 112.319 \| true \| 0 \| \| mllib \| 0 \| true \| 0.567226 \| 30 \| 2000000 \| 500 \| 100 \| 263.768 \| true \| 0 \|e \| 0.567226 \| 30 \| 2000000 \| 500 \| 100 \| 263.768 \| true \| 0 \| ## References Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf) [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) ## Follow up items * Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134) * Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) * Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140) * Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138) * Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) * Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135) * Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136) * Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137) Author: sethah <seth.hendrickson16@gmail.com> Closes #13796 from sethah/SPARK-7159_M.	2016-08-18 22:16:48 -07:00
Xusen Yin	b72bb62d42	[SPARK-16447][ML][SPARKR] LDA wrapper in SparkR ## What changes were proposed in this pull request? Add LDA Wrapper in SparkR with the following interfaces: - spark.lda(data, ...) - spark.posterior(object, newData, ...) - spark.perplexity(object, ...) - summary(object) - write.ml(object) - read.ml(path) ## How was this patch tested? Test with SparkR unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14229 from yinxusen/SPARK-16447.	2016-08-18 05:33:52 -07:00
Yanbo Liang	4d92af310a	[SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR ## What changes were proposed in this pull request? Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14392 from yanboliang/spark-16446.	2016-08-17 11:18:33 -07:00
wm624@hotmail.com	363793f2bf	[SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Add Isotonic Regression wrapper in SparkR Wrappers in R and Scala are added. Unit tests Documentation ## How was this patch tested? Manually tested with sudo ./R/run-tests.sh (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14182 from wangmiao1981/isoR.	2016-08-17 06:15:04 -07:00
WeichenXu	3d8bfe7a39	[SPARK-16934][ML][MLLIB] Update LogisticCostAggregator serialization code to make it consistent with LinearRegression ## What changes were proposed in this pull request? Update LogisticCostAggregator serialization code to make it consistent with #14109 ## How was this patch tested? MLlib 2.0: ![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png) Author: WeichenXu <WeichenXu123@outlook.com> Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.	2016-08-15 06:38:30 -07:00
Yanbo Liang	ddf0d1e3fe	[TRIVIAL][ML] Fix LogisticRegression typo in error message. ## What changes were proposed in this pull request? Fix ```LogisticRegression``` typo in error message. ## How was this patch tested? Docs change, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14633 from yanboliang/lr-typo.	2016-08-15 10:11:29 +01:00
zero323	0ebf7c1bff	[SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize ## What changes were proposed in this pull request? Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient ## How was this patch tested? Spark unit tests Author: zero323 <zero323@users.noreply.github.com> Closes #14614 from zero323/SPARK-17027.	2016-08-14 11:59:24 +01:00
Yanbo Liang	bbae20ade1	[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance ## What changes were proposed in this pull request? ```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance. BTW, we should destroy broadcast variable ```compute``` at the end of each iteration. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14621 from yanboliang/spark-17033.	2016-08-12 10:06:17 -07:00
Yanbo Liang	d4a9122430	[SPARK-16710][SPARKR][ML] spark.glm should support weightCol ## What changes were proposed in this pull request? Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14346 from yanboliang/spark-16710.	2016-08-10 10:53:48 -07:00
Yanbo Liang	182e11904b	[SPARK-16933][ML] Fix AFTAggregator in AFTSurvivalRegression serializes unnecessary data. ## What changes were proposed in this pull request? Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109. ## How was this patch tested? I tested this locally and verified the serialization reduction. Before patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png) After patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14519 from yanboliang/spark-16933.	2016-08-09 03:39:57 -07:00
Holden Karau	9216901d52	[SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting ## What changes were proposed in this pull request? Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability. ## How was this patch tested? Existing tests. Author: Holden Karau <holden@us.ibm.com> Closes #14407 from holdenk/SPARK-16779.	2016-08-08 15:54:03 -07:00
sethah	1db1c6567b	[SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data ## What changes were proposed in this pull request? Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. In https://github.com/apache/spark/pull/13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark. ## How was this patch tested? MLlib ![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png) ML without patch ![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png) ML with patch ![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png) Author: sethah <seth.hendrickson16@gmail.com> Closes #14109 from sethah/LIR_serialize.	2016-08-08 00:00:15 -07:00
Yanbo Liang	6cbde337a5	[SPARK-16750][FOLLOW-UP][ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests. ## What changes were proposed in this pull request? This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review. The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process. ## How was this patch tested? Modified unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14455 from yanboliang/transformSchema.	2016-08-05 22:07:59 +01:00
Zheng RuiFeng	0e2e5d7d0b	[SPARK-16863][ML] ProbabilisticClassifier.fit check threshoulds' length ## What changes were proposed in this pull request? Add threshoulds' length checking for Classifiers which extends ProbabilisticClassifier ## How was this patch tested? unit tests and manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.	2016-08-04 21:44:54 +01:00
WeichenXu	462784ffad	[SPARK-16880][ML][MLLIB] make ann training data persisted if needed ## What changes were proposed in this pull request? To Make sure ANN layer input training data to be persisted, so that it can avoid overhead cost if the RDD need to be computed from lineage. ## How was this patch tested? Existing Tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14483 from WeichenXu123/add_ann_persist_training_data.	2016-08-04 21:41:35 +01:00
Shuai Lin	36827ddafe	[SPARK-16822][DOC] Support latex in scaladoc. ## What changes were proposed in this pull request? Support using latex in scaladoc by adding MathJax javascript to the js template. ## How was this patch tested? Generated scaladoc. Preview: - LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) - MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) Author: Shuai Lin <linshuai2012@gmail.com> Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.	2016-08-02 09:14:08 -07:00
Zheng RuiFeng	d9e0919d30	[SPARK-16851][ML] Incorrect threshould length in 'setThresholds()' evoke Exception ## What changes were proposed in this pull request? Add a length checking for threshoulds' length in method `setThreshoulds()` of classification models. ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14457 from zhengruifeng/check_setThresholds.	2016-08-02 07:22:41 -07:00
Shuai Lin	2a0de7dc99	[SPARK-16485][DOC][ML] Remove useless latex in a log messge. ## What changes were proposed in this pull request? Removed useless latex in a log messge. ## How was this patch tested? Check generated scaladoc. Author: Shuai Lin <linshuai2012@gmail.com> Closes #14380 from lins05/fix-docs-formatting.	2016-08-01 06:54:18 -07:00
WeichenXu	bce354c1d4	[SPARK-16696][ML][MLLIB] destroy KMeans bcNewCenters when loop finished and update code where should release unused broadcast/RDD in proper time ## What changes were proposed in this pull request? update unused broadcast in KMeans/Word2Vec, use destroy(false) to release memory in time. and several place destroy() update to destroy(false) so that it will be async-called, it will better than blocking called. and update bcNewCenters in KMeans to make it destroy in correct time. I use a list to store all historical `bcNewCenters` generated in each loop iteration and delay them to release at the end of loop. fix TODO in `BisectingKMeans.run` "unpersist old indices", Implements the pattern "persist current step RDD, and unpersist previous one" in the loop iteration. ## How was this patch tested? Existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14333 from WeichenXu123/broadvar_unpersist_to_destroy.	2016-07-30 08:07:22 -07:00
Sean Owen	0dc4310b47	[SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required ## What changes were proposed in this pull request? Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14332 from srowen/SPARK-16694.	2016-07-30 04:42:38 -07:00
Yanbo Liang	0557a45452	[SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake ## What changes were proposed in this pull request? ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake. See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug. Why the unit tests did not complain this errors? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` firstly during ```fit``` or ```transform```. I will also add this function to all estimators/transformers who missed in this PR. ## How was this patch tested? No new tests, should pass existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14378 from yanboliang/spark-16750.	2016-07-29 04:40:20 -07:00
krishnakalyan3	7e8279fde1	[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc ## What changes were proposed in this pull request? Updated ML pipeline Cross Validation Scaladoc & PyDoc. ## How was this patch tested? Documentation update (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #13894 from krishnakalyan3/kfold-cv.	2016-07-27 15:37:38 +02:00
Yanbo Liang	3c3371bbd6	[MINOR][ML] Fix some mistake in LinearRegression formula. ## What changes were proposed in this pull request? Fix some mistake in ```LinearRegression``` formula. ## How was this patch tested? Documents change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14369 from yanboliang/LiR-formula.	2016-07-27 11:24:28 +01:00
WeichenXu	4c9695598e	[SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid redundant RDD computation ## What changes were proposed in this pull request? In `LDAOptimizer.submitMiniBatch`, do persist on `stats: RDD[(BDM[Double], List[BDV[Double]])]` and also move the place of unpersisting `expElogbetaBc` broadcast variable, to avoid the `expElogbetaBc` broadcast variable to be unpersisted too early, and update previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)` ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14335 from WeichenXu123/improve_LDA.	2016-07-26 10:41:41 +01:00
WeichenXu	ad3708e783	[SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param default from 1e-4 to 1e-6 so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer. ## How was this patch tested? Existing Test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14286 from WeichenXu123/update_ann_tol.	2016-07-25 20:00:37 +01:00
WeichenXu	25db51675f	[SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug ## What changes were proposed in this pull request? renaming var names to make code more clear: nnz => weightSum weightSum => totalWeightSum and add a new member vector `nnz` (not `nnz` in previous code, which renamed to `weightSum`) to count each dimensions non-zero value number. using `nnz` which I added above instead of `weightSum` when calculating min/max so that it fix several numerical error in some extreme case. ## How was this patch tested? A new testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14216 from WeichenXu123/multivarOnlineSummary.	2016-07-23 12:32:30 +01:00
Anthony Truchet	0dc79ffd1c	[SPARK-16440][MLLIB] Destroy broadcasted variables even on driver ## What changes were proposed in this pull request? Forgotten broadcasted variables were persisted into a previous #PR 14153). This PR turns those `unpersist()` into `destroy()` so that memory is freed even on the driver. ## How was this patch tested? Unit Tests in Word2VecSuite were run locally. This contribution is done on behalf of Criteo, according to the terms of the Apache license 2.0. Author: Anthony Truchet <a.truchet@criteo.com> Closes #14268 from AnthonyTruchet/SPARK-16440.	2016-07-20 10:39:59 +01:00
Yanbo Liang	670891496a	[SPARK-16494][ML] Upgrade breeze version to 0.12 ## What changes were proposed in this pull request? breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes. One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case. We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14150 from yanboliang/spark-16494.	2016-07-19 12:31:04 +01:00
WeichenXu	8310c0741c	[SPARK-16600][MLLIB] fix some latex formula syntax error ## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `hat{x_i}` ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14246 from WeichenXu123/fix_formular_err.	2016-07-19 12:07:40 +01:00
Xin Ren	21a6dd2aef	[SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent https://issues.apache.org/jira/browse/SPARK-16535 ## What changes were proposed in this pull request? When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot ``` Definition of groupId is redundant, because it's inherited from the parent ``` ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png) I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok. ``` <groupId>org.apache.spark</groupId> ``` As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1). ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762 ## How was this patch tested? I've tested by re-building the project, and build succeeded. Author: Xin Ren <iamshrek@126.com> Closes #14189 from keypointt/SPARK-16535.	2016-07-19 11:59:46 +01:00
WeichenXu	a529fc9442	[MINOR][TYPO] fix fininsh typo ## What changes were proposed in this pull request? fininsh => finish ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14238 from WeichenXu123/fix_fininsh_typo.	2016-07-18 09:11:53 +01:00
Reynold Xin	480c870644	[SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python. This patch was originally written by HyukjinKwon. Closes #14236.	2016-07-17 22:48:00 -07:00
Sean Owen	5ec0d692b0	[SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility ## What changes were proposed in this pull request? These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end. ## How was this patch tested? Jenkins build of docs Author: Sean Owen <sowen@cloudera.com> Closes #14221 from srowen/SPARK-3359.3.	2016-07-16 13:26:58 -07:00
z001qdp	71ad945bbb	[SPARK-16426][MLLIB] Fix bug that caused NaNs in IsotonicRegression ## What changes were proposed in this pull request? Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition. ## How was this patch tested? Added a unit test. Author: z001qdp <Nicholas.Eggert@target.com> Closes #14140 from neggert/SPARK-16426-isotonic-nan.	2016-07-15 12:30:22 +01:00
WeichenXu	252d4f27f2	[SPARK-16500][ML][MLLIB][OPTIMIZER] add LBFGS convergence warning for all used place in MLLib ## What changes were proposed in this pull request? Add warning_for the following case when LBFGS training not actually convergence: 1) LogisticRegression 2) AFTSurvivalRegression 3) LBFGS algorithm wrapper in mllib package ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.	2016-07-14 09:11:04 +01:00
Joseph K. Bradley	a5f51e2162	[SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ML, doc fixes ## What changes were proposed in this pull request? Fixing issues found during 2.0 API checks: * GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed * sqlDataTypes: name does not follow conventions. Do we need to expose it? * Evaluator: inconsistent doc between evaluate and isLargerBetter * MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little * GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name ## How was this patch tested? Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.) Author: Joseph K. Bradley <joseph@databricks.com> Closes #14187 from jkbradley/final-api-check-2.0.	2016-07-13 15:40:44 -07:00
Joseph K. Bradley	01f09b1612	[SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml Annotate Estimator-Model pairs of classes and companion objects the same way. For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit.	2016-07-13 12:33:39 -07:00
oraviv	ea06e4ef34	[SPARK-16469] enhanced simulate multiply ## What changes were proposed in this pull request? We have a use case of multiplying very big sparse matrices. we have about 1000x1000 distributed block matrices multiplication and the simulate multiply goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it slightly with classical hashmap and now run in about 30 seconds O(n^2). ## How was this patch tested? We have added a performance test and verified the reduced time. Author: oraviv <oraviv@paypal.com> Closes #14068 from uzadude/master.	2016-07-13 14:47:08 +01:00
Sean Owen	51ade51a9f	[SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs ## What changes were proposed in this pull request? Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14153 from srowen/SPARK-16440.	2016-07-13 11:39:32 +01:00
WeichenXu	6cb75db9ab	[SPARK-16470][ML][OPTIMIZER] Check linear regression training whether actually reach convergence and add warning if not ## What changes were proposed in this pull request? In `ml.regression.LinearRegression`, it use breeze `LBFGS` and `OWLQN` optimizer to do data training, but do not check whether breeze's optimizer returned result actually reached convergence. The `LBFGS` and `OWLQN` optimizer in breeze finish iteration may result the following situations: 1) reach max iteration number 2) function reach value convergence 3) objective function stop improving 4) gradient reach convergence 5) search failed(due to some internal numerical error) I add warning printing code so that if the iteration result is (1) or (3) or (5) in above, it will print a warning with respective reason string. ## How was this patch tested? Manual. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.	2016-07-12 13:04:34 +01:00
WeichenXu	fc11c509e2	[MINOR][ML] update comment where is inconsistent with code in ml.regression.LinearRegression ## What changes were proposed in this pull request? In `train` method of `ml.regression.LinearRegression` when handling situation `std(label) == 0` the code replace `std(label)` with `mean(label)` but the relative comment is inconsistent, I update it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14121 from WeichenXu123/update_lr_comment.	2016-07-12 09:23:59 +01:00
Reynold Xin	ffcb6e055a	[SPARK-16477] Bump master version to 2.1.0-SNAPSHOT ## What changes were proposed in this pull request? After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14130 from rxin/SPARK-16477.	2016-07-11 09:42:56 -07:00

1 2 3 4 5 ...

1583 commits