### What changes were proposed in this pull request?
To push the built jars to the Maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all `3.0.0-SNAPSHOT` version names to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note these changes were previously generated by the release script, but this time, since we manually add tags on the master branch, we need to apply the changes manually as well.**
We shall revert these changes after the 3.0.0-preview release has passed.
### Why are the changes needed?
To make the Maven release repository accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
add single-column input/output support in Imputer
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes. Adds single-column input/output support in Imputer via:
- ```Imputer.setInputCol```
- ```Imputer.setOutputCol```
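A minimal usage sketch of the new single-column mode (the DataFrame `df` and the column names are assumed for illustration):
```scala
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCol("age")           // single input column (new in this PR)
  .setOutputCol("age_imputed")  // single output column (new in this PR)
  .setStrategy("mean")          // replace missing values with the column mean
val imputed = imputer.fit(df).transform(df)
```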
### How was this patch tested?
add unit tests
Closes#26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between Scala and Python.
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes#26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add weight support for GBTs by sampling data before passing it to trees and then passing weights to the trees.
In summary:
1. Add setters for `minWeightFractionPerNode` & `weightCol` (see the sketch after this list).
2. Update input types in private methods from `RDD[LabeledPoint]` to `RDD[Instance]`: `DecisionTreeRegressor.train`, `GradientBoostedTrees.run`, `GradientBoostedTrees.runWithValidation`, `GradientBoostedTrees.computeInitialPredictionAndError`, `GradientBoostedTrees.computeError`, `GradientBoostedTrees.evaluateEachIteration`, `GradientBoostedTrees.boost`, `GradientBoostedTrees.updatePredictionError`.
3. Add a new private method `GradientBoostedTrees.computeError(data, predError)` to compute the average error, since the original `predError.values.mean()` does not take weights into account.
4. Add new tests.
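A minimal sketch of the new setters from item 1 (the DataFrame `trainingDf` and its columns are assumed):
```scala
import org.apache.spark.ml.regression.GBTRegressor

val gbt = new GBTRegressor()
  .setWeightCol("weight")             // new: per-instance sample weights
  .setMinWeightFractionPerNode(0.05)  // new: minimum weighted fraction per node
val model = gbt.fit(trainingDf)       // trainingDf has "features"/"label"/"weight"
```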
### Why are the changes needed?
GBTs should support sample weights like other algorithms.
### Does this PR introduce any user-facing change?
yes, new setters are added
### How was this patch tested?
existing & added testsuites
Closes#25926 from zhengruifeng/gbt_add_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` only contain the default parameter values. We need to update the parameter maps for these trees.
The same issue exists in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```.
### Why are the changes needed?
Users want to access each individual tree and build the trees back up for the random forest estimator. This doesn't work because the trees don't have the correct parameter values.
### Does this PR introduce any user-facing change?
Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values.
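A minimal sketch illustrating the fixed behavior (the DataFrame `trainingDf` and its columns are assumed):
```scala
import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor().setMaxDepth(4).setNumTrees(3)
val model = rf.fit(trainingDf)  // trainingDf has "features"/"label"
model.trees.foreach { tree =>
  // previously the sub-trees only carried default values (e.g. maxDepth = 5);
  // after this change they reflect the parent estimator's params
  assert(tree.getMaxDepth == 4)
}
```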
### How was this patch tested?
Add tests
Closes#26154 from huaxingao/spark-29232.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss
### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
### Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param `eps`
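A minimal sketch of evaluating the new metric (the `predictions` DataFrame is assumed; the exact casing of the metric-name string, written here as `logLoss`, is an assumption based on the option this PR adds):
```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// predictions is assumed to have "label" and "probability"/"prediction" columns
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("logLoss")  // new metric option
  .setEps(1e-15)             // clamps probabilities away from 0 and 1
val loss = evaluator.evaluate(predictions)
```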
### How was this patch tested?
added test suites & local tests referring to sklearn
Closes#26135 from zhengruifeng/logloss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Binarizer supports multi-column input/output by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`
### Why are the changes needed?
Similar algorithms in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`.
### Does this PR introduce any user-facing change?
Yes, adds setters/getters for `thresholds`/`inputCols`/`outputCols`.
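A minimal sketch of the new multi-column mode (the DataFrame `df` and the column names are assumed):
```scala
import org.apache.spark.ml.feature.Binarizer

val binarizer = new Binarizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bin", "f2_bin"))
  .setThresholds(Array(0.5, 10.0))  // one threshold per input column
val binarized = binarizer.transform(df)
```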
### How was this patch tested?
added suites
Closes#26064 from zhengruifeng/binarizer_multicols.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
get the first row lazily, and reuse it for each vector column.
### Why are the changes needed?
avoid unnecessary `first` jobs
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites & local tests in repl
Closes#26052 from zhengruifeng/rformula_lazy_row.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
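A minimal illustration of the replacement (not code from the PR):
```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)
a == b             // false: Array equality is reference equality
a.sameElements(b)  // true: element-wise comparison, works on 2.12 and 2.13
```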
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
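A minimal illustration of the rewrite (not code from the PR):
```scala
// before: the lambda body refers to the Unit companion object (breaks on 2.13)
val f: () => Unit = () => Unit
// after: the unit value itself, equivalent at runtime
val g: () => Unit = () => ()
```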
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
(x: Unit,xs: Unit*)Array[Unit] <and>
(x: Double,xs: Double*)Array[Double] <and>
(x: Float,xs: Float*)Array[Float] <and>
(x: Long,xs: Long*)Array[Long] <and>
(x: Int,xs: Int*)Array[Int] <and>
(x: Char,xs: Char*)Array[Char] <and>
(x: Short,xs: Short*)Array[Short] <and>
(x: Byte,xs: Byte*)Array[Byte] <and>
(x: Boolean,xs: Boolean*)Array[Boolean]
cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
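A minimal illustration of the workaround (assuming `sc` is a `SparkContext`):
```scala
sc.parallelize(Array((1, 2), (3, 4)))  // overload resolution fails on 2.13
sc.parallelize(Seq((1, 2), (3, 4)))    // compiles on both 2.12 and 2.13
```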
### Why are the changes needed?
To better cross-build for 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26062 from srowen/SPARK-29401.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.
### Why are the changes needed?
To keep parity between Python and Scala.
### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```
### How was this patch tested?
add doctest
Closes#25947 from huaxingao/spark-29269.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
- Removal of `private[ml]` modifier from `Regressor`.
- Marking `Regressor` as `DeveloperApi`.
### Why are the changes needed?
Consistency with the rest of ML API as described in [the corresponding JIRA ticket](https://issues.apache.org/jira/browse/SPARK-29363).
### Does this PR introduce any user-facing change?
Yes, as described above.
### How was this patch tested?
Existing tests.
Closes#26033 from zero323/SPARK-29363.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`
### Why are the changes needed?
According to the Jenkins logs, there are 118 warnings about this.
```
grep "is deprecated" ~/consoleText | grep scalatest | wc -l
118
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
After Jenkins passes, we need to check the Jenkins log.
Closes#25982 from dongjoon-hyun/SPARK-29307.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 emits a deprecation warning for procedure-like declarations:
```
def foo() {
...
```
This is equivalent to the following, so should be changed to avoid a warning:
```
def foo(): Unit = {
...
```
### Why are the changes needed?
It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0, as back-ports from 3.0 to 2.4 are less likely than, say, from 3.1 to 3.0, minimizing the downside of touching so many files.
Unfortunately, that makes this quite a big change.
### Does this PR introduce any user-facing change?
No behavior change at all.
### How was this patch tested?
Existing tests.
Closes#25968 from srowen/SPARK-29291.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR regenerate the benchmark results in `core` and `mllib` module in order to compare JDK8/JDK11 result.
### Why are the changes needed?
According to the results, for `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`, JDK11 is slightly faster. In general, there is no regression in JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. Manually run the benchmark.
Closes#25969 from dongjoon-hyun/SPARK-29297.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`
2. Expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`
3. Add the metric `explainedVariance` in `RegressionEvaluator` (see the sketch after this list)
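A minimal sketch of the newly exposed knobs; the `"var"` metric-name string for explained variance is an assumption:
```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

val binEval = new BinaryClassificationEvaluator()
  .setNumBins(1000)        // expert param surfaced from BinaryClassificationMetrics

val regEval = new RegressionEvaluator()
  .setThroughOrigin(true)  // expert param surfaced from RegressionMetrics
  .setMetricName("var")    // explained variance (option-name string assumed)
```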
### Why are the changes needed?
Existing functionality in mllib.metrics should also be exposed in ml.
### Does this PR introduce any user-facing change?
Yes, this PR adds two expert params and one metric option.
### How was this patch tested?
existing and added tests
Closes#25940 from zhengruifeng/evaluator_add_param.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add common methods to support extracting instances with weights.
### Why are the changes needed?
Today more and more ML algorithms support weighting; adding this method will make their implementations simpler.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#25802 from zhengruifeng/add_extractInstances.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Support for dot product with:
- `ml.linalg.Vector`
- `ml.linalg.Vectors`
- `mllib.linalg.Vector`
- `mllib.linalg.Vectors`
### Why are the changes needed?
Dot product is useful for feature engineering and scoring. BLAS routines are already there, just a wrapper is needed.
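A minimal sketch of the new API on the `ml.linalg` side:
```scala
import org.apache.spark.ml.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.sparse(3, Array(0, 2), Array(4.0, 5.0))
v1.dot(v2)  // 1.0*4.0 + 3.0*5.0 = 19.0
```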
### Does this PR introduce any user-facing change?
No user-facing changes, just some new functionality.
### How was this patch tested?
Tests were written and added to the appropriate `VectorSuites` classes. They can be quickly run with:
```
sbt "mllib-local/testOnly org.apache.spark.ml.linalg.VectorsSuite"
sbt "mllib/testOnly org.apache.spark.mllib.linalg.VectorsSuite"
```
Closes#25818 from phpisciuneri/SPARK-29121.
Authored-by: Patrick Pisciuneri <phpisciuneri@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update breeze dependency to 1.0.
### Why are the changes needed?
Breeze 1.0 supports Scala 2.13 and has a few bug fixes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25874 from srowen/SPARK-28772.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
If `threshold < 0`, convert the implicit zeros to 1, although this will break sparsity.
### Why are the changes needed?
If `threshold < 0`, the current implementation deals with sparse vectors incorrectly.
See JIRA [SPARK-29144](https://issues.apache.org/jira/browse/SPARK-29144) and [Scikit-Learn's Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) ('Threshold may not be less than 0 for operations on sparse matrices.') for details.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
added testsuite
Closes#25829 from zhengruifeng/binarizer_throw_exception_sparse_vector.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1. GMM: obtain the prediction (double) from its probability prediction (vector)
2. GLR: obtain the prediction (double) from its link prediction (double)
### Why are the changes needed?
It avoids predicting twice.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25815 from zhengruifeng/gmm_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Fitting an ALS model can fail due to nondeterministic input data. Currently the failure surfaces as an ArrayIndexOutOfBoundsException, which doesn't explain to end users what went wrong in fitting.
This patch catches this exception and rethrows a more explainable one when the input data is nondeterministic.
Because we may not exactly know the output deterministic level of RDDs produced by user code, this patch also adds a note to the Scala/Python/R ALS documentation about the training data's deterministic level.
### Why are the changes needed?
An ArrayIndexOutOfBoundsException was observed during fitting of an ALS model. It was caused by a mismatch between in/out user/item blocks when computing ratings.
If the training RDD output is nondeterministic, then when a fetch failure happens, rerunning part of the training RDD can produce inconsistent user/item blocks.
This patch is needed to notify users about fitting ALS on nondeterministic input.
### Does this PR introduce any user-facing change?
Yes. When fitting an ALS model on nondeterministic input data, previously, if a rerun happened, users would see an ArrayIndexOutOfBoundsException caused by a mismatch between in/out user/item blocks.
After this patch, a SparkException with a clearer message is thrown, wrapping the original ArrayIndexOutOfBoundsException.
### How was this patch tested?
Tested on development cluster.
Closes#25789 from viirya/als-indeterminate-input.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This removes the duplicated dependency which was added by [SPARK-29007](b62ef8f793/mllib/pom.xml (L58-L64)).
### Why are the changes needed?
Maven complains about this kind of duplication. We had better be safe for future Maven versions.
```
$ cd mllib
$ mvn clean package -DskipTests
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-mllib_2.12:jar:3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.spark:spark-streaming_${scala.binary.version}:test-jar -> duplicate declaration of version ${project.version} line 119, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
...
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual check since this is a warning.
```
$ cd mllib
$ mvn clean package -DskipTests
```
Closes#25783 from dongjoon-hyun/SPARK-29007.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch enforces tests to prevent leaking a newly created SparkContext when it is created via initializing a StreamingContext. A leaked SparkContext in a test would make most of the following tests fail as well, so this patch applies defensive programming, trying its best to ensure the SparkContext is cleaned up.
### Why are the changes needed?
We have seen cases in CI builds where a SparkContext is leaked and other tests are affected by the leaked SparkContext. Ideally we should isolate the environment among tests if possible.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Modified UTs.
Closes#25709 from HeartSaVioR/SPARK-29007.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer.
## How was this patch tested?
Add new unit test.
Closes#20442 from huaxingao/spark-23265.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc
Notes:
- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.
### Why are the changes needed?
Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.
### Does this PR introduce any user-facing change?
Yes, in that deprecated items are removed from some public APIs.
### How was this patch tested?
Existing tests.
Closes#25684 from srowen/SPARK-28980.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Add HasNumFeatures on the Scala side, with `1 << 18` as the default value.
### Why are the changes needed?
HasNumFeatures is already added on the Python side, so it is reasonable to keep them in sync.
I didn't find other similar places.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing testsuites
Closes#25671 from zhengruifeng/add_HasNumFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When initializing factors in ALS, we should use `mapPartitions` instead of the current `map`, so we can preserve the existing partitioning of the RDD of `InBlock`. The RDD of `InBlock` is already partitioned by src block id; we don't change the partitioning when initializing factors.
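A minimal, self-contained sketch of the idea (not ALS's actual code; assumes `sc` is a `SparkContext`):
```scala
import org.apache.spark.HashPartitioner

val inBlocks = sc.parallelize(Seq((0, "b0"), (1, "b1")))
  .partitionBy(new HashPartitioner(4))

// map(...) would drop the partitioner; mapPartitions with
// preservesPartitioning = true keeps it, avoiding a later shuffle
val factors = inBlocks.mapPartitions(
  iter => iter.map { case (srcBlockId, block) => (srcBlockId, block.length) },
  preservesPartitioning = true)

assert(factors.partitioner == inBlocks.partitioner)
```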
### Why are the changes needed?
This patch can reduce unnecessary shuffle after initializing factors.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It should not change existing tests. It should pass the added test that verifies the shuffle dependency of the factor RDDs.
Closes#25639 from viirya/fix-als-partition.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.
The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)
It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.
### Why are the changes needed?
Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25558 from srowen/SPARK-28855.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
In the ALS ML implementation, for the non-implicit case, we checkpoint the RDD of item factors at intervals. Before checkpointing (.checkpoint()) and materializing (.count()) the RDD, this RDD was not persisted, which causes recomputation. In an experiment, there is a performance difference between persisting and not persisting before checkpointing the RDD.
The performance difference is not big, but this change is not big either. The actual performance difference varies depending on the checkpoint interval, training dataset, etc.
### Why are the changes needed?
Persisting the RDD before checkpointing the RDD of item factors can avoid recomputation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual check RDD recomputation or not.
Taking 30% MovieLens 20M Dataset as training dataset. Setting checkpoint dir for SparkContext. Fitting an ALS model like:
```scala
val als = new ALS()
.setMaxIter(100)
.setCheckpointInterval(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val t0 = System.currentTimeMillis()
val model = als.fit(training)
val t1 = System.currentTimeMillis()
```
Before this patch: 65.386 s
After this patch: 61.022 s
Closes#25576 from viirya/persist-item-factors.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Since the method `labels` is already deprecated, we should update the examples and suites to avoid warnings when compiling Spark:
```
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala:65: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn] .setLabels(labelIndexer.labels)
[warn] ^
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala:68: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn] .setLabels(labelIndexer.labels)
[warn] ^
```
## How was this patch tested?
existing suites
Closes#25428 from zhengruifeng/del_stringindexer_labels_usage.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Tree-based feature transformation is a widely used feature and is already implemented in many famous libraries, like sklearn/xgboost/lightgbm/catboost, but it is still missing in ML.
The previous discussions and design doc can be found in [SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677), which is the only left subtask in 'GBT improvement umbrella' [SPARK-14047](https://issues.apache.org/jira/browse/SPARK-14047).
This PR adds tree-based feature transformation.
## How was this patch tested?
existing and added suites
Closes#25383 from zhengruifeng/tree_path.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
SparkML writer gets hadoop conf from session state, instead of the spark context.
### Why are the changes needed?
Allow for multiple sessions in the same context that have different hadoop configurations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested in pyspark.ml.tests.test_persistence.PersistenceTest test_default_read_write
Closes#25505 from helenyugithub/SPARK-28776.
Authored-by: heleny <heleny@palantir.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Delete the incorrect method `def setWeightCol(value: Double): this.type = set(threshold, value)` in `LinearSVCModel`
### Why are the changes needed?
`LinearSVCModel` should not provide this setter; moreover, the method is wrongly defined.
### Does this PR introduce any user-facing change?
yes, a public method is removed
### How was this patch tested?
existing suites
Closes#25510 from zhengruifeng/linearsvc_model_set_weightcol.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Fix dummy tree created in decision tree tests to have actually consistent stats, so that it can be compared in tests more completely. The current one has values for, say, impurity that don't even match internally.
With this, the tests can assert more about stats staying correct after load.
### Why are the changes needed?
Fixes a TODO and improves the test slightly.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests.
Closes#25485 from srowen/SPARK-28434.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The `fit` method in `StringIndexer` sorts given labels in a sequential approach if there are multiple input columns. When the number of input columns increases, the label-sorting time dramatically increases too, so it is hard to use in practice when dealing with hundreds of input columns.
This patch tries to make the label sorting parallel.
This runs benchmark like:
```scala
import org.apache.spark.ml.feature.StringIndexer
val numCol = 300
val data = (0 to 100).map { i =>
(i, 100 * i)
}
var df = data.toDF("id", "label0")
(1 to numCol).foreach { idx =>
df = df.withColumn(s"label$idx", col("label0") + 1)
}
val inputCols = (0 to numCol).map(i => s"label$i").toArray
val outputCols = (0 to numCol).map(i => s"labelIndex$i").toArray
val t0 = System.nanoTime()
val indexer = new StringIndexer().setInputCols(inputCols).setOutputCols(outputCols).setStringOrderType("alphabetDesc").fit(df)
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) / 1000000000.0 + "s")
```
| numCol | 20 | 50 | 100 | 200 | 300 |
|--:|---|---|---|---|---|
| Before | 9.85 | 28.62 | 64.35 | 167.17 | 431.60 |
| After | 2.44 | 2.71 | 3.34 | 4.83 | 6.90 |
Unit: second
## How was this patch tested?
Passed existing tests. Manually test for performance.
Closes#25442 from viirya/improve_stringindexer2.
Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR fixes typos in comments and replaces explicit types with '<>' (the diamond operator) for Java 8+.
## How was this patch tested?
Manually tested.
Closes#25338 from younggyuchun/younggyu.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Some cleanup and tiny optimizations:
1. Since the `transformImpl` method on the .mllib side is no longer used on the .ml side, its scope should be limited.
2. In the `hashUDF`, the val `numOfFeatures` is never used.
3. In the UDF, it is inefficient to invoke a param getter (`$(numFeatures)`/`$(binary)`) directly or via the method `indexOf` (`$(numFeatures)`); instead, the getter should be called outside of the UDF (see the sketch after this list).
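A minimal, self-contained sketch of the hoisting pattern in item 3; the `expensiveGetter` helper is hypothetical and stands in for a per-row param lookup like `$(numFeatures)`:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

def expensiveGetter(): Int = 1 << 18  // stand-in for a param-map lookup

// inefficient: the getter runs once per row inside the UDF
val slow = udf { term: String => math.abs(term.hashCode) % expensiveGetter() }

// better: call the getter once and close over the plain value
val n = expensiveGetter()
val fast = udf { term: String => math.abs(term.hashCode) % n }

Seq("spark", "ml").toDF("term").select(fast($"term")).show()
```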
## How was this patch tested?
existing suites
Closes#25324 from zhengruifeng/hashingtf_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove the redundant and confusing `transformImpl` method in RF & GBT:
1. In `GBTClassifier` & `RandomForestClassifier`, the real `transform` methods inherit from `ProbabilisticClassificationModel`, which can deal with multiple output columns. The `transformImpl` method, which deals with only one column (`predictionCol`), completely does nothing. This is quite confusing.
2. In `GBTRegressor` & `RandomForestRegressor`, the `transformImpl` does exactly what the superclass `PredictionModel` does (except model broadcasting), so it can be removed.
## How was this patch tested?
existing suites
Closes#25256 from zhengruifeng/del_ensamble_transformImpl.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Use `log1p(x)` over `log(1+x)` and `expm1(x)` over `exp(x)-1` for accuracy, where possible. This should improve accuracy a tiny bit in ML-related calculations, and shouldn't hurt in any event.
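A minimal illustration of the accuracy difference for tiny x (not code from the PR):
```scala
val x = 1e-18
math.log(1 + x)  // 0.0     -- 1 + x rounds to exactly 1.0 in double precision
math.log1p(x)    // 1.0E-18 -- accurate
math.exp(x) - 1  // 0.0     -- same rounding loss
math.expm1(x)    // 1.0E-18 -- accurate
```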
## How was this patch tested?
Existing tests.
Closes#25337 from srowen/SPARK-28604.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Imputer currently requires the input column to be Double or Float, but the logic should work on any numeric data type. Many practical problems have integer data types, and it could get very tedious to manually cast them into Double before calling the Imputer. This transformer could be extended to handle all numeric types.
## How was this patch tested?
new test
Closes#17864 from actuaryzhang/imputer.
Lead-authored-by: actuaryzhang <actuaryzhang10@gmail.com>
Co-authored-by: Wayne Zhang <actuaryzhang@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Update HashingTF to use the new implementation of MurmurHash3.
Make HashingTF use the old MurmurHash3 when a model saved before 3.0 is loaded.
## How was this patch tested?
Changed existing unit tests. Also added one unit test to make sure HashingTF uses the old MurmurHash3 when a model saved before 3.0 is loaded.
Closes#25303 from huaxingao/spark-23469.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Avoid the `.ml.vector => .breeze.vector` conversion in `MaxAbsScaler`,
and reuse the transformation method in `StandardScalerModel`, which deals with dense & sparse vectors separately.
## How was this patch tested?
existing suites
Closes#25311 from zhengruifeng/maxabs_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
I removed the deprecated `ImageSchema.readImages`.
Moved some useful methods from class `ImageSchema` into class `ImageFileFormat`.
In pyspark, I renamed the `ImageSchema` class to `ImageUtils`, and kept some useful Python methods in it.
## How was this patch tested?
UT.
Closes#25245 from WeichenXu123/remove_image_schema.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Implement `RobustScaler`
Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers.
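A minimal usage sketch of the new transformer (the DataFrame `df` and the column names are assumed):
```scala
import org.apache.spark.ml.feature.RobustScaler

val scaler = new RobustScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithCentering(true)  // subtract the median
  .setWithScaling(true)    // divide by the interquartile range
val scaled = scaler.fit(df).transform(df)
```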
## How was this patch tested?
existing and added tests
Closes#25160 from zhengruifeng/robust_scaler.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add indexOf method for ml.feature.HashingTF.
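A minimal sketch of the new method:
```scala
import org.apache.spark.ml.feature.HashingTF

val tf = new HashingTF().setNumFeatures(1 << 18)
val idx = tf.indexOf("spark")  // index of the bucket the term hashes to
```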
## How was this patch tested?
Add Unit test.
Closes#25250 from huaxingao/spark-21481.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
1. Avoid calling param getters in the UDF.
2. For constant dims, precompute the transformed result.
3. For usual dims, precompute `scale / originalRange(i)` to skip a division.
## How was this patch tested?
existing suites
Closes#25244 from zhengruifeng/minmax_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Because the local default locale isn't among the available locales in `Locale`, when I ran some tests locally with Python code, `StopWordsRemover`-related Python tests hit errors like:
```
Traceback (most recent call last):
File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
answer, self._gateway_client, None, self._fqn)
File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 parameter locale given invalid value en_TW.'
```
As per HyukjinKwon's advice, instead of setting up the locale to pass the test, it is better to fall back to a workable locale if the system default locale can't be found in the available locales in the JVM. Otherwise, users have to manually change the system locale or access a private property `_jvm` in PySpark.
## How was this patch tested?
Added test and manual test.
```
scala> val remover = new StopWordsRemover().setInputCol("raw").setOutputCol("filtered")
19/07/14 19:20:03 WARN StopWordsRemover: Default locale set was [en_TW]; however, it was not found in available locales in JVM, falling back to en_US locale. Set param `locale` in order to respect another locale.
```
Closes#25133 from viirya/pytest-default-locale.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Optimize `SparseVector.apply` by avoiding internal conversion.
Since the speed-up is significant (2.5x ~ 5x) and this method is widely used in ml, I suggest back-porting.
| size | nnz | apply (old), ms | apply2 (new impl), ms | apply3 (new impl with extra range check), ms |
|------|----------|------------|----------|----------|
|10000000|100|75294|12208|18682|
|10000000|10000|75616|23132|32932|
|10000000|1000000|92949|42529|48821|
## How was this patch tested?
existing tests
using the following code to test performance (here the new impl is named `apply2`, and another impl with an extra range check is named `apply3`):
```
import scala.util.Random
import org.apache.spark.ml.linalg._
val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
val rng = new Random(123)
val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
val values = Array.fill(nnz)(rng.nextDouble)
val vec = Vectors.sparse(size, indices, values).toSparse
val tic1 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} };
val toc1 = System.currentTimeMillis;
val tic2 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} };
val toc2 = System.currentTimeMillis;
val tic3 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} };
val toc3 = System.currentTimeMillis;
println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```
Closes#25178 from zhengruifeng/sparse_vec_apply.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Use `org.apache.spark.mllib.util.TestingUtils` object across `MLLIB` component to compare floating point values in tests.
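A sketch of the comparison DSL this utility provides (the operator and `absTol` names are treated as assumptions here, since this is a test-only internal object):
```scala
import org.apache.spark.mllib.util.TestingUtils._

// approximate floating-point comparison instead of exact equality
assert(0.1 + 0.2 ~== 0.3 absTol 1e-10)
```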
## How was this patch tested?
`build/mvn test` - existing tests against updated code.
Closes#25191 from eugen-prokhorenko/mllib-testingutils-double-comparison.
Authored-by: Ievgen Prokhorenko <eugen.prokhorenko@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, ignore it. And if all names are empty, log a warning message, then do nothing.
## How was this patch tested?
existing tests
Closes#24793 from zhengruifeng/aft_iso_check_empty_outputCol.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This upgrades to a newer version of Pyrolite. Most updates [1] in the newer version are for .NET. For Java, it includes a bug fix to the Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5.
After upgrading, we can remove the fix from SPARK-27629 for the bug in the Unpickler.
[1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master
## How was this patch tested?
Manually tested on Python 3.6 in local on existing tests.
Closes#25143 from viirya/upgrade-pyrolite.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Fix a typo from SPARK-28159:
`transfromWithMean` -> `transformWithMean`
## How was this patch tested?
existing test
Closes#25129 from zhengruifeng/to_ml_vec_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In both cases, the input `DataFrame` schema must contain only the information that's required for the matrix object, so a vector column in the case of `RowMatrix` and long and vector columns for `IndexedRowMatrix`.
## How was this patch tested?
Unit tests that verify:
- `RowMatrix` and `IndexedRowMatrix` can be created from `DataFrame`s
- If the schema does not match expectations, we throw an `IllegalArgumentException`
Closes#24953 from henrydavidge/row-matrix-df.
Authored-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Make the transforms native in the ml framework to avoid extra conversion.
There are many TODOs in the current ml module, like `// TODO: Make the transformer natively in ml framework to avoid extra conversion.` in ChiSqSelector.
This PR makes ml algorithms no longer need to convert ml-vectors to mllib-vectors in transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
## How was this patch tested?
existing testsuites
Closes#24963 from zhengruifeng/to_ml_vector.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
If the input dataset is already cached, then we do not need to cache the internal RDD (as in KMeans).
## How was this patch tested?
existing test
Closes#24919 from zhengruifeng/gmm_fix_double_caching.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Cache the dataset in BisectingKMeans.
Cache the dataset in LDA if the online solver is chosen.
## How was this patch tested?
existing test
Closes#24920 from zhengruifeng/bikm_cache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add missing RankingEvaluator
## How was this patch tested?
added testsuites
Closes#24869 from zhengruifeng/ranking_eval.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Provide a way to recursively load data from a datasource.
I add a "recursiveFileLookup" option.
When the "recursiveFileLookup" option is turned on, partition inferring is turned off and all files from the directory are loaded recursively.
If a datasource explicitly specifies the partitionSpec and the user turns the "recursiveFileLookup" option on, an exception is thrown.
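A minimal sketch of the new option (assuming an active `spark` session; the format and path are illustrative):
```scala
// loads all files under the directory recursively; partition inference is off
val df = spark.read
  .format("json")
  .option("recursiveFileLookup", "true")
  .load("/data/logs")
```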
## How was this patch tested?
Unit tests.
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#24830 from WeichenXu123/recursive_ds.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Modifies the HuberAggregator class so that a copy of the coefficients vector isn't created every time an instance is added. Follows the approach of LeastSquaresAggregator and uses a transient lazy class variable to store the reused quantities. (See https://github.com/apache/spark/pull/14109 for an explanation of the use of transient lazy variables.)
On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.
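A minimal, hypothetical sketch of the transient-lazy-variable pattern described above (`WeightedSumAggregator` is illustrative, not the actual HuberAggregator):
```scala
import org.apache.spark.broadcast.Broadcast

// the broadcast handle travels with the task; the deserialized array is
// cached lazily per executor instead of being copied on every add()
class WeightedSumAggregator(bcCoefficients: Broadcast[Array[Double]])
    extends Serializable {
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value
  private var lossSum = 0.0

  def add(features: Array[Double], label: Double): this.type = {
    var i = 0; var margin = 0.0
    while (i < features.length) { margin += coefficients(i) * features(i); i += 1 }
    lossSum += (margin - label) * (margin - label)
    this
  }
}
```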
## How was this patch tested?
Existing unit tests.
See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.
Closes#24880 from Andrew-Crosby/spark-28062.
Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
expose more metrics in evaluator: weightedTruePositiveRate/weightedFalsePositiveRate/weightedFMeasure/truePositiveRateByLabel/falsePositiveRateByLabel/precisionByLabel/recallByLabel/fMeasureByLabel
## How was this patch tested?
existing cases and add cases
Closes#24868 from zhengruifeng/multi_class_support_bylabel.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The word2vec logic fails if a corpus has a word with count > 1e9. We should be able to handle very large counts generally better here by using longs to count.
This takes over https://github.com/apache/spark/pull/24814
## How was this patch tested?
Existing tests.
Closes#24893 from srowen/SPARK-28081.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add MultilabelClassificationEvaluator
## How was this patch tested?
added testsuites
Closes#24777 from zhengruifeng/multi_label_eval.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Added support for `*` and `^` operators, along with expressions within parentheses. The new operators just expand to already supported terms, such as:
- y ~ a * b = y ~ a + b + a:b
- y ~ (a+b+c)^3 = y ~ a + b + c + a:b + a:c + b:c + a:b:c
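A minimal sketch using the new operators (the DataFrame `df` with columns `y`, `a`, `b` is assumed):
```scala
import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("y ~ (a + b)^2")  // expands to a + b + a:b
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(df).transform(df)
```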
## How was this patch tested?
Added new unit tests to RFormulaParserSuite
Closes#24764 from ozancicek/rformula.
Authored-by: ozan <ozancancicekci@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add the missing `Since` annotation for `meanAveragePrecision`.
## How was this patch tested?
existing tests
Closes#24778 from zhengruifeng/ranking_missing_since.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
compute all metrics with only one pass
## How was this patch tested?
existing tests
Closes#24717 from zhengruifeng/multi_label_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Single-point clusters should have a silhouette score of 0, according to the original paper and the scikit-learn implementation.
## How was this patch tested?
Existing test suite + new test case.
Closes#24756 from srowen/SPARK-27896.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR wraps all `PrintWriter` usages with `Utils.tryWithResource` to prevent resource leaks.
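A minimal sketch of the wrap-with-resource pattern this PR applies (`Utils.tryWithResource` is Spark-internal; this equivalent shape is an assumption):
```scala
import java.io.PrintWriter

def tryWithResource[R <: AutoCloseable, T](create: => R)(f: R => T): T = {
  val resource = create
  try f(resource) finally resource.close()
}

tryWithResource(new PrintWriter("out.txt")) { writer =>
  writer.println("hello")  // the writer is closed even if this throws
}
```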
## How was this patch tested?
Existing test
Closes#24739 from wangyum/SPARK-27875.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The failure log format is fixed according to the JDK implementation.
## How was this patch tested?
Manual tests have been done. The new failure log format would be like:
```
java.lang.RuntimeException: Failed to finish the task
	at com.xxx.Test.test(Test.java:106)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:571)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:707)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:979)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
	at org.testng.TestRunner.privateRun(TestRunner.java:648)
	at org.testng.TestRunner.run(TestRunner.java:505)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
	at org.testng.SuiteRunner.run(SuiteRunner.java:364)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1187)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1116)
	at org.testng.TestNG.runSuites(TestNG.java:1028)
	at org.testng.TestNG.run(TestNG.java:996)
	at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
	at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:123)
Caused by: java.io.FileNotFoundException: File is not found
	at com.xxx.Test.test(Test.java:105)
	... 24 more
```
Closes#24684 from breakdawn/master.
Authored-by: MJ Tang <mingjtang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
compute AUC on one pass
## How was this patch tested?
existing tests
performance tests:
```
import org.apache.spark.mllib.evaluation._
val scoreAndLabels = sc.parallelize(Array.range(0, 100000).map{ i => (i.toDouble / 100000, (i % 2).toDouble) }, 4)
scoreAndLabels.persist()
scoreAndLabels.count()
val tic = System.currentTimeMillis
(0 until 100).foreach{i => val metrics = new BinaryClassificationMetrics(scoreAndLabels, 0); val auc = metrics.areaUnderROC; metrics.unpersist}
val toc = System.currentTimeMillis
toc - tic
```
| New (ms) | Existing (ms) |
|------|----------|
|87532|103644|
One-pass AUC saves about 16% computation time.
Closes#24648 from zhengruifeng/auc_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Eliminate an unnecessary job to compute SSreg.
Compute SSreg based on the summary of predictions instead.
## How was this patch tested?
existing tests
Closes#24656 from zhengruifeng/RegressionMetrics_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Avoid hardcoded configs in `SparkConf`, `SparkSubmit`, and tests.
## How was this patch tested?
N/A
Closes#24631 from wenxuanguan/minor-fix.
Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]`, as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.
## How was this patch tested?
Existing tests
Closes#24586 from srowen/SPARK-27682.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Test steps to reproduce:
1) bin/spark-shell
```
val dataset = spark.createDataFrame(Seq(
(0L, 1L, 1.0),
(1L,2L,1.0),
(3L, 4L,1.0),
(4L,0L,0.1))).toDF("src", "dst", "weight")
val model = new PowerIterationClustering().
setMaxIter(10).
setInitMode("degree").
setWeightCol("weight")
val prediction = model.assignClusters(dataset).select("id", "cluster")
```
2) Open the Storage tab of the UI. We can see many RDD blocks cached, even after running the PIC.
In this PR, we basically materialize the new graph before unpersisting the old ones.
## How was this patch tested?
Manually tested and existing UTs.
Before patch:
![Screenshot from 2019-05-06 02-53-45](https://user-images.githubusercontent.com/23054875/57201033-daf61b80-6fb0-11e9-97ff-7534909ce2d3.png)
After patch:
![Screenshot from 2019-05-06 03-41-04](https://user-images.githubusercontent.com/23054875/57201043-07aa3300-6fb1-11e9-855b-f63ee18ea371.png)
Closes#24531 from shahidki31/SPARK-27636.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Added the method `meanAveragePrecisionAt(k)` to RankingMetrics.
This branch is rebased with squashed commits from https://github.com/apache/spark/pull/24458
## How was this patch tested?
Added code in the existing test RankingMetricsSuite.
Closes#24543 from qb-tarushg/SPARK-27540-REBASE.
Authored-by: qb-tarushg <tarush.grover@quantumblack.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries:
- shade the `META-INF/native/libnetty_*` native libraries when packaging
the yarn shuffle service jar. This is required as the netty library loader
derives the library name from the shaded package name.
- updated the `org/spark_project` shade package prefix to `org/sparkproject`
(i.e. removed the underscore), as the former breaks the netty native lib loading.
This was causing the yarn external shuffle service to fail
when `spark.shuffle.io.mode=EPOLL`.
## How was this patch tested?
Manual tests
Closes#24502 from amuraru/SPARK-27610_master.
Authored-by: Adi Muraru <amuraru@adobe.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`.
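For illustration, a minimal sketch of the idea (not the PR's code): keeping the last record of each downsampling chunk means every reported threshold is an actual score from the data rather than an interpolated value.
```scala
// stand-in for sorted score records; grouped(n) forms the downsampling chunks
val scores = (1 to 10).map(_ / 10.0)
val downsampled = scores.grouped(3).map(_.last).toSeq
// Seq(0.3, 0.6, 0.9, 1.0): each retained threshold is a real score
```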
## How was this patch tested?
A new unit test is added to verify thresholds from downsampled records.
Closes#24470 from shishaochen/spark-mllib-binary-metrics.
Authored-by: Shaochen Shi <shishaochen@bytedance.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In SPARK-27612, a correctness issue was reported: when protocol 4 is used to pickle Python objects, the unpickled objects were wrong. A temporary fix was proposed by not using the highest protocol.
It was found that Opcodes.MEMOIZE appeared among the opcodes emitted by protocol 4 and was the suspected cause of this issue.
A deeper dive found that Opcodes.MEMOIZE stores objects into an internal map of the Unpickler object. We use a single Unpickler object to unpickle serialized Python bytes, and the stored objects interfere with the next round of unpickling if the map is not cleared.
We have two options:
1. Continue to reuse the Unpickler, but call its close after each unpickling.
2. Do not reuse the Unpickler, and create a new Unpickler object for each unpickling.
This patch takes option 1.
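A minimal sketch of option 1 (assuming the Pyrolite `Unpickler` API with its `loads` and `close` methods):
```scala
import net.razorvine.pickle.Unpickler

// Reuse a single Unpickler, but close it after each round so objects
// memoized via Opcodes.MEMOIZE cannot leak into the next unpickling.
val unpickler = new Unpickler
def unpickle(bytes: Array[Byte]): AnyRef = {
  val obj = unpickler.loads(bytes)
  unpickler.close() // clears the internal memo map
  obj
}
```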
## How was this patch tested?
Passing the test added in SPARK-27612 (#24519).
Closes#24521 from viirya/SPARK-27629.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
When the transform(...) method is called on a LinearRegressionModel created directly with coefficients and an intercept, the following exception is encountered.
```
java.util.NoSuchElementException: Failed to find a default value for loss
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params$class.$(params.scala:786)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
```
This is because validateAndTransformSchema() is called during both the training and scoring phases, but the checks against training-related params like loss should really be performed only during the training phase, I think; please correct me if I'm missing anything :)
This issue was first reported for mleap (https://github.com/combust/mleap/issues/455) because basically when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params.
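For illustration, a hypothetical sketch of the shape of such a fix (the name and signature are illustrative, not the PR's code):
```scala
import org.apache.spark.sql.types.StructType

// Gate checks of training-only params (like `loss`) behind a `fitting`
// flag, so scoring a model built directly from coefficients never reads
// params that were never set and have no default.
def validateAndTransformSchema(schema: StructType, fitting: Boolean): StructType = {
  if (fitting) {
    // training-time-only checks would go here, e.g. validating `loss`
  }
  schema
}
```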
## How was this patch tested?
Added a unit test to check this scenario.
Please let me know if there's anything additional required, this is the first PR that I've raised in this project.
Closes#24509 from ancasarb/linear_regression_params_fix.
Authored-by: asarb <asarb@expedia.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
There is a MemorySink v2 already so v1 can be removed. In this PR I've removed it completely.
What this PR contains:
* V1 memory sink removal
* V2 memory sink renamed to become the only implementation
* Since DSv2 sends exceptions in a chained format (linking them with the cause field), I've made the Python side compliant
* Adapted all the tests
## How was this patch tested?
Existing unit tests.
Closes#24403 from gaborgsomogyi/SPARK-23014.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785
For Spark, it comes up mostly where the code plays fast and loose with generic types, not in the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])`, because doing so requires inferring the existence of some type that satisfies both. It seems obvious if the generic type is a wildcard, but it's not technically something Scala lets you get away with.
This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic.
Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code.
Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`.
For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately.
One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. This caused a number of small but positive changes to callers that otherwise had to cast the result.
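For illustration, a minimal sketch of the two styles (hypothetical helper, not the PR's exact code):
```scala
// Wildcard style: each Class[_] is its own existential type, so values of
// type (String, Class[_]) are awkward for Scala to unify.
def classForNameWildcard(name: String): Class[_] = Class.forName(name)

// Parameterized style: the call site fixes T, so the result needs no cast
// and no scala.language.existentials import.
def classForName[T](name: String): Class[T] =
  Class.forName(name).asInstanceOf[Class[T]]

val c: Class[Thread] = classForName[Thread]("java.lang.Thread")
```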
In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent.
Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types.
After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings.
## How was this patch tested?
Existing tests.
Closes#24431 from srowen/SPARK-27536.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with.
## How was this patch tested?
Existing docs build
Closes#24443 from srowen/genjavadoc013.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Otherwise, tests that use tables from multiple sessions will run into issues if they access the same table. The correct location is in shared state.
A couple other minor test improvements.
cc gatorsmile srinathshankar
## How was this patch tested?
Existing unit tests.
Closes#24302 from ericl/test-conflicts.
Lead-authored-by: Eric Liang <ekl@databricks.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix a failure in the Spark image datasource when it encounters certain illegal images.
This is related to bugs inside `ImageIO.read`, so in the Spark code I added exception handling for it.
## How was this patch tested?
N/A
Closes#24362 from WeichenXu123/fix_image_ds_bug.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
This should reduce the total runtime of these tests from about 2 minutes to about 25 seconds.
## How was this patch tested?
Existing tests
Closes#24360 from srowen/SpeedQDS.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Loosen some tolerances in the ML regression-related tests, as they seem to account for some of the top slow tests in https://spark-tests.appspot.com/slow-tests
These changes are good for about a 25 second speedup on my laptop.
## How was this patch tested?
Existing tests
Closes#24351 from srowen/SpeedReg.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix build warnings -- see some details below.
But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds", which I'd like to simplify to things like "2.minutes" anyway.
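For instance, a small sketch of the simplification (illustrative):
```scala
import scala.concurrent.duration._

// Postfix syntax warns without scala.language.postfixOps:
//   val timeout = 120000 milliseconds
// Dotted syntax compiles cleanly and reads at least as well:
val timeout = 120000.milliseconds
val simpler = 2.minutes
```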
## How was this patch tested?
Existing tests.
Closes#24314 from srowen/SPARK-27404.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove deprecated / no-op mllib.KMeans getRuns, setRuns
mllib.KMeans has getRuns, setRuns methods which haven't done anything since Spark 2.1. They're deprecated, and no-ops, and should be removed for Spark 3.
## How was this patch tested?
Existing tests.
Closes#24320 from srowen/SPARK-27410.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Changes proposed:
- Adding a method to compute the treeAggregate depth required to avoid exceeding the driver max result size (first commit)
- Using it in the computation of the Gramian of RowMatrix (second commit)
Tests:
- Unit-test-wise, one unit test checking the behavior of the depth computation method
- Tested at scale on a Hadoop cluster by doing PCA on a large dataset (needed depth 3 to succeed)
Debatable choice:
I'm not sure if the RDD API is the right place to put the depth computation method. The advantage is that it gives access to the driver max result size and the RDD's number of partitions, to set default arguments for the method. Semantically, though, such a method might belong to something like org.apache.spark.util.Utils.
Closes#23983 from gagafunctor/Heuristic_for_treeAggregate_depth.
Authored-by: Rafael Renaudin <renaudin.rafael@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Followup to PR https://github.com/apache/spark/pull/17085
This PR adds the weight column to the pyspark side, which was already added to the scala API.
The PR also undoes a name change in the scala side corresponding to a change in another similar PR as noted here:
https://github.com/apache/spark/pull/17084#discussion_r259648639
## How was this patch tested?
This patch adds python tests for the changes to the pyspark API.
Closes#24197 from imatiach-msft/ilmat/regressor-eval-python.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer), so they can be excluded from the hash calculation because they don't differentiate inputs.
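A minimal sketch of the fix as described (assumed shape, not necessarily the exact code):
```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Allocate exactly the 8 bytes a Long occupies; the 56 trailing zero
// bytes of a 64-byte buffer never differentiate inputs anyway.
def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  val lowBits = MurmurHash3.bytesHash(bytes)
  val highBits = MurmurHash3.bytesHash(bytes, lowBits)
  (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
}
```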
## How was this patch tested?
By running the existing tests - XORShiftRandomSuite
Closes#20793 from MaxGekk/hash-buff-size.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The `Saveable` interface introduces `formatVersion`, which is protected and used nowhere. So the PR proposes to remove it.
## How was this patch tested?
existing tests
Closes#22830 from mgaido91/SPARK-25838.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add 'Recall_at_k' metric to RankingMetrics
## How was this patch tested?
Add test to RankingMetricsSuite.
Closes#23881 from masa3141/SPARK-26981.
Authored-by: masa3141 <masahiro@kazama.tv>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove a redundant `get` when getting a value from a `Map` given a key.
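For illustration, a minimal example of the pattern being cleaned up (inferred from the description, not the PR's exact code):
```scala
val m = Map("a" -> 1)
// Before: `get` builds an Option that is immediately unwrapped
val v1 = m.get("a").getOrElse(0)
// After: the redundant `get` is dropped
val v2 = m.getOrElse("a", 0)
```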
## How was this patch tested?
N/A
Closes#23901 from 10110346/removegetfrommap.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add sample weights to decision trees
## How was this patch tested?
updated testsuites
Closes#23818 from zhengruifeng/py_tree_support_sample_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Comparing whether a Boolean expression is equal to `true` is redundant.
For example, where the datatype of `a` is Boolean:
Before:
if (a == true)
After:
if (a)
## How was this patch tested?
N/A
Closes#23884 from 10110346/simplifyboolean.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add reference JAXB impl for Java 9+ from Glassfish. Right now it's only apparently necessary in MLlib but can be expanded later.
## How was this patch tested?
Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.
Closes#23890 from srowen/SPARK-26986.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed the PR: https://github.com/apache/spark/pull/16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.
## How was this patch tested?
I added tests to the metrics and evaluators classes.
Closes#17084 from imatiach-msft/ilmat/binary-evalute.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures from JPMML relating to JAXB when running on Java 11. It's shaded and not a big change, so it should be safe.
## How was this patch tested?
Existing tests.
Closes#23868 from srowen/SPARK-26966.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
expose method `predict` in KMeans/BiKMeans/GMM
## How was this patch tested?
added testsuites
Closes#22087 from zhengruifeng/clu_pre_instance.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This patch aims to address flakiness I've observed in MLEventsSuite in these tests:
* test("pipeline read/write events")
* test("pipeline model read/write events")
The issue is in the "read/write events" tests, which work as follows:
* write
* wait until we see at least 1 write-related SparkListenerEvent
* read
* wait until we see at least 1 read-related SparkListenerEvent
The problem is that the last step does NOT allow any write-related SparkListenerEvents, but some of those events may be delayed enough that they are seen in this last step. We should ideally add logic before "read" to wait until the listener events are cleared/complete. Looking into other SparkListener tests, we need to use `sc.listenerBus.waitUntilEmpty(TIMEOUT)`.
This patch adds the waitUntilEmpty() call.
## How was this patch tested?
It's a test!
Closes#23863 from jkbradley/SPARK-26960.
Authored-by: Joseph K. Bradley <joseph@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Our feature importance calculation is taken from sklearn's one, which has been recently fixed (in https://github.com/scikit-learn/scikit-learn/pull/11176). Citing the description of that PR:
> Because the feature importances are (currently, by default) normalized and then averaged, feature importances from later stages are overweighted.
The PR performs a fix similar to sklearn's: the per-tree normalization of the feature importance is skipped for GBT.
Credit to Daniel Jumper for clearly pointing out the issue and the sklearn PR.
## How was this patch tested?
modified UT; checked that the computed `featureImportance` in that test is similar to sklearn's (it can't be identical, because the trees may be slightly different)
Closes#23773 from mgaido91/SPARK-26721.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to use `System.nanoTime()` instead of `System.currentTimeMillis()` in measurements of time intervals.
`System.currentTimeMillis()` returns current wallclock time and will follow changes to the system clock. Thus, negative wallclock adjustments can cause timeouts to "hang" for a long time (until wallclock time has caught up to its previous value again). This can happen when ntpd does a "step" after the network has been disconnected for some time. The most canonical example is during system bootup when DHCP takes longer than usual. This can lead to failures that are really hard to understand/reproduce. `System.nanoTime()` is guaranteed to be monotonically increasing irrespective of wallclock changes.
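A minimal sketch of the pattern (illustrative):
```scala
import java.util.concurrent.TimeUnit

// System.nanoTime() is monotonic, so the difference below is a valid
// interval even if the wall clock is adjusted mid-measurement.
val start = System.nanoTime()
Thread.sleep(100) // the work being timed
val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
```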
## How was this patch tested?
By existing test suites.
Closes#23727 from MaxGekk/system-nanotime.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Make .unpersist(), .destroy() non-blocking by default and adjust callers to request blocking only where important.
This also adds an optional blocking argument to Pyspark's RDD.unpersist(), which never had one.
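For illustration, a small sketch of the resulting behavior (assuming the changed defaults):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 10).cache()
rdd.count()

rdd.unpersist()                 // now non-blocking by default
rdd.unpersist(blocking = true)  // opt back into blocking where it matters
```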
## How was this patch tested?
Existing tests.
Closes#23685 from srowen/SPARK-26771.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Error message falsely states standardization=True is causing a problem, even when standardization=False. The real issue is standardizeLabels=True, which is set automatically in LinearRegression and not currently available in the Public API.
## What changes were proposed in this pull request?
A simple change to an error message. More details here: https://jira.apache.org/jira/browse/SPARK-26787
## How was this patch tested?
This does not change any functionality.
Closes#23705 from bscan/bscan-errormsg-1.
Authored-by: bscan <brianjscannell@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
This is a follow-up to PR:
https://github.com/apache/spark/pull/21632
## What changes were proposed in this pull request?
This PR tunes the tolerance used for deciding whether to add zero feature values to a value-count map (where the key is the feature value and the value is the weighted count of those feature values).
In the previous PR the tolerance scaled by the square of the unweighted number of samples, which is too aggressive for a large number of unweighted samples. Unfortunately using just "Utils.EPSILON * unweightedNumSamples" is not enough either, so I multiplied that by a factor tuned by the testing procedure below.
## How was this patch tested?
This involved manually running the sample weight tests for decision tree regressor to see whether the tolerance was large enough to exclude zero feature values.
Eg in SBT:
```
./build/sbt
> project mllib
> testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
```
For validation, I added a print inside the if in the code below and validated that the tolerance was large enough so that we would not include zero features (which don't exist in that test):
```
val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
print("should not print this")
partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
} else {
partValueCountMap
}
```
Closes#23682 from imatiach-msft/ilmat/sample-weights-tol.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This takes over #19621 to add multi-column support to StringIndexer:
1. Supports encoding multiple columns (see the sketch below).
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.
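A minimal sketch of the multi-column usage (assuming the setters this change introduces):
```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCols(Array("color", "country"))
  .setOutputCols(Array("colorIndex", "countryIndex"))
  .setStringOrderType("frequencyDesc") // equal-frequency ties now sort alphabetically
```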
## How was this patch tested?
Added tests.
Closes#20146 from viirya/SPARK-11215.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR proposes to add ML events to Instrumentation, and use it in Pipeline so that other developers can track and add some actions for them.
## Introduction
ML events (like SQL events) can be quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating
Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below:
![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)
Another thing that might have to be considered is that we can connect this with other SQL/Streaming events, for instance, where the input `Dataset` originated. For instance, with current Apache Spark, I can visualise SQL operations as below:
![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)
I think we can combine those existing lineages together to easily understand where the data comes from and goes to. Currently, the ML side is a hole, so the lineages can't be connected in current Apache Spark.
To sum up, it goes without saying how useful it is to track SQL/Streaming operations; likewise, I would like to propose ML events as well (as lowest-stability `Unstable` APIs for now, with no guarantee about stability).
## Implementation Details
### Sends event (but not expose ML specific listener)
**`mllib/src/main/scala/org/apache/spark/ml/events.scala`**
```scala
@Unstable
case class ...StartEvent(caller, input)
@Unstable
case class ...EndEvent(caller, output)
trait MLEvents {
// Wrappers to send events:
// def with...Event(body) = {
// body()
// SparkContext.getOrCreate().listenerBus.post(event)
// }
}
```
This trait is used by `Instrumentation`.
```scala
class Instrumentation ... with MLEvents {
```
and used as below:
```scala
instrumented { instr =>
instr.with...Event(...) {
...
}
}
```
This approach mimics both:
**1. Catalog events (see `org/apache/spark/sql/catalyst/catalog/events.scala`)**
- This allows a Catalog specific listener to be added `ExternalCatalogEventListener`
- It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener`
which delegates the operations to `ExternalCatalog`
This is not quite possible in this case because most instances (like `Pipeline`) will be created directly in most cases. We might be able to do that via extending `ListenerBus` for all possible instances, but IMHO it's too invasive. Also, exposing another ML-specific listener sounds like a bit too much at this stage. Therefore, I simply borrowed the file name and structure here.
**2. SQL execution events (see `org/apache/spark/sql/execution/SQLExecution.scala`)**
- Add an object that wraps a body to send events
The current approach is rather close to this: it has a `with...` wrapper to send events. I borrowed this approach to be consistent.
## Usage
It needs a custom implementation for a query listener. For instance,
with the custom listener below:
```scala
class CustomMLListener extends SparkListener {
  override def onOtherEvent(e: SparkListenerEvent): Unit = e match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```
There are two (existing) ways to use this.
```scala
spark.sparkContext.addSparkListener(new CustomMLListener)
```
```bash
spark-submit ...\
--conf spark.extraListeners=CustomMLListener\
...
```
This is also similar to the existing implementation on the SQL side.
## Target users
1. I think someone in general would likely utilise this feature like other event listeners. At least, I can see some interest in it outside the project.
- SQL Listener
- https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
- http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html
- Streaming Query Listener
- https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
- http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416
2. Someone would likely run this via Atlas. The plugin mirror is intentionally exposed at [spark-atlas-connector](https://github.com/hortonworks-spark/spark-atlas-connector) so that anyone could do something about lineage and governance in Atlas. I'm trying to show integrated lineages in Apache Spark, but this is a missing hole.
## How was this patch tested?
Manually tested and unit tests were added.
Closes#23263 from HyukjinKwon/SPARK-23674-1.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This updates PR https://github.com/apache/spark/pull/16722 to latest master.
## What changes were proposed in this pull request?
This patch adds support for sample weights to DecisionTreeRegressor and DecisionTreeClassifier.
Note: This patch does not add support for sample weights to RandomForest. As discussed in the JIRA, we would like to add sample weights into the bagging process. This patch is large enough as is, and there are some additional considerations to be made for random forests. Since the machinery introduced here needs to be present regardless, I have opted to leave random forests for a follow up pr.
## How was this patch tested?
The algorithms are tested to ensure that:
1. Arbitrary scaling of constant weights has no effect
2. Outliers with small weights do not affect the learned model
3. Oversampling and weighting are equivalent
Unit tests are also added to test other smaller components.
## Summary of changes
- Impurity aggregators now store weighted sufficient statistics. They also hold the raw count, since that is needed to use minInstancesPerNode.
- This patch maintains the meaning of minInstancesPerNode, in that the parameter still corresponds to raw, unweighted counts. It also adds a new parameter minWeightFractionPerNode which requires that nodes must contain at least minWeightFractionPerNode * weightedNumExamples total weight.
- This patch modifies findSplitsForContinuousFeatures to use weighted sums. Unit tests are added.
- TreePoint is modified to hold a sample weight
- BaggedPoint is modified from:
``` Scala
private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) extends Serializable
```
to
``` Scala
private[spark] class BaggedPoint[Datum](
val datum: Datum,
val subsampleCounts: Array[Int],
val sampleWeight: Double) extends Serializable
```
We do not simply multiply the counts by the weight and store that, because we need both the raw counts and the weight in order to use both minInstancesPerNode and minWeightPerNode.
**Note**: many of the changed files are due simply to using Instance instead of LabeledPoint
Closes#21632 from imatiach-msft/ilmat/sample-weights.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Avoid memory problems in closure cleaning when handling large Gramians (>= 16K rows/cols) by using null as zeroValue
## How was this patch tested?
Existing tests.
Note that it's hard to test the case that triggers this issue as it would require a large amount of memory and run a while. I confirmed locally that a 16K x 16K Gramian failed with tons of driver memory before, and didn't fail upfront after this change.
Closes#23600 from srowen/SPARK-26228.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR makes hardcoded `spark.dynamicAllocation`, `spark.scheduler`, `spark.rpc`, `spark.task`, `spark.speculation`, and `spark.cleaner` configs to use `ConfigEntry`.
## How was this patch tested?
Existing tests
Closes#23416 from kiszk/SPARK-26463.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.
* The document frequency is returned as an `Array[Long]`
* If the minimum document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not strictly required, but it can be useful if we later provide a provision for users to supply their own idf function instead of the default (log((1+m)/(1+df))). In such cases, the user can provide a function taking `m` and `df` as input and returning the idf value (see the sketch below)
* Pyspark changes
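A minimal sketch of reading the newly exposed values (assuming the accessor names follow the description above, and a DataFrame `docs` with a `rawFeatures` vector column):
```scala
import org.apache.spark.ml.feature.IDF

val idfModel = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMinDocFreq(2)
  .fit(docs)

val docFreq: Array[Long] = idfModel.docFreq // per-term document frequency
val numDocs: Long = idfModel.numDocs        // total number of documents, m
```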
## How was this patch tested?
The existing test case was edited to also check for the document frequency values.
I am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything
Reviewer request: mengxr zjffdu yinxusen
Closes#23549 from purijatin/master.
Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Currently, there are some minor inconsistencies in the docs compared to the code. In this PR, I am correcting those inconsistencies:
1) Links related to the evaluation metrics in the docs are not working
2) Minor corrections to the evaluation metrics formulas in the docs
## How was this patch tested?
NA
Closes#23589 from shahidki31/docCorrection.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR makes hardcoded `spark.unsafe` configs to use ConfigEntry and put them in the `config` package.
## How was this patch tested?
Existing UTs
Closes#23412 from kiszk/SPARK-26477.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
The PR makes hardcoded configs below to use `ConfigEntry`.
* spark.kryo
* spark.kryoserializer
* spark.serializer
* spark.jars
* spark.files
* spark.submit
* spark.deploy
* spark.worker
This patch doesn't change configs which are not relevant to SparkConf (e.g. system properties).
## How was this patch tested?
Existing tests.
Closes#23532 from HeartSaVioR/SPARK-26466-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
If users set equivalent values to spark.network.timeout and spark.executor.heartbeatInterval, they get the following message:
```
java.lang.IllegalArgumentException: requirement failed: The value of spark.network.timeout=120s must be no less than the value of spark.executor.heartbeatInterval=120s.
```
But it's misleading since it can be read as they could be equal. So this PR replaces "no less than" with "greater than". Also, it fixes similar inconsistencies found in MLlib and SQL components.
## How was this patch tested?
Ran Spark with equivalent values for them manually and confirmed that the revised message was displayed.
Closes#23488 from sekikn/SPARK-26564.
Authored-by: Kengo Seki <sekikn@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The following test case results in the failure below; currently in ml.PIC, there is no check on the data type of the weight column.
```
test("invalid input types for weight") {
val invalidWeightData = spark.createDataFrame(Seq(
(0L, 1L, "a"),
(2L, 3L, "b")
)).toDF("src", "dst", "weight")
val pic = new PowerIterationClustering()
.setWeightCol("weight")
val result = pic.assignClusters(invalidWeightData)
}
```
```
Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor driver): scala.MatchError: [0,1,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
```
In this PR, type checks for the weight column are added.
## How was this patch tested?
UT added
Closes#21509 from shahidki31/testCasePic.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
## What changes were proposed in this pull request?
This PR upgrades Mockito from 1.10.19 to 2.23.4. The following changes are required.
- Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers`
- Replace `anyObject` with `any`
- Replace `getArgumentAt` with `getArgument` and add type annotation.
- Use `isNull` matcher in case of `null` is invoked.
```scala
saslHandler.channelInactive(null);
- verify(handler).channelInactive(any(TransportClient.class));
+ verify(handler).channelInactive(isNull());
```
- Make and use `doReturn` wrapper to avoid [SI-4775](https://issues.scala-lang.org/browse/SI-4775)
```scala
private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, Seq.empty: _*)
```
## How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#23452 from dongjoon-hyun/SPARK-26536.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
The PR adds the `trainingCost` value to the `BisectingKMeansSummary`, in order to expose the information retrievable by running `computeCost` on the training dataset. This fills the gap with `KMeans` implementation.
## How was this patch tested?
improved UTs
Closes#22764 from mgaido91/SPARK-25765.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add Instrumentation to PrefixSpan
## How was this patch tested?
existing tests
Closes#22971 from zhengruifeng/log_PrefixSpan.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
Add a trait HasTrainingSummary to avoid code duplication related to training summaries.
Currently all the training summaries use a similar pattern, which can be generalized:
```
private[ml] final var trainingSummary: Option[T] = None
def hasSummary: Boolean = trainingSummary.isDefined
def summary: T = trainingSummary.getOrElse...
private[ml] def setSummary(summary: Option[T]): ...
```
Classes with the trait need to override `setSummary`. And for Java compatibility, they will also have to override the `summary` method; otherwise Java code will regard all the summary classes as Object, due to a known issue with Scala.
## How was this patch tested?
existing Java and Scala unit tests
Closes#17654 from hhbyyh/hassummary.
Authored-by: Yuhao Yang <yuhao.yang@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed the PR: https://github.com/apache/spark/pull/16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.
The updates to the regression metrics were based on (and updated with new changes based on comments):
https://issues.apache.org/jira/browse/SPARK-11520
("RegressionMetrics should support instance weights")
but the pull request was closed as the changes were never checked in.
## How was this patch tested?
I added tests to the metrics class.
Closes#17085 from imatiach-msft/ilmat/regression-evaluate.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add PowerIterationClustering (PIC) in R.
## How was this patch tested?
Add test case
Closes#23072 from huaxingao/spark-19827.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
… incorrect.
## What changes were proposed in this pull request?
In the reported heartbeat information, the memory data is in bytes and is converted by the formatBytes() function in the utils.js file before being displayed in the interface. The base of the unit conversion in the formatBytes function is 1000, but it should be 1024.
This change sets the base of the unit conversion in the formatBytes function to 1024.
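For illustration, the same conversion expressed in Scala (the actual fix is in utils.js; this is just a sketch of base-1024 formatting):
```scala
def formatBytes(bytes: Double): String = {
  val units = Array("B", "KB", "MB", "GB", "TB")
  if (bytes <= 0) return "0.0 B"
  // base 1024, not 1000: 1 KB here means 1024 bytes
  val i = math.min((math.log(bytes) / math.log(1024)).toInt, units.length - 1)
  f"${bytes / math.pow(1024, i)}%.1f ${units(i)}"
}
```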
## How was this patch tested?
manual tests
Closes#22683 from httfighter/SPARK-25696.
Lead-authored-by: 韩田田00222924 <han.tiantian@zte.com.cn>
Co-authored-by: han.tiantian@zte.com.cn <han.tiantian@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Enhance the accuracy of the covariance logic in RowMatrix's computeCovariance function.
## How was this patch tested?
Unit test
Accuracy test
Closes#23126 from KyleLi1985/master.
Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add the following params to the instrumentation:
GBTC: validationTol
GBTR: validationTol, validationIndicatorCol
column names in LiR, LinearSVC, etc.
## How was this patch tested?
existing tests
Closes#23122 from zhengruifeng/instr_append_missing_params.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
We have deprecated `OneHotEncoder` at Spark 2.3.0 and introduced `OneHotEncoderEstimator`. At 3.0.0, we remove deprecated `OneHotEncoder` and rename `OneHotEncoderEstimator` to `OneHotEncoder`.
TODO: According to ML migration guide, we need to keep `OneHotEncoderEstimator` as an alias after renaming. This is not done at this patch in order to facilitate review.
## How was this patch tested?
Existing tests.
Closes#23100 from viirya/remove_one_hot_encoder.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
ignore empty output columns
## How was this patch tested?
added tests
Closes#22991 from zhengruifeng/ovrm_empty_outcol.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
use the base models' `numFeatures` instead of a `first` job
## How was this patch tested?
existing tests
Closes#23123 from zhengruifeng/avoid_first_job.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The DOI foundation recommends [this new resolver](https://www.doi.org/doi_handbook/3_Resolution.html#3.8). Accordingly, this PR re`sed`s all static DOI links ;-)
## How was this patch tested?
It wasn't, since it seems as safe as a "[typo fix](https://spark.apache.org/contributing.html)".
In case any of the files is included from other projects, and should be updated there, please let me know.
Closes#23129 from katrinleinweber/resolve-DOIs-securely.
Authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Optimization [SPARK-12869] was made for dense matrices, but it caused a severe performance issue for sparse matrices, because manipulating them directly is very inefficient. When manipulating sparse matrices in Breeze, we had better use VectorBuilder.
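A minimal sketch of the approach (assuming Breeze's `VectorBuilder` API):
```scala
import breeze.linalg.VectorBuilder

// Accumulate sparse updates cheaply in a builder, then materialize once,
// instead of repeatedly manipulating a sparse vector in place.
val vb = new VectorBuilder[Double](1000000)
vb.add(3, 1.5)
vb.add(42, -0.25)
val sv = vb.toSparseVector
```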
## How was this patch tested?
Checked it against one of our use cases which, after moving to Spark 2, took 6.5 hours instead of 20 minutes. After the change it is back to 20 minutes again.
Closes#16732 from uzadude/SparseVector_optimization.
Authored-by: oraviv <oraviv@paypal.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR removes the deprecated method `computeCost` of `KMeans`.
## How was this patch tested?
NA
Closes#22875 from mgaido91/SPARK-25867.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The setter methods have been deprecated since 2.1 for the models of regression and classification using trees. The deprecation stated that the methods would be removed in 3.0. Hence the PR removes the deprecated methods.
## How was this patch tested?
NA
Closes#23093 from mgaido91/SPARK-26127.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The build has a lot of deprecation warnings. Some are new in Scala 2.12 and Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy miscellaneous ones here.
They're too numerous and small to list here; see the pull request. Some highlights:
- `BeanInfo` is deprecated in 2.12, and BeanInfo classes are pretty ancient in Java. Instead, case classes can explicitly declare getters
- Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
- Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
- finalize() is finally deprecated (just needs to be suppressed)
- StageInfo.attemptId was deprecated and easiest to remove here
I'm not now going to touch some chunks of deprecation warnings:
- Parquet deprecations
- Hive deprecations (particularly serde2 classes)
- Deprecations in generated code (mostly Thriftserver CLI)
- ProcessingTime deprecations (we may need to revive this class as internal)
- many MLlib deprecations because they concern methods that may be removed anyway
- a few Kinesis deprecations I couldn't figure out
- Mesos get/setRole, which I don't know well
- Kafka/ZK deprecations (e.g. poll())
- Kinesis
- a few other ones that will probably resolve by deleting a deprecated method
## How was this patch tested?
Existing tests, including manual testing with the 2.11 build and Java 11.
Closes#23065 from srowen/SPARK-26090.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This restores scaladoc artifact generation, which got dropped with the Scala 2.12 update. The change looks large, but is almost all due to needing to make the InterfaceStability annotations top-level classes (i.e. `InterfaceStability.Stable` -> `Stable`), unfortunately. A few inner class references had to be qualified too.
Lots of scaladoc warnings now reappear. We can choose to disable generation by default and enable for releases, later.
## How was this patch tested?
N/A; build runs scaladoc now.
Closes#23069 from srowen/SPARK-26026.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Our `GBTClassifier` supports only the `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user, and it differs from the actual impurity used, which is `variance`. This issue is not limited to a wrong value being returned if the user queries `getImpurity`; it also affects loading a saved model, since its `impurityStats` are created as `gini` (as this is the value stored as the model's impurity), which leads to wrong `featureImportances` in models loaded from saved ones.
The PR changes the `impurity` param used to one which allows only the value `variance`.
## How was this patch tested?
modified UT
Closes#22986 from mgaido91/SPARK-25959.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In MLlib's PrefixSpan run method, a cached RDD stays in the cache: after run has completed, the RDD remains cached.
We need to unpersist the cached RDD after the run method.
## How was this patch tested?
Existing tests
Closes#23016 from shahidki31/SPARK-26006.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
register the following classes in Kryo:
"org.apache.spark.ml.stat.distribution.MultivariateGaussian",
"org.apache.spark.mllib.stat.distribution.MultivariateGaussian"
## How was this patch tested?
added tests
Due to existing module dependencies, I cannot import spark-core in mllib-local's test suites, so I did not add a test suite in `org.apache.spark.ml.stat.distribution.MultivariateGaussianSuite`.
I also noticed that the class `ClusterStats` in `ClusteringEvaluator` is registered in a different way; should it be modified to keep in line with the others in ML? srowen
Closes#22974 from zhengruifeng/kryo_MultivariateGaussian.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR makes Scala 2.12 Spark's default Scala version, with Scala 2.11 as the alternative version. This implies that Scala 2.12 will be used by our CI builds, including pull request builds.
We'll update the Jenkins to include a new compile-only jobs for Scala 2.11 to ensure the code can be still compiled with Scala 2.11.
## How was this patch tested?
existing tests
Closes#22967 from dbtsai/scala2.12.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Add Scala and Java lint check rules to ban the usage of `throw new xxxErrors`, and fix up all existing instances, following https://github.com/apache/spark/pull/22989#issuecomment-437939830. See more details in https://github.com/apache/spark/pull/22969.
## How was this patch tested?
Local test with lint-scala and lint-java.
Closes#22989 from xuanyuanking/SPARK-25986.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
…. Other related changes to get JDK 11 working, to test
## What changes were proposed in this pull request?
- Access `sun.misc.Cleaner` (Java 8) and `jdk.internal.ref.Cleaner` (JDK 9+) by reflection (note: the latter only works if illegal reflective access is allowed)
- Access `sun.misc.Unsafe.invokeCleaner` in Java 9+ instead of `sun.misc.Cleaner` (Java 8)
In order to test anything on JDK 11, I also fixed a few small things, which I include here:
- Fix minor JDK 11 compile issues
- Update scala plugin, Jetty for JDK 11, to facilitate tests too
This doesn't mean JDK 11 tests all pass now, but lots do. Note also that the JDK 9+ solution for the Cleaner has a big caveat.
## How was this patch tested?
Existing tests. Manually tested JDK 11 build and tests, and tests covering this change appear to pass. All Java 8 tests should still pass, but this change alone does not achieve full JDK 11 compatibility.
Closes#22993 from srowen/SPARK-24421.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix fastSquaredDistance to address a performance problem in the dense-dense calculation and, meanwhile, enhance the calculation accuracy.
## How was this patch tested?
Testing from different angles after adding this patch shows that the dense-dense case performance is enhanced, without adversely affecting the other calculation cases (sparse-sparse, sparse-dense).
**For calculation logic test**
Here are my tests for the sparse-sparse, dense-dense, and sparse-dense cases, with results below.
First, we define some branch-path logic for the sparse-sparse and sparse-dense cases:
- if precisionBound1 is met, we call it LOGIC1
- if precisionBound1 is not met, and precisionBound2 is not met, we call it LOGIC2
- if precisionBound1 is not met, but precisionBound2 is met, we call it LOGIC3
(There is a trick: you can manually change the precision value to produce each of the above situations.)
sparse-sparse case time cost (milliseconds):

| |LOGIC1|LOGIC2|LOGIC3|
|---|---|---|---|
|Before patch|7786, 7970, 8086|8412, 9029, 8606|19365, 19146, 19351|
|After patch|7729, 7653, 7903|8603, 8724, 9024|18917, 19007, 19074|

sparse-dense case time cost (milliseconds):

| |LOGIC1|LOGIC2|LOGIC3|
|---|---|---|---|
|Before patch|4195, 4014, 4409|4968, 5579, 5080|11848, 12077, 12168|
|After patch|4081, 3971, 4151|4980, 5472, 5148|11718, 11874, 11743|

For the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.

dense-dense case time cost (milliseconds):

|Before patch|After patch|
|---|---|
|7340, 7816, 7672|5752, 5800, 5753|
**For real world data test**
Test data: http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems, using the extracted files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test set: 13230 instances in total, 6000 attributes per line.
Result for the sparse-sparse situation, time cost (milliseconds):

|Before enhancement|After enhancement|
|---|---|
|7670, 7704, 7652|7634, 7729, 7645|
Closes#22893 from KyleLi1985/updatekmeanpatch.
Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Clarify Bucketizer handleInvalid docs. Just a resubmit of https://github.com/apache/spark/pull/17169
## How was this patch tested?
N/A
Closes#23003 from srowen/SPARK-19714.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Deprecated in Java 11: replace Class.newInstance with Class.getConstructor().newInstance(), and primitive wrapper class constructors with valueOf or equivalent.
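For illustration, a small sketch of the replacements (illustrative values):
```scala
val clazz = Class.forName("java.lang.StringBuilder")

// Deprecated since Java 9:  clazz.newInstance()
val instance = clazz.getConstructor().newInstance()

// Deprecated wrapper constructor:  new java.lang.Integer(42)
val boxed = java.lang.Integer.valueOf(42)
```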
## How was this patch tested?
Existing tests.
Closes#22988 from srowen/SPARK-25984.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed the PR: https://github.com/apache/spark/pull/16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.
Note: I've updated the JIRA to:
https://issues.apache.org/jira/browse/SPARK-24101
Which is a child of JIRA:
https://issues.apache.org/jira/browse/SPARK-18693
## How was this patch tested?
I added tests to the metrics class.
Closes#17086 from imatiach-msft/ilmat/multiclass-evaluate.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
- Remove some AccumulableInfo .apply() methods
- Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy
- Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only deprecated)
- Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only deprecated)
- Remove unused Python StorageLevel constants
- Remove Dataset unionAll in favor of union
- Remove unused multiclass option in libsvm parsing
- Remove references to deprecated spark configs like spark.yarn.am.port
- Remove TaskContext.isRunningLocally
- Remove ShuffleMetrics.shuffle* methods
- Remove BaseReadWrite.context in favor of session
- Remove Column.!== in favor of =!=
- Remove Dataset.explode
- Remove Dataset.registerTempTable
- Remove SQLContext.getOrCreate, setActive, clearActive, constructors
Not touched yet
- everything else in MLLib
- HiveContext
- Anything deprecated more recently than 2.0.0, generally
## How was this patch tested?
Existing tests
Closes#22921 from srowen/SPARK-25908.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
JVMs can't allocate arrays of length exactly Int.MaxValue, so ensure we never try to allocate an array that big. This commit changes some defaults & configs to gracefully fall back to something that doesn't require one large array in some cases; in other cases it simply improves the error message for cases which will still fail.
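For context, a hedged sketch of the kind of guard involved (the constant is illustrative, not necessarily the one this commit uses):
```scala
// Most JVMs cannot allocate arrays of length exactly Int.MaxValue because
// of object header overhead, so cap requests a little below it.
def checkArraySize(requested: Long): Int = {
  val maxSafe = Int.MaxValue - 8
  require(requested <= maxSafe, s"Cannot allocate an array of $requested elements")
  requested.toInt
}
```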
Closes#22818 from squito/SPARK-25827.
Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
## What changes were proposed in this pull request?
When we added the `distanceMeasure`, we didn't update the `formatVersion` for `KMeans`. Although this is not a big issue, as that information is used nowhere, we are returning wrong information.
## How was this patch tested?
NA
Closes#22873 from mgaido91/SPARK-25866.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Fix typos and misspellings, per https://github.com/apache/spark-website/pull/158#issuecomment-435790366
## How was this patch tested?
Existing tests.
Closes#22950 from srowen/Typos.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering.
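A usage sketch of the suggested replacement (assumes a `predictions` DataFrame with "features" and "prediction" columns):
```scala
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
  .setMetricName("silhouette") // the default metric
val silhouette = evaluator.evaluate(predictions)
```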
## How was this patch tested?
NA
Closes#22869 from mgaido91/SPARK-25758_3.0.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
Spark PCA supports matrices with at most ~65,535 columns. This is due to the fact that it computes the covariance matrix first and then computes the principal components, and **computing the covariance matrix** was the main bottleneck. The limit comes from the integer size limit: we pass an array of size n*(n+1)/2 to the breeze library, and that size cannot exceed INT_MAX, so n*(n+1)/2 <= Int.MaxValue caps the column count at roughly 65,535.
Currently we don't have such a limitation for computing SVD in Spark, so we can use Spark's SVD to compute the PCA when the number of columns exceeds the limit.
PCA can be computed directly from the SVD of the matrix, instead of finding the covariance matrix first.
The following papers/links are for reference:
https://arxiv.org/pdf/1404.1100.pdf
https://en.wikipedia.org/wiki/Principal_component_analysis#Singular_value_decomposition
http://www.ifis.uni-luebeck.de/~moeller/Lectures/WS-16-17/Web-Mining-Agents/PCA-SVD.pdf
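A minimal sketch of the SVD route, assuming `mat` is a mean-centered `RowMatrix`: the principal directions are the right singular vectors of the data matrix, so the n x n covariance matrix is never materialized.
```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val k = 10 // number of principal components to keep (illustrative)
val svd = mat.computeSVD(k, computeU = false)
val principalComponents = svd.V // n x k matrix whose columns are the principal directions
```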
## How was this patch tested?
added UT, also manually verified with the existing test for pca, by removing the limit condition in the fit method.
Closes#22784 from shahidki31/PCA.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Set main args correctly in BenchmarkBase, to make them accessible to its subclasses.
It will benefit:
- BuiltInDataSourceWriteBenchmark
- AvroWriteBenchmark
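A hedged sketch of the idea (illustrative signatures, not Spark's exact code): the base class's `main` forwards its args so subclasses like the benchmarks above can read them.
```scala
// Illustrative sketch: the base class forwards main's args to the suite runner.
abstract class BenchmarkBase {
  def runBenchmarkSuite(mainArgs: Array[String]): Unit

  def main(args: Array[String]): Unit = {
    runBenchmarkSuite(args) // subclasses can now branch on command-line args
  }
}
```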
## How was this patch tested?
manual tests
Closes#22872 from yucai/main_args.
Authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The following code in BisectingKMeansModel.load calls the wrong version of load.
```
case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
  // Bug: matches the V2_0 class name and format version but loads with SaveLoadV1_0
  val model = SaveLoadV1_0.load(sc, path)
```
Closes#22790 from huaxingao/spark-25793.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
add R API for PrefixSpan
## How was this patch tested?
add test in test_mllib_fpm.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21710 from huaxingao/spark-24207.
## What changes were proposed in this pull request?
The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering.
## How was this patch tested?
NA
Closes#22756 from mgaido91/SPARK-25758.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
...with intercept with L1 regularization
## What changes were proposed in this pull request?
In the test, "multinomial logistic regression with intercept with L1 regularization" in the "LogisticRegressionSuite", taking more than a minute due to training of 2 logistic regression model.
However after analysing the training cost over iteration, we can reduce the computation time by 50%.
Training cost vs iteration for model 1
![image](https://user-images.githubusercontent.com/23054875/46573805-ddab7680-c9b7-11e8-9ee9-63a99d498475.png)
So model1 converges around iteration 150.
Training cost vs iteration for model 2
![image](https://user-images.githubusercontent.com/23054875/46573790-b3f24f80-c9b7-11e8-89c0-81045ad647cb.png)
After around 100 iterations, model2 converges.
So, if we set the maximum iterations for model1 and model2 to 175 and 125 respectively, we can reduce the computation time by half.
## How was this patch tested?
Computation time in local setup :
Before change:
~53 sec
After change:
~26 sec
Closes#22659 from shahidki31/SPARK-25623.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This is the same as #22492 but for master branch. Revert SPARK-14681 to avoid API breaking changes.
cc: WeichenXu123
## How was this patch tested?
Existing unit tests.
Closes#22618 from mengxr/SPARK-25321.master.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Rename the method `benchmark` in `BenchmarkBase` to `runBenchmarkSuite`, and add comments.
Currently the method name `benchmark` is a bit confusing; it is also the same name used for instances of `Benchmark`:
f246813afb/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala (L330-L339)
## How was this patch tested?
Unit test.
Closes#22599 from gengliangwang/renameBenchmarkSuite.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.0.0-SNAPSHOT.
## How was this patch tested?
N/A
Closes#22606 from gatorsmile/bump3.0.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR adds a rule to force `.toLowerCase(Locale.ROOT)` or `.toUpperCase(Locale.ROOT)`.
It produces an error as below:
```
[error] Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you
[error] should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead.
[error] If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with
[error] // scalastyle:off caselocale
[error] .toUpperCase
[error] .toLowerCase
[error] // scalastyle:on caselocale
```
This PR excludes the cases above in the SQL code path for external inputs such as table names and column names.
For test suites, or when it's clear there's no locale problem (like the Turkish locale problem), it uses `Locale.ROOT`.
One minor problem is that `UTF8String` has both `toLowerCase` and `toUpperCase` methods, and the new rule detects them as well; they are ignored.
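A small sketch of why the root locale matters; the Turkish-locale behavior below is the classic example:
```scala
import java.util.Locale

// In a Turkish locale, uppercase 'I' lowercases to the dotless 'ı', which
// silently breaks case-insensitive identifier comparisons.
val turkish = new Locale("tr")
"TITLE".toLowerCase(turkish)     // "tıtle"
"TITLE".toLowerCase(Locale.ROOT) // "title"
```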
## How was this patch tested?
Manually tested, and Jenkins tests.
Closes#22581 from HyukjinKwon/SPARK-25565.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Refactor `UDTSerializationBenchmark` to use main method and print the output as a separate file.
Run the command below to generate benchmark results:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "mllib/test:runMain org.apache.spark.mllib.linalg.UDTSerializationBenchmark"
```
## How was this patch tested?
Manual tests.
Closes#22499 from seancxmao/SPARK-25489.
Authored-by: seancxmao <seancxmao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
change back the constructor to:
```
class LocalLDAModel private[ml] (
uid: String,
vocabSize: Int,
private[clustering] val oldLocalModel: OldLocalLDAModel,
sparkSession: SparkSession)
```
Although it is marked `private[ml]`, it is used in `mleap`, and the change on master breaks the `mleap` build.
See mleap code [here](c7860af328/mleap-spark/src/main/scala/org/apache/spark/ml/bundle/ops/clustering/LDAModelOp.scala (L57))
## How was this patch tested?
Manual.
Closes#22510 from WeichenXu123/LDA_fix.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
Currently there are two classes with the same name, BenchmarkBase:
1. `org.apache.spark.util.BenchmarkBase`
2. `org.apache.spark.sql.execution.benchmark.BenchmarkBase`
This is very confusing. And the benchmark object `org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark` is using the one in `org.apache.spark.util.BenchmarkBase`, while there is another class `BenchmarkBase` in its own package...
Here I propose:
1. The class `org.apache.spark.util.BenchmarkBase` belongs in the test package of the core module; move it to the package `org.apache.spark.benchmark`.
2. Move `org.apache.spark.util.Benchmark` to the test package of the core module as well, also into the package `org.apache.spark.benchmark`.
3. Rename the class `org.apache.spark.sql.execution.benchmark.BenchmarkBase` to `BenchmarkWithCodegen`.
## How was this patch tested?
Unit test
Closes#22513 from gengliangwang/refactorBenchmarkBase.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Improve testcase "image datasource test: read non image" to tolerate different schema representation.
Because file:/path and file:///path are both valid URI-ifications so in some environment the testcase will fail.
## How was this patch tested?
Manual.
Closes#22449 from WeichenXu123/image_url.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
On the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`.
## How was this patch tested?
N/A
Closes#22426 from gatorsmile/bumpVersionMaster.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception. This behavior change was introduced in 2.3.0. The change may break users' applications on upgrade, and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.
The PR removes the added check making `struct()` valid again.
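A minimal sketch of the behavior restored by removing the check (the DataFrame name is illustrative):
```scala
import org.apache.spark.sql.functions.struct

// struct() with no arguments is valid again and yields an empty struct column,
// which is what VectorAssembler relies on when inputCols is empty.
val withEmpty = df.withColumn("empty", struct())
```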
## How was this patch tested?
added UT
Closes#22373 from mgaido91/SPARK-25371.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Remove the `BisectingKMeansModel.setDistanceMeasure` method.
Setting this param on `BisectingKMeansModel` is meaningless.
## How was this patch tested?
N/A
Closes#22360 from WeichenXu123/bkmeans_update.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw more than 10% performance regression on the following queries: q67, q24a and q24b. After applying the PR https://github.com/apache/spark/pull/22338, the performance regression still existed. When npoggi and winglungngai reverted the changes in https://github.com/apache/spark/pull/19222, the performance regression was resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.
In the future release, we still can continue the investigation and find out the root cause of the regression.
## How was this patch tested?
The existing test cases
Closes#22361 from gatorsmile/revertMemoryBlock.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Deprecate public APIs from ImageSchema.
## How was this patch tested?
N/A
Closes#22349 from WeichenXu123/image_api_deprecate.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
In SharedSparkSession and TestHive, we need to disable the rule ConvertToLocalRelation for better test case coverage.
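For reference, the same exclusion can be expressed through the optimizer config (a sketch; the test harnesses set this on their own sessions):
```scala
// Exclude the rule so test queries exercise the full execution path instead
// of being collapsed into a LocalRelation during optimization.
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
```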
## How was this patch tested?
Identify the failures after excluding "ConvertToLocalRelation" rule.
Closes#22270 from dilipbiswal/SPARK-25267-final.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Upgrade chill to 0.9.3, Kryo to 4.0.2, to get bug fixes and improvements.
The resolved tickets include:
- SPARK-25258 Upgrade kryo package to version 4.0.2
- SPARK-23131 Kryo raises StackOverflow during serializing GLR model
- SPARK-25176 Kryo fails to serialize a parametrised type hierarchy
More details:
https://github.com/twitter/chill/releases/tag/v0.9.3
cc3910d501
## How was this patch tested?
Existing tests.
Closes#22179 from wangyum/SPARK-23131.
Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Implement an image schema datasource.
This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior as `ImageSchema.readImage`)
- path wildcard matching (the same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (different from `ImageSchema.readImage`; use a path such as `/path/to/dir/**`)
This datasource does **NOT** support:
- specifying `numPartitions` (it will be determined by the datasource automatically)
- sampling (you can use `df.sample` later, but the sampling operator won't be pushed down to the datasource)
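A usage sketch under the option names listed above (the path is illustrative):
```scala
// Load images recursively, dropping unreadable files; the number of
// partitions is chosen automatically by the datasource.
val images = spark.read.format("image")
  .option("dropImageFailures", "true")
  .load("/path/to/dir/**")
```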
## How was this patch tested?
Unit tests.
## Benchmark
I benchmarked and compared the time cost between the old `ImageSchema.read` API and the new image datasource.
**cluster**: 4 nodes, each with 64GB memory, 8 cores CPU
**test dataset**: Flickr8k_Dataset (about 8091 images)
**time cost**:
- My image datasource time (automatically generate 258 partitions): 38.04s
- `ImageSchema.read` time (set 16 partitions): 68.4s
- `ImageSchema.read` time (set 258 partitions): 90.6s
**time cost when increase image number by double (clone Flickr8k_Dataset and loads double number images)**:
- My image datasource time (automatically generate 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s
So we can see that the image datasource implementation (this PR) brings some performance improvement compared against the old `ImageSchema.read` API.
Closes#22328 from WeichenXu123/image_datasource.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
The PR adds the lift measure to Association rules. For reference, lift(X => Y) = confidence(X => Y) / support(Y); a lift above 1 means X and Y co-occur more often than if they were independent.
## How was this patch tested?
existing and modified UTs
Closes#22236 from mgaido91/SPARK-10697.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Currently, when FDR is used for `ChiSqSelector` and no feature is selected, an exception is thrown because the max operation fails on an empty collection.
The PR fixes the problem by handling this case and returning an empty array in that case, as sklearn (which was the reference for the initial implementation of FDR) does.
## How was this patch tested?
added UT
Closes#22303 from mgaido91/SPARK-25289.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR moves the compilation of the regexp for code formatting outside the method which is called for each code block when splitting expressions, in order to avoid recompiling the regexp every time.
Credit should be given to Izek Greenfield.
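A hedged sketch of the pattern (the names and regexp are illustrative, not the actual Spark code):
```scala
object CodeFormatter {
  // Compiled once when the object is initialized, instead of inside the
  // per-code-block method where it would be recompiled on every call.
  private val extraNewLinesRegex = """\n\s*\n""".r

  def collapseBlankLines(code: String): String =
    extraNewLinesRegex.replaceAllIn(code, "\n")
}
```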
## How was this patch tested?
existing UTs
Closes#22135 from mgaido91/SPARK-25093.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In MultilayerPerceptronClassifier, we use RDD operations to encode labels for now. I think we should use ML's OneHotEncoderEstimator/Model to do the encoding.
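A sketch of the proposed encoding step (the estimator API as it existed at the time; the column names are assumptions):
```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("label"))
  .setOutputCols(Array("labelVec"))
val encoded = encoder.fit(df).transform(df)
```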
## How was this patch tested?
Existing tests.
Closes#20232 from viirya/SPARK-23042.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
logNumExamples in KMeans/BiKM/GMM/AFT/NB
## How was this patch tested?
existing tests
Closes#21561 from zhengruifeng/alg_logNumExamples.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The output column from `CountVectorizerModel` lacks ML attributes, so a later transformer like `Interaction` can raise an error because no attribute is available.
## How was this patch tested?
Added test.
Closes#20313 from viirya/SPARK-22974.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell).
This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know.
## How was this patch tested?
### before
```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```
### after
```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```
Closes#22070 from seratch/fix-typo.
Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This is a follow-up pr of #22014 and #22039
We still have some more compilation errors in mllib with Scala 2.12 under sbt:
```
[error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala:116: match may not be exhaustive.
[error] It would fail on the following inputs: ("silhouette", _), (_, "cosine"), (_, "squaredEuclidean"), (_, String()), (_, _)
[error] [warn] ($(metricName), $(distanceMeasure)) match {
[error] [warn]
```
## How was this patch tested?
Existing UTs
Closes#22058 from kiszk/SPARK-25036c.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
Convert two function fields in ML classes to simple functions to avoid an odd SerializedLambda deserialization problem.
## How was this patch tested?
Existing tests.
Closes#22032 from srowen/SPARK-25047.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This PR fixes typos regarding `auxiliary verb + verb[s]`. This is a follow-on of #21956.
## How was this patch tested?
N/A
Closes#22040 from kiszk/spellcheck1.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Fixes for test issues that arose after Scala 2.12 support was added -- ones that only affect the 2.12 build.
## How was this patch tested?
Existing tests.
Closes#22004 from srowen/SPARK-25029.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
There are many warnings in the current build (for instance see https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
**common**:
```
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237: warning: [rawtypes] found raw type: LevelDBIterator
[warn] void closeIterator(LevelDBIterator it) throws IOException {
[warn] ^
[warn] missing type arguments for generic class LevelDBIterator<T>
[warn] where T is a type-variable:
[warn] T extends Object declared in class LevelDBIterator
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151: warning: [deprecation] group() in AbstractBootstrap has been deprecated
[warn] if (bootstrap != null && bootstrap.group() != null) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152: warning: [deprecation] group() in AbstractBootstrap has been deprecated
[warn] bootstrap.group().shutdownGracefully();
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154: warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155: warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
[warn] bootstrap.childGroup().shutdownGracefully();
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112: warning: [deprecation] PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in PooledByteBufAllocator has been deprecated
[warn] return new PooledByteBufAllocator(
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321: warning: [rawtypes] found raw type: Future
[warn] public void operationComplete(Future future) throws Exception {
[warn] ^
[warn] missing type arguments for generic class Future<V>
[warn] where V is a type-variable:
[warn] V extends Object declared in interface Future
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215: warning: [rawtypes] found raw type: StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, resp.streamId, resp.byteCount,
[warn] ^
[warn] missing type arguments for generic class StreamInterceptor<T>
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215: warning: [rawtypes] found raw type: StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, resp.streamId, resp.byteCount,
[warn] ^
[warn] missing type arguments for generic class StreamInterceptor<T>
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215: warning: [unchecked] unchecked call to StreamInterceptor(MessageHandler<T>,String,long,StreamCallback) as a member of the raw type StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, resp.streamId, resp.byteCount,
[warn] ^
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:255: warning: [rawtypes] found raw type: StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, wrappedCallback.getID(),
[warn] ^
[warn] missing type arguments for generic class StreamInterceptor<T>
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:255: warning: [rawtypes] found raw type: StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, wrappedCallback.getID(),
[warn] ^
[warn] missing type arguments for generic class StreamInterceptor<T>
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:255: warning: [unchecked] unchecked call to StreamInterceptor(MessageHandler<T>,String,long,StreamCallback) as a member of the raw type StreamInterceptor
[warn] StreamInterceptor interceptor = new StreamInterceptor(this, wrappedCallback.getID(),
[warn] ^
[warn] where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:270: warning: [deprecation] transfered() in FileRegion has been deprecated
[warn] region.transferTo(byteRawChannel, region.transfered());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:304: warning: [deprecation] transfered() in FileRegion has been deprecated
[warn] region.transferTo(byteChannel, region.transfered());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/test/java/org/apache/spark/network/ProtocolSuite.java:119: warning: [deprecation] transfered() in FileRegion has been deprecated
[warn] while (in.transfered() < in.count()) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/test/java/org/apache/spark/network/ProtocolSuite.java:120: warning: [deprecation] transfered() in FileRegion has been deprecated
[warn] in.transferTo(channel, in.transfered());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java:80: warning: [static] static method should be qualified by type name, Murmur3_x86_32, instead of by an expression
[warn] Assert.assertEquals(-300363099, hasher.hashUnsafeWords(bytes, offset, 16, 42));
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java:84: warning: [static] static method should be qualified by type name, Murmur3_x86_32, instead of by an expression
[warn] Assert.assertEquals(-1210324667, hasher.hashUnsafeWords(bytes, offset, 16, 42));
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java:88: warning: [static] static method should be qualified by type name, Murmur3_x86_32, instead of by an expression
[warn] Assert.assertEquals(-634919701, hasher.hashUnsafeWords(bytes, offset, 16, 42));
[warn] ^
```
**launcher**:
```
[warn] Pruning sources from previous analysis, due to incompatible CompileSetup.
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/launcher/src/main/java/org/apache/spark/launcher/AbstractLauncher.java:31: warning: [rawtypes] found raw type: AbstractLauncher
[warn] public abstract class AbstractLauncher<T extends AbstractLauncher> {
[warn] ^
[warn] missing type arguments for generic class AbstractLauncher<T>
[warn] where T is a type-variable:
[warn] T extends AbstractLauncher declared in class AbstractLauncher
```
**core**:
```
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:99: method group in class AbstractBootstrap is deprecated: see corresponding Javadoc for more information.
[warn] if (bootstrap != null && bootstrap.group() != null) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:100: method group in class AbstractBootstrap is deprecated: see corresponding Javadoc for more information.
[warn] bootstrap.group().shutdownGracefully()
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102: method childGroup in class ServerBootstrap is deprecated: see corresponding Javadoc for more information.
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:103: method childGroup in class ServerBootstrap is deprecated: see corresponding Javadoc for more information.
[warn] bootstrap.childGroup().shutdownGracefully()
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:151: reflective access of structural type member method getData should be enabled
[warn] by making the implicit value scala.language.reflectiveCalls visible.
[warn] This can be achieved by adding the import clause 'import scala.language.reflectiveCalls'
[warn] or by setting the compiler option -language:reflectiveCalls.
[warn] See the Scaladoc for value scala.language.reflectiveCalls for a discussion
[warn] why the feature should be explicitly enabled.
[warn] val rdd = sc.parallelize(1 to 1).map(concreteObject.getData)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member value innerObject2 should be enabled
[warn] by making the implicit value scala.language.reflectiveCalls visible.
[warn] val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member method getData should be enabled
[warn] by making the implicit value scala.language.reflectiveCalls visible.
[warn] val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/LocalSparkContext.scala:32: constructor Slf4JLoggerFactory in class Slf4JLoggerFactory is deprecated: see corresponding Javadoc for more information.
[warn] InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:218: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] assert(wrapper.stageAttemptId === stages.head.attemptId)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:261: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] stageAttemptId = stages.head.attemptId))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:287: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] stageAttemptId = stages.head.attemptId))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:471: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] stageAttemptId = stages.last.attemptId))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:966: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] listener.onTaskStart(SparkListenerTaskStart(dropped.stageId, dropped.attemptId, task))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:972: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] listener.onTaskEnd(SparkListenerTaskEnd(dropped.stageId, dropped.attemptId,
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:976: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] .taskSummary(dropped.stageId, dropped.attemptId, Array(0.25d, 0.50d, 0.75d))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:1146: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] SparkListenerTaskEnd(stage1.stageId, stage1.attemptId, "taskType", Success, tasks(1), null))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala:1150: value attemptId in class StageInfo is deprecated: Use attemptNumber instead
[warn] SparkListenerTaskEnd(stage1.stageId, stage1.attemptId, "taskType", Success, tasks(0), null))
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala:197: method transfered in trait FileRegion is deprecated: see corresponding Javadoc for more information.
[warn] while (region.transfered() < region.count()) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/core/src/test/scala/org/apache/spark/storage/DiskStoreSuite.scala:198: method transfered in trait FileRegion is deprecated: see corresponding Javadoc for more information.
[warn] region.transferTo(byteChannel, region.transfered())
[warn] ^
```
**sql**:
```
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:534: abstract type T is unchecked since it is eliminated by erasure
[warn] assert(partitioning.isInstanceOf[T])
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:534: abstract type T is unchecked since it is eliminated by erasure
[warn] assert(partitioning.isInstanceOf[T])
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala:323: inferred existential type Option[Class[_$1]]( forSome { type _$1 }), which cannot be expressed by wildcards, should be enabled
[warn] by making the implicit value scala.language.existentials visible.
[warn] This can be achieved by adding the import clause 'import scala.language.existentials'
[warn] or by setting the compiler option -language:existentials.
[warn] See the Scaladoc for value scala.language.existentials for a discussion
[warn] why the feature should be explicitly enabled.
[warn] val optClass = Option(collectionCls)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:226: warning: [deprecation] ParquetFileReader(Configuration,FileMetaData,Path,List<BlockMetaData>,List<ColumnDescriptor>) in ParquetFileReader has been deprecated
[warn] this.reader = new ParquetFileReader(
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:178: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] (descriptor.getType() == PrimitiveType.PrimitiveTypeName.INT32 ||
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:179: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] (descriptor.getType() == PrimitiveType.PrimitiveTypeName.INT64 &&
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:181: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] descriptor.getType() == PrimitiveType.PrimitiveTypeName.FLOAT ||
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:182: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] descriptor.getType() == PrimitiveType.PrimitiveTypeName.DOUBLE ||
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:183: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] descriptor.getType() == PrimitiveType.PrimitiveTypeName.BINARY))) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:198: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] switch (descriptor.getType()) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:221: warning: [deprecation] getTypeLength() in ColumnDescriptor has been deprecated
[warn] readFixedLenByteArrayBatch(rowId, num, column, descriptor.getTypeLength());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:224: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] throw new IOException("Unsupported type: " + descriptor.getType());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:246: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] descriptor.getType().toString(),
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:258: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] switch (descriptor.getType()) {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:384: warning: [deprecation] getType() in ColumnDescriptor has been deprecated
[warn] throw new UnsupportedOperationException("Unsupported type: " + descriptor.getType());
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java:458: warning: [static] static variable should be qualified by type name, BaseRepeatedValueVector, instead of by an expression
[warn] int index = rowId * accessor.OFFSET_WIDTH;
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java:460: warning: [static] static variable should be qualified by type name, BaseRepeatedValueVector, instead of by an expression
[warn] int end = offsets.getInt(index + accessor.OFFSET_WIDTH);
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/BenchmarkQueryTest.scala:57: a pure expression does nothing in statement position; you may be omitting necessary parentheses
[warn] case s => s
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala:182: inferred existential type org.apache.parquet.column.statistics.Statistics[?0]( forSome { type ?0 <: Comparable[?0] }), which cannot be expressed by wildcards, should be enabled
[warn] by making the implicit value scala.language.existentials visible.
[warn] This can be achieved by adding the import clause 'import scala.language.existentials'
[warn] or by setting the compiler option -language:existentials.
[warn] See the Scaladoc for value scala.language.existentials for a discussion
[warn] why the feature should be explicitly enabled.
[warn] val columnStats = oneBlockColumnMeta.getStatistics
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala:146: implicit conversion method conv should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scaladoc for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def conv(x: (Int, Long)): KV = KV(x._1, x._2)
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleSuite.scala:48: implicit conversion method unsafeRow should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] private implicit def unsafeRow(value: Int) = {
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala:178: method getType in class ColumnDescriptor is deprecated: see corresponding Javadoc for more information.
[warn] assert(oneFooter.getFileMetaData.getSchema.getColumns.get(0).getType() ===
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTest.scala:154: method readAllFootersInParallel in object ParquetFileReader is deprecated: see corresponding Javadoc for more information.
[warn] ParquetFileReader.readAllFootersInParallel(configuration, fs.getFileStatus(path)).asScala.toSeq
[warn] ^
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/hive/src/test/java/org/apache/spark/sql/hive/test/Complex.java:679: warning: [cast] redundant cast to Complex
[warn] Complex typedOther = (Complex)other;
[warn] ^
```
**mllib**:
```
[warn] Pruning sources from previous analysis, due to incompatible CompileSetup.
[warn] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala:597: match may not be exhaustive.
[warn] It would fail on the following inputs: None, Some((x: Tuple2[?, ?] forSome x not in (?, ?)))
[warn] val df = dfs.find {
[warn] ^
```
This PR does not aim to fix all of them, since some look pretty tricky to fix and there are too many warnings, including false positives (like a deprecated API that is exercised by its own test, etc.).
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21975 from HyukjinKwon/remove-build-warnings.
## What changes were proposed in this pull request?
This PR addresses issues 2,3 in this [document](https://docs.google.com/document/d/1fbkjEL878witxVQpOCbjlvOvadHtVjYXeB-2mgzDTvk).
* We modified the closure cleaner to identify closures that are implemented via the LambdaMetaFactory mechanism (serializedLambdas) (issue2).
* We also fix the issue caused by scala/bug#11016. There are two options for solving the Unit issue: either add () at the end of the closure, or use the trick described in the doc; otherwise overloading resolution does not work (we are not going to eliminate either of the methods). The compiler tries to adapt to Unit and makes both methods candidates for overloading; when the overload is polymorphic there is no ambiguity (that is the workaround implemented). This does not look that good, but it serves its purpose, as we need to support two different uses of the method `addTaskCompletionListener`: one that passes a TaskCompletionListener and one that passes a closure that is wrapped with a TaskCompletionListener later on (issue3).
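For reference, here is a minimal sketch of that polymorphic-overload workaround, using stand-in types rather than the actual Spark signatures:
```
trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

abstract class TaskContext {
  // Overload taking a listener object.
  def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext

  // Overload taking a closure. Making the result type polymorphic (U) stops
  // the Scala 2.12 compiler from adapting a `TaskContext => Unit` lambda to
  // both overloads, so overload resolution stays unambiguous.
  def addTaskCompletionListener[U](f: TaskContext => U): TaskContext
}
```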
Note: regarding issue 1 in the doc the plan is:
> Do Nothing. Don’t try to fix this as this is only a problem for Java users who would want to use 2.11 binaries. In that case they can cast to MapFunction to be able to utilize lambdas. In Spark 3.0.0 the API should be simplified so that this issue is removed.
## How was this patch tested?
This was manually tested:
```
./dev/change-scala-version.sh 2.12
./build/mvn -DskipTests -Pscala-2.12 clean package
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.serializer.ProactiveClosureSerializationSuite -Dtest=None
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.util.ClosureCleanerSuite -Dtest=None
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.streaming.DStreamClosureSuite -Dtest=None
```
Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Closes#21930 from skonto/scala2.12-sup.
## What changes were proposed in this pull request?
ClusteringEvaluator support array input
## How was this patch tested?
added tests
Author: zhengruifeng <ruifengz@foxmail.com>
Closes#21563 from zhengruifeng/clu_eval_support_array.
## What changes were proposed in this pull request?
In most cases, we should use `spark.sessionState.newHadoopConf()` instead of `sparkContext.hadoopConfiguration`, so that the hadoop configurations specified in Spark session
configuration will come into effect.
Add a rule matching `spark.sparkContext.hadoopConfiguration` or `spark.sqlContext.sparkContext.hadoopConfiguration` to prevent the usage.
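A sketch of the preferred pattern, assuming a `SparkSession` named `spark` is in scope (note `sessionState` is Spark-internal API):
```
// Preferred inside Spark SQL code: the returned Configuration picks up
// Hadoop settings specified in the Spark session configuration.
val hadoopConf = spark.sessionState.newHadoopConf()

// Discouraged: bypasses per-session Hadoop settings.
// val hadoopConf = spark.sparkContext.hadoopConfiguration
```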
## How was this patch tested?
Unit test
Author: Gengliang Wang <gengliang.wang@databricks.com>
Closes#21873 from gengliangwang/linterRule.
## What changes were proposed in this pull request?
Followup for #21719.
Update spark.ml training code to fully wrap instrumented methods and remove old instrumentation APIs.
## How was this patch tested?
existing tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#21799 from MrBago/new-instrumentation-apis2.
## What changes were proposed in this pull request?
Deprecate `KMeans.computeCost` which was introduced as a temp fix and now it is not needed anymore, since we introduced `ClusteringEvaluator`.
## How was this patch tested?
manual test (deprecation warning displayed)
Scala
```
...
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for details
res1: Double = 0.0
```
Python
```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
" instead.", DeprecationWarning)
```
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20629 from mgaido91/SPARK-23451.
## What changes were proposed in this pull request?
As stated in https://github.com/apache/spark/pull/21321, in the error messages we should use `catalogString`. This is not the case, as SPARK-22893 used `simpleString` in order to have the same representation everywhere and it missed some places.
The PR unifies the messages, always using the `catalogString` representation of the dataTypes.
## How was this patch tested?
existing/modified UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21804 from mgaido91/SPARK-24268_catalog.
## What changes were proposed in this pull request?
This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When these APIs are merged, I'll followup with a PR to update the training code to use these new APIs so we can remove the old APIs. These changes are all to private APIs so this PR doesn't make any user facing changes.
## How was this patch tested?
Existing tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#21719 from MrBago/new-instrumentation-apis.
When invoking MatrixFactorizationModel.recommendProducts(Int, Int) with a non-existing user, a java.util.NoSuchElementException is thrown:
> java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
## What changes were proposed in this pull request?
Throw a better exception, like "user-id/product-id not found in the model", for a non-existent user/product.
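A minimal sketch of such a guard, assuming `userFeatures: RDD[(Int, Array[Double])]` as held by the model:
```
// Fail with a descriptive error instead of letting `head` on an empty
// Seq surface as java.util.NoSuchElementException.
val userVector: Array[Double] = userFeatures.lookup(user).headOption.getOrElse {
  throw new IllegalArgumentException(s"userId $user not found in the model")
}
```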
## How was this patch tested?
Added UT
Author: Shahid <shahidki31@gmail.com>
Closes#21740 from shahidki31/checkInvalidUserProduct.
## What changes were proposed in this pull request?
unpersist `instances` after training
## How was this patch tested?
existing tests
Author: 郑瑞峰 <zhengruifeng@ZBMAC-C02VX5XWH.local>
Closes#21562 from zhengruifeng/gmm_unpersist.
## What changes were proposed in this pull request?
Use longs in calculating min hash to avoid bias due to int overflow.
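A minimal sketch of the idea, assuming the multiply-add hash form used by MinHashLSH (names and the prime constant are taken as assumptions here):
```
val HashPrime = 2038074743L  // assumed to match MinHashLSH.HASH_PRIME

// Doing the arithmetic in Long avoids Int overflow (and the resulting bias)
// when (1 + elem) * a exceeds Int.MaxValue.
def minHash(elems: Array[Int], a: Long, b: Long): Double =
  elems.map(e => ((1L + e) * a + b) % HashPrime).min.toDouble
```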
## How was this patch tested?
Existing tests.
Author: Sean Owen <srowen@gmail.com>
Closes#21750 from srowen/SPARK-24754.
## What changes were proposed in this pull request?
Added the number of iterations in `ClusteringSummary`. This is helpful information for evaluating how to modify the parameters in order to get a better model.
## How was this patch tested?
modified existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20701 from mgaido91/SPARK-23528.
## What changes were proposed in this pull request?
This PR enables a Java bytecode check tool, [spotbugs](https://spotbugs.github.io/), to avoid possible integer overflow at multiplication. When a violation is detected, the build process is stopped.
Due to tool limitations, some other checks are enabled as well. In this PR, [these patterns](http://spotbugs-in-kengo-toda.readthedocs.io/en/lqc-list-detectors/detectors.html#findpuzzlers) in `FindPuzzlers` can be detected.
This check is enabled at `compile` phase. Thus, `mvn compile` or `mvn package` launches this check.
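The class of bug being guarded against, sketched in Scala:
```
val numRows = 100000
val bytesPerRow = 65536

// Flagged: the multiplication happens in Int and overflows before the
// result is widened to Long (yields a negative number here).
val totalBad: Long = numRows * bytesPerRow

// Fix: widen one operand first so the multiplication happens in Long.
val totalOk: Long = numRows.toLong * bytesPerRow
```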
## How was this patch tested?
Existing UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21542 from kiszk/SPARK-24529.
## What changes were proposed in this pull request?
SPARK-22893 tried to unify error messages about dataTypes. Unfortunately, many places were still missing the `simpleString` method needed to have the same representation everywhere.
The PR unified the messages, always using the simpleString representation of the dataTypes.
## How was this patch tested?
existing/modified UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21321 from mgaido91/SPARK-24268.
## What changes were proposed in this pull request?
Currently the power iteration clustering test in spark.ml maps the results to the labels 0 and 1 for assertion. Since the clustering output need not match the mapped labels, the test case may fail. Even if it maps correctly, we theoretically cannot guarantee which set belongs to which cluster label: KMeans can assign label 0 to either of the sets.
PowerIterationClusteringSuite in the MLLib checks the clustering results without mapping to the particular cluster label, as shown below.
```
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
  predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 until n1).toSet, (n1 until n).toSet))
```
## How was this patch tested?
Existing tests
Author: Shahid <shahidki31@gmail.com>
Closes#21689 from shahidki31/picTestSuiteMinorCorrection.
## What changes were proposed in this pull request?
[SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712)
spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml and override repr in pyspark.
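A hedged sketch of what such an override could look like in spark.ml (the fields printed are illustrative):
```
// Inside LogisticRegressionModel:
override def toString: String =
  s"LogisticRegressionModel: uid=$uid, numClasses=$numClasses, numFeatures=$numFeatures"
```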
## How was this patch tested?
LogisticRegressionSuite.scala
Python doctest in pyspark.ml.classification.py
Author: bravo-zhang <mzhang1230@gmail.com>
Closes#18826 from bravo-zhang/spark-14712.
## What changes were proposed in this pull request?
When a user creates an aggregator object in Scala and passes the aggregator to Spark Dataset's agg() method, Spark will initialize TypedAggregateExpression with the nodeName field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment, depending on how the user creates the aggregator object. For example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw java.lang.InternalError "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and discussed in many Scala upstream JIRAs such as SI-8110, SI-5425.
To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 to add a safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.
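A minimal sketch of such a helper, following the scalatest workaround (the fallback parsing shown is illustrative, not the exact util added here):
```
def safeSimpleName(cls: Class[_]): String =
  try {
    cls.getSimpleName
  } catch {
    // Scala-generated names like "MyUtils$myAgg$2$" can trigger this.
    case _: InternalError =>
      cls.getName.split('.').last.stripSuffix("$")
  }
```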
## How was this patch tested?
added unit test
Author: Fangshi Li <fli@linkedin.com>
Closes#21276 from fangshil/SPARK-24216.
## What changes were proposed in this pull request?
Add locale support for `StopWordsRemover`.
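Hedged usage sketch of the new param:
```
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setLocale("tr")  // e.g. Turkish, where case folding of 'I' differs
```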
## How was this patch tested?
[Scala|Python] unit tests.
Author: Lee Dongjin <dongjin@apache.org>
Closes#21501 from dongjinleekr/feature/SPARK-15064.
## What changes were proposed in this pull request?
According to the discussion on JIRA, I rewrote the Power Iteration Clustering API in `spark.ml`.
## How was this patch tested?
Unit test.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21493 from WeichenXu123/pic_api.
## What changes were proposed in this pull request?
Use different RNGs in different partitions.
## How was this patch tested?
manually
Author: Lu WANG <lu.wang@databricks.com>
Closes#21492 from ludatabricks/SPARK-24300.
## What changes were proposed in this pull request?
Extend instrumentation.logNamedValue to support Array input.
Change the logging for "clusterSizes" to use the new method.
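Hedged sketch of the new overload in use, inside an estimator's `fit` where an `Instrumentation` instance `instr` and a training `summary` are in scope:
```
// Assumed signature: logNamedValue(name: String, value: Array[Long])
instr.logNamedValue("clusterSizes", summary.clusterSizes)
```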
## How was this patch tested?
N/A
Author: Lu WANG <lu.wang@databricks.com>
Closes#21347 from ludatabricks/SPARK-24290.
## What changes were proposed in this pull request?
spark.ml parity for sequential pattern mining - PrefixSpan: Python API
## How was this patch tested?
doctests
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21265 from WeichenXu123/prefix_span_py.
## What changes were proposed in this pull request?
Add fit with validation set to spark.ml GBT
## How was this patch tested?
Will add later.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21129 from WeichenXu123/gbt_fit_validation.
## What changes were proposed in this pull request?
Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
This PR is a new version of https://github.com/apache/spark/pull/20319
Author: Sandor Murakozi <smurakozi@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#21358 from jkbradley/smurakozi-SPARK-22884.
## What changes were proposed in this pull request?
Have FPGrowth keep track of model training using the Instrumentation class.
## How was this patch tested?
manually
Author: Bago Amirbekian <bago@databricks.com>
Closes#21344 from MrBago/fpgrowth-instr.
## What changes were proposed in this pull request?
Fixes to tuning instrumentation.
## How was this patch tested?
Existing tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#21340 from MrBago/tunning-instrumentation.
## What changes were proposed in this pull request?
- Add seed parameter for variationalTopicInference
- Add seed for calling variationalTopicInference in submitMiniBatch
- Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution.
## How was this patch tested?
Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed.
Author: Lu WANG <lu.wang@databricks.com>
Closes#21183 from ludatabricks/SPARK-22210.
## What changes were proposed in this pull request?
Changed the instrumentation for all of the clustering methods.
## How was this patch tested?
N/A
Author: Lu WANG <lu.wang@databricks.com>
Closes#21218 from ludatabricks/SPARK-23686-1.
## What changes were proposed in this pull request?
I think the `n_t+t` in the following scaladoc may be wrong; it should be `n_t+1`, i.e. the number of points assigned to the cluster after the (t+1)-th mini-batch finishes:
$$
\begin{align}
c_{t+1} &= [(c_t \cdot n_t \cdot a) + (x_t \cdot m_t)] / [n_t + m_t] \\
n_{t+t} &= n_t \cdot a + m_t
\end{align}
$$
Author: Fan Donglai <ddna_1022@163.com>
Closes#21179 from ddna1021/master.
## What changes were proposed in this pull request?
Provide evaluateEachIteration method or equivalent for spark.ml GBTs.
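Hedged usage sketch (the loss argument for the regressor is assumed):
```
import org.apache.spark.ml.regression.GBTRegressor

// train/validation: DataFrames with label/features columns
val model = new GBTRegressor().setMaxIter(50).fit(train)
// One entry per boosting iteration; useful for spotting where
// validation error bottoms out.
val errorPerIteration: Array[Double] = model.evaluateEachIteration(validation, "squared")
```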
## How was this patch tested?
UT.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21097 from WeichenXu123/GBTeval.
## What changes were proposed in this pull request?
- Add OptionalInstrumentation as argument for getNumClasses in ml.classification.Classifier
- Change the function call for getNumClasses in train() in ml.classification.DecisionTreeClassifier, ml.classification.RandomForestClassifier, and ml.classification.NaiveBayes
- Modify the instrumentation creation in ml.classification.LinearSVC
- Change the log call in ml.classification.OneVsRest and ml.classification.LinearSVC
## How was this patch tested?
Manual.
Author: Lu WANG <lu.wang@databricks.com>
Closes#21204 from ludatabricks/SPARK-23686.
## What changes were proposed in this pull request?
Add support for all of the clustering methods
## How was this patch tested?
unit tests added
Author: Lu WANG <lu.wang@databricks.com>
Closes#21195 from ludatabricks/SPARK-23975-1.
## What changes were proposed in this pull request?
PrefixSpan API for spark.ml. New implementation instead of #20810
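Hedged usage sketch of the new spark.ml API (column name is illustrative):
```
import org.apache.spark.ml.fpm.PrefixSpan

val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setSequenceCol("sequence")  // column holding sequences of itemsets
  .findFrequentSequentialPatterns(df)
```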
## How was this patch tested?
TestSuite added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20973 from WeichenXu123/prefixSpan2.
## What changes were proposed in this pull request?
ML test for StructuredStreaming: spark.ml.tuning
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20261 from WeichenXu123/ml_stream_tuning_test.
## What changes were proposed in this pull request?
Change FPGrowth from private to private[spark]. If no numPartitions is specified, the default value -1 is used. But -1 is only valid in the constructor of FPGrowth, not in setNumPartitions. So I make this change and use the constructor directly rather than the set method.
## How was this patch tested?
Unit test is added
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13493 from zjffdu/SPARK-15750.
## What changes were proposed in this pull request?
Instrumentation logging improvements - ML regression package
I add an `OptionalInstrument` class which is used in `WeightLeastSquares` and `IterativelyReweightedLeastSquares`.
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21078 from WeichenXu123/inst_reg.
## What changes were proposed in this pull request?
We save ML's user-supplied params and default params as one entity in metadata. When loading saved models, we set all the loaded params on the created ML model instances as user-supplied params.
It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading.
The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata.
For backward compatibility, when loading metadata, if it is a metadata file from previous Spark, we shouldn't raise error if we can't find the default param field.
## How was this patch tested?
Pass existing tests and added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20633 from viirya/save-ml-default-params.
## What changes were proposed in this pull request?
- Multiple possible input types is added in validateAndTransformSchema() and computeCost() while checking column type
- Add if statement in transform() to support array type as featuresCol
- Add the case statement in fit() while selecting columns from dataset
These changes will be applied to KMeans first, then to other clustering methods.
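Hedged usage sketch, assuming a SparkSession `spark`:
```
import org.apache.spark.ml.clustering.KMeans
import spark.implicits._

// features supplied as array<double> instead of a Vector column
val df = Seq(
  Tuple1(Array(0.0, 0.0)),
  Tuple1(Array(9.0, 9.0))
).toDF("features")
val model = new KMeans().setK(2).fit(df)
```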
## How was this patch tested?
unit test is added
Author: Lu WANG <lu.wang@databricks.com>
Closes#21081 from ludatabricks/SPARK-23975.
## What changes were proposed in this pull request?
Adding PMML export to Spark ML's KMeans Model.
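Hedged sketch of the export call (format name assumed from the DataSource-style writer API):
```
// kmeansModel: a fitted org.apache.spark.ml.clustering.KMeansModel
kmeansModel.write.format("pmml").save("/tmp/kmeans-pmml")
```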
## How was this patch tested?
New unit test for Spark ML PMML export based on the old Spark MLlib unit test.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#20907 from holdenk/SPARK-11237-Add-PMML-Export-for-KMeans.
## What changes were proposed in this pull request?
It is reported by Spark users that the deviance calculation for Poisson regression does not handle y = 0. Thus, the correct model summary cannot be obtained. The user has confirmed that the issue is in
```
override def deviance(y: Double, mu: Double, weight: Double): Double = {
  2.0 * weight * (y * math.log(y / mu) - (y - mu))
}
```
when y = 0.
The user also mentioned there are many other places he believes we should check for the same thing. However, no other changes are needed, including for the Gamma distribution.
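A minimal sketch of a y = 0 safe deviance, using the convention 0 · log(0) = 0:
```
override def deviance(y: Double, mu: Double, weight: Double): Double = {
  // y * log(y / mu) is NaN at y = 0; define the limit as 0 instead.
  val ylogy = if (y == 0.0) 0.0 else y * math.log(y / mu)
  2.0 * weight * (ylogy - (y - mu))
}
```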
## How was this patch tested?
Add a comparison with R deviance calculation to the existing unit test.
Author: Teng Peng <josephtengpeng@gmail.com>
Closes#21125 from tengpeng/Spark24024GLM.
## What changes were proposed in this pull request?
This PR adds PowerIterationClustering as a Transformer to spark.ml. In the transform method, it calls spark.mllib's PowerIterationClustering.run() method and transforms the return value assignments (the Kmeans output of the pseudo-eigenvector) as a DataFrame (id: LongType, cluster: IntegerType).
This PR is copied and modified from https://github.com/apache/spark/pull/15770 The primary author is wangmiao1981
## How was this patch tested?
This PR has 2 types of tests:
* Copies of tests from spark.mllib's PIC tests
* New tests specific to the spark.ml APIs
Author: wm624@hotmail.com <wm624@hotmail.com>
Author: wangmiao1981 <wm624@hotmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#21090 from jkbradley/wangmiao1981-pic.
## What changes were proposed in this pull request?
Add python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting.
## How was this patch tested?
UT added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19627 from WeichenXu123/expose-model-list-py.
add RawPrediction as output column
add numClasses and numFeatures to OneVsRestModel
## What changes were proposed in this pull request?
- Add two vals, numClasses and numFeatures, in OneVsRestModel so that we can inherit from Classifier in the future
- Add a rawPrediction output column in transform; the prediction label is calculated from the rawPrediction as in raw2prediction
## How was this patch tested?
Author: Lu WANG <lu.wang@databricks.com>
Closes#21044 from ludatabricks/SPARK-9312.
## What changes were proposed in this pull request?
Many suites currently leak Spark sessions (sometimes with stopped SparkContexts) via the thread-local active Spark session and default Spark session. We should attempt to clean these up and detect when this happens to improve the reproducibility of tests.
## How was this patch tested?
Existing tests
Author: Eric Liang <ekl@databricks.com>
Closes#21058 from ericl/clear-session.
## What changes were proposed in this pull request?
fix build for scala-2.12
## How was this patch tested?
Manual.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21051 from WeichenXu123/fix_build212.
## What changes were proposed in this pull request?
Adds structured streaming tests using testTransformer for these suites:
* IDF
* Imputer
* Interaction
* MaxAbsScaler
* MinHashLSH
* MinMaxScaler
* NGram
## How was this patch tested?
It is a bunch of tests!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#20964 from jkbradley/SPARK-22883-part2.
## What changes were proposed in this pull request?
Add two set methods for LSHModel in LSH.scala, BucketedRandomProjectionLSH.scala, and MinHashLSH.scala
## How was this patch tested?
New test for the param setup was added into
- BucketedRandomProjectionLSHSuite.scala
- MinHashLSHSuite.scala
Author: Lu WANG <lu.wang@databricks.com>
Closes#21015 from ludatabricks/SPARK-23944.
## What changes were proposed in this pull request?
add python api for VectorAssembler handleInvalid
## How was this patch tested?
Add doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21003 from huaxingao/spark-23871.
## What changes were proposed in this pull request?
Kolmogorov-Smirnov test Python API in `pyspark.ml`
**Note**: the API with `CDF` is a little difficult to support in Python. We can add it in a following PR.
## How was this patch tested?
doctest
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20904 from WeichenXu123/ks-test-py.
## What changes were proposed in this pull request?
unpersist the last cached nodeIdsForInstances in `deleteAllCheckpoints`
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#20956 from zhengruifeng/NodeIdCache_cleanup.
## What changes were proposed in this pull request?
This is a follow-up pr of #19108 which broke Scala 2.12 build.
```
[error] /Users/ueshin/workspace/apache-spark/spark/mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala:86: overloaded method value test with alternatives:
[error] (dataset: org.apache.spark.sql.DataFrame,sampleCol: String,cdf: org.apache.spark.api.java.function.Function[java.lang.Double,java.lang.Double])org.apache.spark.sql.DataFrame <and>
[error] (dataset: org.apache.spark.sql.DataFrame,sampleCol: String,cdf: scala.Double => scala.Double)org.apache.spark.sql.DataFrame
[error] cannot be applied to (org.apache.spark.sql.DataFrame, String, scala.Double => java.lang.Double)
[error] test(dataset, sampleCol, (x: Double) => cdf.call(x))
[error] ^
[error] one error found
```
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#20994 from ueshin/issues/SPARK-21898/fix_scala-2.12.
## What changes were proposed in this pull request?
Initial PR for Instrumentation improvements: UUID and logging levels.
This PR takes over #20837.
Closes#20837
## How was this patch tested?
Manual.
Author: Bago Amirbekian <bago@databricks.com>
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20982 from WeichenXu123/better-instrumentation.
## What changes were proposed in this pull request?
`handleInvalid` Param was forwarded to the VectorAssembler used by RFormula.
## How was this patch tested?
added a test and ran all tests for RFormula and VectorAssembler
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Closes#20970 from yogeshg/spark_23562.
## What changes were proposed in this pull request?
This PR allows us to use one of several types of `MemoryBlock`, such as byte array, int array, long array, or `java.nio.DirectByteBuffer`. Using `java.nio.DirectByteBuffer` allows off-heap memory which is automatically deallocated by the JVM. The `MemoryBlock` class has primitive accessors like `Platform.getInt()`, `Platform.putInt()`, or `Platform.copyMemory()`.
This PR uses `MemoryBlock` for `OffHeapColumnVector`, `UTF8String`, and other places. This PR can improve performance of operations involving memory accesses (e.g. `UTF8String.trim`) by 1.8x.
For now, this PR does not use `MemoryBlock` for `BufferHolder` based on cloud-fan's [suggestion](https://github.com/apache/spark/pull/11494#issuecomment-309694290).
Since this PR is a successor of #11494, close#11494. Much of the code was ported from #11494, and much effort was put in here. **I think this PR should be credited to yzotov.**
This PR can achieve **1.1-1.4x performance improvements** for operations in `UTF8String` or `Murmur3_x86_32`. Other operations are almost comparable performances.
Without this PR
```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Murmur3_x86_32 526 / 536 0.0 131399881.5 1.0X
UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
hashCode 525 / 552 1022.6 1.0 1.0X
substring 414 / 423 1298.0 0.8 1.3X
```
With this PR
```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Murmur3_x86_32 474 / 488 0.0 118552232.0 1.0X
UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
hashCode 476 / 480 1127.3 0.9 1.0X
substring 287 / 291 1869.9 0.5 1.7X
```
Benchmark program
```
test("benchmark Murmur3_x86_32") {
val length = 8192 * 32768 + 31
val seed = 42L
val iters = 1 << 2
val random = new Random(seed)
val arrays = Array.fill[MemoryBlock](numArrays) {
val bytes = new Array[Byte](length)
random.nextBytes(bytes)
new ByteArrayMemoryBlock(bytes, Platform.BYTE_ARRAY_OFFSET, length)
}
val benchmark = new Benchmark("Hash byte arrays with length " + length,
iters * numArrays, minNumIters = 20)
benchmark.addCase("HiveHasher") { _: Int =>
var sum = 0L
for (_ <- 0L until iters) {
sum += HiveHasher.hashUnsafeBytesBlock(
arrays(i), Platform.BYTE_ARRAY_OFFSET, length)
}
}
benchmark.run()
}
test("benchmark UTF8String") {
val N = 512 * 1024 * 1024
val iters = 2
val benchmark = new Benchmark("UTF8String benchmark", N, minNumIters = 20)
val str0 = new java.io.StringWriter() { { for (i <- 0 until N) { write(" ") } } }.toString
val s0 = UTF8String.fromString(str0)
benchmark.addCase("hashCode") { _: Int =>
var h: Int = 0
for (_ <- 0L until iters) { h += s0.hashCode }
}
benchmark.addCase("substring") { _: Int =>
var s: UTF8String = null
for (_ <- 0L until iters) { s = s0.substring(N / 2 - 5, N / 2 + 5) }
}
benchmark.run()
}
```
I ran [this benchmark program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) using [the commit](ee5a79861c). I got the following results:
```
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Memory access benchmarks: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ByteArrayMemoryBlock get/putInt() 220 / 221 609.3 1.6 1.0X
Platform get/putInt(byte[]) 220 / 236 610.9 1.6 1.0X
Platform get/putInt(Object) 492 / 494 272.8 3.7 0.4X
OnHeapMemoryBlock get/putLong() 322 / 323 416.5 2.4 0.7X
long[] 221 / 221 608.0 1.6 1.0X
Platform get/putLong(long[]) 321 / 321 418.7 2.4 0.7X
Platform get/putLong(Object) 561 / 563 239.2 4.2 0.4X
```
I also ran [this benchmark program](https://gist.github.com/kiszk/5fdb4e03733a5d110421177e289d1fb5) for comparing the performance of `Platform.copyMemory()`.
```
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz
Platform copyMemory: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
Object to Object 1961 / 1967 8.6 116.9 1.0X
System.arraycopy Object to Object 1917 / 1921 8.8 114.3 1.0X
byte array to byte array 1961 / 1968 8.6 116.9 1.0X
System.arraycopy byte array to byte array 1909 / 1937 8.8 113.8 1.0X
int array to int array 1921 / 1990 8.7 114.5 1.0X
double array to double array 1918 / 1923 8.7 114.3 1.0X
Object to byte array 1961 / 1967 8.6 116.9 1.0X
Object to short array 1965 / 1972 8.5 117.1 1.0X
Object to int array 1910 / 1915 8.8 113.9 1.0X
Object to float array 1971 / 1978 8.5 117.5 1.0X
Object to double array 1919 / 1944 8.7 114.4 1.0X
byte array to Object 1959 / 1967 8.6 116.8 1.0X
int array to Object 1961 / 1970 8.6 116.9 1.0X
double array to Object 1917 / 1924 8.8 114.3 1.0X
```
These results show the following facts:
1. According to the second/third or sixth/seventh results in the first experiment, if we use `Platform.get/putInt(Object)`, we achieve more than 2x worse performance than `Platform.get/putInt(byte[])` with concrete type (i.e. `byte[]`).
2. According to the second/third or fourth/fifth/sixth results in the first experiment, the fastest way to access an array element on Java heap is `array[]`. **Cons of `array[]` is that it is not possible to support unaligned-8byte access.**
3. According to the first/second/third or fourth/sixth/seventh results in the first experiment, `getInt()/putInt() or getLong()/putLong()` in subclasses of `MemoryBlock` can achieve comparable performance to `Platform.get/putInt()` or `Platform.get/putLong()` with concrete type (second or sixth result). There is no overhead regarding virtual call.
4. According to results in the second experiment, for `Platform.copy()`, to pass `Object` can achieve the same performance as to pass any type of primitive array as source or destination.
5. According to second/fourth results in the second experiment, `Platform.copy()` can achieve the same performance as `System.arrayCopy`. **It would be good to use `Platform.copy()` since `Platform.copy()` can take any types for src and dst.**
We will incrementally replace `Platform.get/putXXX` with `MemoryBlock.get/putXXX`, because it has two advantages:
1) Achieve better performance due to having a concrete type for an array.
2) Use simple OO design instead of passing `Object`.
It is easy to use `MemoryBlock` in `InternalRow`, `BufferHolder`, `TaskMemoryManager`, and others that are already abstracted. It is not easy to use `MemoryBlock` in utility classes related to hashing or others.
Other candidates are
- UnsafeRow, UnsafeArrayData, UnsafeMapData, SpecificUnsafeRowJoiner
- UTF8StringBuffer
- BufferHolder
- TaskMemoryManager
- OnHeapColumnVector
- BytesToBytesMap
- CachedBatch
- classes for hash
- others.
## How was this patch tested?
Added `UnsafeMemoryAllocator`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19222 from kiszk/SPARK-10399.
## What changes were proposed in this pull request?
Introduce a `handleInvalid` parameter in `VectorAssembler` that can take the options `"keep"`, `"skip"`, `"error"`. "error" throws an error on seeing a row containing a `null`, "skip" filters out all such rows, and "keep" appends the relevant number of NaN values. For "keep", the transformer inspects an example row to determine how many NaNs should be added, and throws an error when no such number can be inferred.
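Hedged usage sketch:
```
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")
  .setHandleInvalid("skip")  // or "error" (the default) or "keep"
```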
## How was this patch tested?
Unit tests are added to check the behavior of `assemble` on specific rows and the transformer is called on `DataFrame`s of different configurations to test different corner cases.
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Author: Bago Amirbekian <bago@databricks.com>
Author: Yogesh Garg <1059168+yogeshg@users.noreply.github.com>
Closes#20829 from yogeshg/rformula_handleinvalid.
## What changes were proposed in this pull request?
The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.
## How was this patch tested?
add test
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#20777 from huaxingao/spark-23615.
## What changes were proposed in this pull request?
Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats.
Includes a specific implementation for Spark ML linear regression PMML export.
In addition to adding PMML to reach parity with our current MLlib implementation, this approach will allow other libraries & formats (like PFA) to implement and export models with a unified API.
## How was this patch tested?
Basic unit test.
Author: Holden Karau <holdenkarau@google.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes#19876 from holdenk/SPARK-11171-SPARK-11237-Add-PMML-export-for-ML-KMeans-r2.
## What changes were proposed in this pull request?
Fix lint-java from https://github.com/apache/spark/pull/19108 addition of JavaKolmogorovSmirnovTestSuite
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#20875 from jkbradley/kstest-lint-fix.
## What changes were proposed in this pull request?
Support prediction on single instance for regression and classification related models (i.e., PredictionModel, ClassificationModel and their sub classes).
Add corresponding test cases.
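Hedged usage sketch, for a fitted model that extends PredictionModel:
```
import org.apache.spark.ml.linalg.Vectors

// Predict on one feature vector without building a single-row DataFrame.
val prediction: Double = model.predict(Vectors.dense(1.0, 2.0, 3.0))
```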
## How was this patch tested?
Test cases added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19381 from WeichenXu123/single_prediction.
## What changes were proposed in this pull request?
Silhouette needs to know the number of features. This was obtained using `first` and checking the size of the vector. Although this works fine, if the number of attributes is present in the metadata, we can avoid triggering a job for this and use the metadata value instead. This can of course help improve performance.
## How was this patch tested?
existing UTs + added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20719 from mgaido91/SPARK-23568.
## What changes were proposed in this pull request?
Feature parity for KolmogorovSmirnovTest in MLlib.
Implement `DataFrame` interface for `KolmogorovSmirnovTest` in `mllib.stat`.
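Hedged usage sketch against a one-column DataFrame (result column names assumed):
```
import org.apache.spark.ml.stat.KolmogorovSmirnovTest

// Test the "sample" column against a standard normal distribution;
// returns a one-row DataFrame with pValue and statistic.
val result = KolmogorovSmirnovTest.test(df, "sample", "norm", 0.0, 1.0)
```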
## How was this patch tested?
Test suite added.
Author: WeichenXu <weichen.xu@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Closes#19108 from WeichenXu123/ml-ks-test.
## What changes were proposed in this pull request?
The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it.
## How was this patch tested?
added UTs + existing UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20600 from mgaido91/SPARK-23412.
## What changes were proposed in this pull request?
As I mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), there is a shuffle performance problem in ML RandomForest when training a RF on high-dimensional data.
The reason is that, in _org.apache.spark.tree.impl.RandomForest_, the function _findSplitsBySorting_ will actually flatmap a sparse vector into a dense vector, and then in groupByKey there will be a huge shuffle write size.
To avoid this, we can add a filter in the flatmap to filter out zero values. And in the function _findSplitsForContinuousFeature_, we can infer the number of zero values from the _metadata_.
In addition, if a feature only contains zero values, _continuousSplits_ will not have the key of the feature id. So I add a check when using _continuousSplits_.
## How was this patch tested?
Ran model locally using spark-submit.
Author: lucio <576632108@qq.com>
Closes#20472 from lucio-yz/master.
## What changes were proposed in this pull request?
adding Structured Streaming tests for all Models/Transformers in spark.ml.classification
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20121 from WeichenXu123/ml_stream_test_classification.
## What changes were proposed in this pull request?
Added subtree pruning in the translation from LearningNode to Node: a learning node having a single prediction value for all the leaves in the subtree rooted at it is translated into a LeafNode, instead of a (redundant) InternalNode
## How was this patch tested?
Added two unit tests under "mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala":
- test("SPARK-3159 tree model redundancy - classification")
- test("SPARK-3159 tree model redundancy - regression")
4 existing unit tests relying on the tree structure (existence of a specific redundant subtree) had to be adapted as the tested components in the output tree are now pruned (fixed by adding an extra _prune_ parameter which can be used to disable pruning for testing)
Author: Alessandro Solimando <18898964+asolimando@users.noreply.github.com>
Closes#20632 from asolimando/master.
## What changes were proposed in this pull request?
Adds structured streaming tests using testTransformer for these suites:
* BinarizerSuite
* BucketedRandomProjectionLSHSuite
* BucketizerSuite
* ChiSqSelectorSuite
* CountVectorizerSuite
* DCTSuite.scala
* ElementwiseProductSuite
* FeatureHasherSuite
* HashingTFSuite
## How was this patch tested?
It tests itself because it is a bunch of tests!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#20111 from jkbradley/SPARK-22883-streaming-featureAM.
## What changes were proposed in this pull request?
Converting spark.ml.recommendation tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
## How was this patch tested?
Automated: Pass the Jenkins.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes#20362 from gaborgsomogyi/SPARK-22886.
## What changes were proposed in this pull request?
Murmur3 hash generates a different value from the original and other implementations (like Scala standard library and Guava or so) when the length of a bytes array is not multiple of 4.
## How was this patch tested?
Added a unit test.
**Note: When we merge this PR, please give all the credits to Shintaro Murakami.**
Author: Shintaro Murakami <mrkm4ntr@gmail.com>
Author: gatorsmile <gatorsmile@gmail.com>
Author: Shintaro Murakami <mrkm4ntr@gmail.com>
Closes#20630 from gatorsmile/pr-20568.
## What changes were proposed in this pull request?
#### Problem:
Since 2.3, `Bucketizer` supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if `inputCols` and `outputCol` are both set, an error will be thrown.
However, when we write `Bucketizer`, it looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set on the created model instance as user-supplied params. So the default `outputCol` param in the `HasOutputCol` trait will be set in `paramMap` and become a user-supplied param. That makes the exclusive-param check fail.
#### Fix:
This changes the saving logic of Bucketizer to handle this case. This is a quick fix to make the 2.3 release in time. We should consider modifying the persistence mechanism later.
Please see the discussion in the JIRA.
Note: The multi-column `QuantileDiscretizer` also has the same issue.
## How was this patch tested?
Modified tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20594 from viirya/SPARK-23377-2.
## What changes were proposed in this pull request?
The PR provides an implementation of ClusteringEvaluator using the cosine distance measure.
This allows evaluating clustering results created using the cosine distance, introduced in SPARK-22119.
In the corresponding JIRA, there is a design document for the algorithm implemented here.
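Hedged usage sketch:
```
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
  .setDistanceMeasure("cosine")  // default assumed to be "squaredEuclidean"
val silhouette = evaluator.evaluate(predictions)
```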
## How was this patch tested?
Added UT which compares the result to the one provided by python sklearn.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20396 from mgaido91/SPARK-23217.
## What changes were proposed in this pull request?
Add some test cases for images feature
## How was this patch tested?
Add some test cases in ImageSchemaSuite
Author: xubo245 <601450868@qq.com>
Closes#20583 from xubo245/CARBONDATA23392_AddTestForImage.
## What changes were proposed in this pull request?
Cache the RDD of items in ml.FPGrowth before passing it to mllib.FPGrowth. Cache only when the user did not cache the input dataset of transactions. This fixes the warning about uncached data emerging from mllib.FPGrowth.
## How was this patch tested?
Manually:
1. Run ml.FPGrowthExample - warning is there
2. Apply the fix
3. Run ml.FPGrowthExample again - no warning anymore
Author: Arseniy Tashoyan <tashoyan@gmail.com>
Closes#20578 from tashoyan/SPARK-23318.
## What changes were proposed in this pull request?
In #19340 some comments suggested that spherical KMeans should be used when the cosine distance measure is specified, as Matlab does, instead of an implementation based on the behavior of other tools/libraries like RapidMiner, nltk and ELKI, i.e. computing the centroids as the mean of all the points in the clusters.
The PR introduces the approach used in spherical KMeans. This behavior has the nice property of minimizing the within-cluster cosine distance.
## How was this patch tested?
existing/improved UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20518 from mgaido91/SPARK-22119_followup.
## What changes were proposed in this pull request?
Audit new APIs and docs in 2.3.0.
## How was this patch tested?
No test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#20459 from yanboliang/SPARK-23107.
## What changes were proposed in this pull request?
Currently, the CountVectorizer has a minDF parameter.
It might be useful to also have a maxDF parameter.
It will be used as a threshold for filtering all the terms that occur very frequently in a text corpus, because they are not very informative or could even be stop-words.
This is analogous to scikit-learn, CountVectorizer, max_df.
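Hedged usage sketch:
```
import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(2)    // keep terms appearing in at least 2 documents
  .setMaxDF(0.8)  // drop terms appearing in more than 80% of documents
```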
Other changes:
- Refactored code to invoke "filter()" conditioned on maxDF or minDF set.
- Refactored code to unpersist input after counting is done.
## How was this patch tested?
Unit tests.
Author: Yacine Mazari <y.mazari@gmail.com>
Closes#20367 from ymazari/SPARK-23166.
## What changes were proposed in this pull request?
Currently shuffle repartition uses RoundRobinPartitioning; the generated result is nondeterministic since the order of the input rows is not determined.
The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:
upstream stage -> repartition stage -> result stage
(-> indicate a shuffle)
When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.
The following code returns 931532, instead of 1000000:
```
import scala.sys.process._
import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
x
}.repartition(200).map { x =>
if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
}
x
}
res.distinct().count()
```
In this PR, we propose the most straightforward way to fix this problem: perform a local sort before partitioning. After we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named `spark.sql.execution.sortBeforeRepartition` to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
This patch also changes the output rows ordering of repartition(), that leads to a bunch of test cases failure because they are comparing the results directly.
## How was this patch tested?
Add unit test in ExchangeSuite.
With this patch(and `spark.sql.execution.sortBeforeRepartition` set to true), the following query returns 1000000:
```
import scala.sys.process._
import org.apache.spark.TaskContext
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
x
}.repartition(200).map { x =>
if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
}
x
}
res.distinct().count()
res7: Long = 1000000
```
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes#20393 from jiangxb1987/shuffle-repartition.
## What changes were proposed in this pull request?
Currently the behavior is mixed where both single- and multi-column params are supported: in some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.
The PR throws an exception in `Bucketizer`, instead of logging a warning.
## How was this patch tested?
modified UT
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#19993 from mgaido91/SPARK-22799.
## What changes were proposed in this pull request?
When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color](https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)) constructor that sets alpha = 255, even for four-channel images (which may have different alpha values). This PR fixes this issue & adds a unit test to verify correctness of reading four-channel images.
## How was this patch tested?
Updates an existing unit test ("readImages pixel values test" in `ImageSchemaSuite`) to also verify correctness when reading a four-channel image.
Author: Sid Murching <sid.murching@databricks.com>
Closes#20389 from smurching/image-schema-bugfix.
## What changes were proposed in this pull request?
Correctly guard against empty datasets in `org.apache.spark.ml.classification.Classifier`
## How was this patch tested?
existing tests
Author: Matthew Tovbin <mtovbin@salesforce.com>
Closes#20321 from tovbinm/SPARK-23152.
## What changes were proposed in this pull request?
Currently, KMeans assumes the only possible distance measure is the Euclidean one. This PR aims to add cosine distance support to the KMeans algorithm.
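Hedged usage sketch:
```
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(3)
  .setDistanceMeasure("cosine")  // default assumed to be "euclidean"
val model = kmeans.fit(dataset)
```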
## How was this patch tested?
existing and added UTs.
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes#19340 from mgaido91/SPARK-22119.
## What changes were proposed in this pull request?
Make `ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])` support zero-length vectors
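Hedged sketch of what is now accepted:
```
import org.apache.spark.ml.linalg.Vectors

// A zero-length sparse vector no longer throws:
val empty = Vectors.sparse(0, Seq.empty[(Int, Double)])
```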
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#20275 from zhengruifeng/SparseVector_size.
## What changes were proposed in this pull request?
Fixed some typos found in ML scaladocs
## How was this patch tested?
NA
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#20300 from BryanCutler/ml-doc-typos-MINOR.
## What changes were proposed in this pull request?
RFormula should use VectorSizeHint & OneHotEncoderEstimator in its pipeline to avoid using the deprecated OneHotEncoder & to ensure the model produced can be used in streaming.
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20229 from MrBago/rFormula.
## What changes were proposed in this pull request?
This patch bumps the master branch version to `2.4.0-SNAPSHOT`.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#20222 from gatorsmile/bump24.
## What changes were proposed in this pull request?
Including VectorSizeHint in RFormula pipelines will allow them to be applied to streaming dataframes.
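Hedged usage sketch of VectorSizeHint, which declares the size of a vector column so the schema is fully known up front:
```
import org.apache.spark.ml.feature.VectorSizeHint

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)
  .setHandleInvalid("skip")  // drop rows whose vectors do not match the size
```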
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20238 from MrBago/rFormulaVectorSize.
## What changes were proposed in this pull request?
Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context.
## How was this patch tested?
No tests necessary, just a doc update.
Author: sethah <shendrickson@cloudera.com>
Closes#20188 from sethah/als_checkpoint_doc.
## What changes were proposed in this pull request?
Follow-up cleanups for the OneHotEncoderEstimator PR. See some discussion in the original PR: https://github.com/apache/spark/pull/19527 or read below for what this PR includes:
* configedCategorySize: I reverted this to return an Array. I realized the original setup (which I had recommended in the original PR) caused the whole model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in the comment in the previous PR. I think it's simpler but am open to suggestions.
I also made some small style cleanups based on IntelliJ warnings.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#20132 from jkbradley/viirya-SPARK-13030.
## What changes were proposed in this pull request?
Avoid holding all models in memory for `TrainValidationSplit`.
## How was this patch tested?
Existing tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20143 from MrBago/trainValidMemoryFix.
## What changes were proposed in this pull request?
This patch adds a new class `OneHotEncoderEstimator` which extends `Estimator`. The `fit` method returns `OneHotEncoderModel`.
Common methods between existing `OneHotEncoder` and new `OneHotEncoderEstimator`, such as transforming schema, are extracted and put into `OneHotEncoderCommon` to reduce code duplication.
### Multi-column support
`OneHotEncoderEstimator` adds simpler multi-column support because it is new API and can be free from backward compatibility.
### handleInvalid Param support
`OneHotEncoderEstimator` supports `handleInvalid` Param. It supports `error` and `keep`.
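Hedged usage sketch of the new estimator:
```
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("cat1", "cat2"))
  .setOutputCols(Array("cat1Vec", "cat2Vec"))
  .setHandleInvalid("keep")  // "error" (the default) or "keep"
val model = encoder.fit(df)
```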
## How was this patch tested?
Added new test suite `OneHotEncoderEstimatorSuite`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19527 from viirya/SPARK-13030.
## What changes were proposed in this pull request?
Previously, `FeatureHasher` always treats numeric type columns as numbers and never as categorical features. It is quite common to have categorical features represented as numbers or codes in data sources.
In order to hash these features as categorical, users must first explicitly convert them to strings, which is cumbersome.
Add a new param `categoricalCols` which specifies the numeric columns that should be treated as categorical features.
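A hedged sketch of the new param (the column names are hypothetical):
```
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("region", "zipCode", "income")
  .setCategoricalCols(Array("zipCode"))  // hash this numeric column as categorical
  .setOutputCol("features")
val hashed = hasher.transform(df)
```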
## How was this patch tested?
New unit tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#19991 from MLnick/hasher-num-cat.
## What changes were proposed in this pull request?
add multi columns support to QuantileDiscretizer.
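A hedged sketch of the resulting multi-column API (column names and bucket counts are illustrative only):
```
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("hour", "clicks"))
  .setOutputCols(Array("hourBucket", "clicksBucket"))
  .setNumBucketsArray(Array(5, 10))  // per-column bucket counts
val model = discretizer.fit(df)
```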
When calculating the splits, we can either merge all the probabilities into one array by calculating approxQuantiles on multiple columns at once, or compute approxQuantiles separately for each column. After doing the performance comparison, we found it's better to calculate approxQuantiles on multiple columns at once.
Here is how we measured the performance time:
```
var duration = 0.0
for (i <- 0 until 10) {
  val start = System.nanoTime()
  discretizer.fit(df)
  val end = System.nanoTime()
  duration += (end - start) / 1e9
}
println(duration / 10)
```
Here is the performance test result:
|numCols |numRows |compute each column's approxQuantiles separately (s)|compute approxQuantiles on multiple columns at once (s)|
|--------|----------|--------------------------------|-------------------------------------------|
|10 |60 |0.3623195839 |0.1626658607 |
|10 |6000 |0.7537239841 |0.3869370046 |
|22 |6000 |1.6497598557 |0.4767903059 |
|50 |6000 |3.2268305752 |0.7217818396 |
## How was this patch tested?
Added unit tests in QuantileDiscretizerSuite to test multi-column support.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#19715 from huaxingao/spark_22397.
## What changes were proposed in this pull request?
Add StructuredStreaming tests to the ML regression package testsuite.
In order to make the testsuite easier to modify, a new helper function was added in `MLTest`:
```
def testTransformerByGlobalCheckFunc[A : Encoder](
dataframe: DataFrame,
transformer: Transformer,
firstResultCol: String,
otherResultCols: String*)
(globalCheckFunction: Seq[Row] => Unit): Unit
```
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Author: Bago Amirbekian <bago@databricks.com>
Closes#19979 from WeichenXu123/ml_stream_test.
## What changes were proposed in this pull request?
Python API for VectorSizeHint Transformer.
## How was this patch tested?
doc-tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20112 from MrBago/vectorSizeHint-PythonAPI.
## What changes were proposed in this pull request?
make sure model data is stored in order. WeichenXu123
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#20113 from zhengruifeng/gmm_save.
## What changes were proposed in this pull request?
Currently, in `ChiSqSelectorModel`, save:
```
spark.createDataFrame(dataArray).repartition(1).write...
```
The default partition number used by createDataFrame is "defaultParallelism", and the current RoundRobinPartitioning won't guarantee that "repartition" generates the same order as the local array. We need to fix it.
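One way to guarantee ordering, shown here as a hedged sketch rather than this PR's exact change, is to build the DataFrame from a single-partition RDD so no repartition shuffle can reorder the rows (`dataArray` and `dataPath` are the hypothetical local model data and save path):
```
// A single-partition RDD preserves the local array order, avoiding the
// reorder that repartition(1)'s RoundRobinPartitioning can introduce.
val df = spark.createDataFrame(sc.makeRDD(dataArray, numSlices = 1))
df.write.parquet(dataPath)
```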
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20088 from WeichenXu123/fix_chisq_model_save.
## What changes were proposed in this pull request?
Fix OneVsRestModel transform failing on streaming data.
## How was this patch tested?
UT will be added soon, once #19979 is merged. (Need a helper test method there)
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20077 from WeichenXu123/fix_ovs_model_transform.
## What changes were proposed in this pull request?
Via some tests I found that CrossValidator still has a memory issue: when fitting it will occupy `O(n*sizeof(model))` memory for holding models, whereas if well optimized it should be `O(parallelism*sizeof(model))`.
This is because modelFutures will hold a reference to the model object after the future completes (we can use `future.value.get.get` to fetch it), and `Future.sequence` and the `modelFutures` array hold references to each model future. So all model objects stay referenced, and memory usage remains `O(n*sizeof(model))`.
I fixed this by merging `modelFuture` and `foldMetricFuture` together, and using an `AtomicInteger` to count completed fitting tasks; when all are done, `trainingDataset.unpersist` is triggered.
I commented on this issue on the old PR [SPARK-19357]
https://github.com/apache/spark/pull/16774#pullrequestreview-53674264
but unfortunately at that time I did not realize the issue still existed; now I have confirmed it and created this PR to fix it.
## Discussion
I give 3 approaches which we can compare; after discussion I realized none of them is ideal, so we have to make a trade-off.
**After discussion with jkbradley, we chose approach 3**
### Approach 1
~~The approach proposed by MrBago at~~ https://github.com/apache/spark/pull/19904#discussion_r156751569
~~This approach resolves the model object reference issue and allows the model objects to be GCed in time. **BUT, in some cases, it still does not resolve the O(N) model memory occupation issue**. Let me use an extreme case to describe it:~~
~~Suppose we set `parallelism = 1`, and there are 100 paramMaps, so we have 100 fitting & evaluation tasks. In this approach, because `parallelism = 1`, the code has to wait for all 100 fitting tasks to complete **(at this time the memory occupied by models already reaches 100 * sizeof(model))**, and only then will it unpersist the training dataset and run the 100 evaluation tasks.~~
### Approach 2
~~This approach is my PR's old version of the code~~ 2cc7c28f38
~~This approach can make sure that in any case the peak memory occupied by models is `O(numParallelism * sizeof(model))`, but it has an issue: in some extreme cases, the "unpersist training dataset" will be delayed until most of the evaluation tasks complete. Suppose the case
`parallelism = 1`, and there are 100 fitting & evaluation tasks; each fitting & evaluation task has to be executed one by one, so only after the first 99 fitting & evaluation tasks and the 100th fitting task complete will the "unpersist training dataset" be triggered.~~
### Approach 3
After I compared approach 1 and approach 2, I realized that, in the case where parallelism is low but there are many fitting & evaluation tasks, we cannot achieve both of the following two goals:
- Make the peak memory occupied by models (driver-side) O(parallelism * sizeof(model))
- Unpersist the training dataset before most of the evaluation tasks start.
So I vote for a simpler approach: move the unpersist of the training dataset to the end (does this really matter?).
Because goal 1 is more important, we must make sure the peak memory occupied by models (driver-side) is O(parallelism * sizeof(model)); otherwise there is a high risk of OOM.
Like the following code:
```
val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
  Future[Double] {
    val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
    // ... other minor code
    val metric = eval.evaluate(model.transform(validationDataset, paramMap))
    logDebug(s"Got metric $metric for model trained with $paramMap.")
    metric
  } (executionContext)
}
val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
trainingDataset.unpersist() // <------- unpersist at the end
validationDataset.unpersist()
```
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19904 from WeichenXu123/fix_cross_validator_memory_issue.
## What changes were proposed in this pull request?
A new VectorSizeHint transformer was added. This transformer is meant to be used as a pipeline stage ahead of VectorAssembler, on vector columns, so that VectorAssembler can join vectors in a streaming context where the size of the input vectors is otherwise not known.
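A hedged sketch of the intended usage (the size and column names are made up):
```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)                 // declare the vector size up front
  .setHandleInvalid("error")
val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "hourOfDay"))
  .setOutputCol("features")
// With the size hint in place, the pipeline can run on streaming dataframes.
val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
```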
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#19746 from MrBago/vector-size-hint.
## What changes were proposed in this pull request?
register following classes in Kryo:
`org.apache.spark.mllib.regression.LabeledPoint`
`org.apache.spark.mllib.clustering.VectorWithNorm`
`org.apache.spark.ml.feature.LabeledPoint`
`org.apache.spark.ml.tree.impl.TreePoint`
`org.apache.spark.ml.tree.impl.BaggedPoint` seems to also need to be registered, but I don't know how to do it in a safe way.
WeichenXu123 cloud-fan
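For illustration, a hedged sketch of how a user could register the public classes before this change (the private `TreePoint`, `BaggedPoint`, and `VectorWithNorm` can only be registered from inside Spark, which is why this PR registers them by default):
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[org.apache.spark.mllib.regression.LabeledPoint],
    classOf[org.apache.spark.ml.feature.LabeledPoint]))
```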
## How was this patch tested?
added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#19950 from zhengruifeng/labeled_kryo.
## What changes were proposed in this pull request?
Make several improvements in dataframe vectorized summarizer.
1. Make the summarizer return `Vector` type for all metrics (except "count").
Previously it returned "WrappedArray", which wasn't very convenient.
2. Make `MetricsAggregate` inherit the `ImplicitCastInputTypes` trait, so it can check and implicitly cast input values.
3. Add a "weight" parameter for all single metric methods (see the sketch below).
4. Update the doc and improve the example code in the doc.
5. Simplified test cases.
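A hedged sketch of the resulting API (the `features`/`weight` column names are assumptions):
```
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Multiple metrics at once; each metric (except count) comes back as a Vector:
val summary = df.select(
  Summarizer.metrics("mean", "variance").summary(col("features"), col("weight")))

// A single metric with a weight column:
val meanVec = df.select(Summarizer.mean(col("features"), col("weight")))
```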
## How was this patch tested?
Test added and simplified.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19156 from WeichenXu123/improve_vec_summarizer.
## What changes were proposed in this pull request?
unpersist unused datasets
## How was this patch tested?
existing tests and local check in Spark-Shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#20017 from zhengruifeng/bkm_unpersist.
## What changes were proposed in this pull request?
MLlib ```LinearRegression``` supports _huber_ loss in addition to _leastSquares_ loss. The huber loss objective function is:
![image](https://user-images.githubusercontent.com/1962026/29554124-9544d198-8750-11e7-8afa-33579ec419d5.png)
Refer to Eq. (6) and Eq. (8) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf). This objective is jointly convex as a function of (w, σ) ∈ R × (0,∞), so we can use L-BFGS-B to solve it.
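In case the image above does not render, here is a LaTeX transcription of the objective (my reconstruction from the sklearn docs and the paper, not copied from this PR):
```
\min_{w,\sigma} \sum_{i=1}^{n} \left( \sigma + H_\epsilon\!\left(\frac{x_i^{T} w - y_i}{\sigma}\right) \sigma \right) + \lambda \|w\|_2^2,
\quad\text{where}\quad
H_\epsilon(z) = \begin{cases} z^2 & \text{if } |z| < \epsilon, \\ 2\epsilon |z| - \epsilon^2 & \text{otherwise.} \end{cases}
```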
The current implementation is a straightforward port of the Python scikit-learn [```HuberRegressor```](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html). There are some differences:
* We use mean loss (```lossSum/weightSum```), but sklearn uses total loss (```lossSum```).
* We multiply the loss function and L2 regularization by 1/2. Multiplying the whole formula by a constant factor does not affect the result; we just keep consistent with the _leastSquares_ loss.
So when fitting w/o regularization, MLlib and sklearn produce the same output. When fitting w/ regularization, MLlib should set ```regParam``` to the sklearn value divided by the number of instances to match the output of sklearn.
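A hedged usage sketch of the new option (the parameter values are illustrative only):
```
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLoss("huber")
  .setEpsilon(1.35)  // robustness parameter; 1.35 mirrors sklearn's default
  .setRegParam(0.1)
val model = lr.fit(training)
```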
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#19020 from yanboliang/spark-3181.
## What changes were proposed in this pull request?
only drops the rows containing NaN in the input columns
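A hedged sketch of the behavior (splits and column names are made up): with `handleInvalid = "skip"`, a row is dropped only if one of the configured input columns contains NaN:
```
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1Bucket", "f2Bucket"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 10.0, Double.PositiveInfinity)))
  .setHandleInvalid("skip")  // drops rows with NaN in f1 or f2 only
val bucketed = bucketizer.transform(df)
```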
## How was this patch tested?
existing tests and added tests
Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#19894 from zhengruifeng/bucketizer_nan.
## What changes were proposed in this pull request?
We need to add some helper code to make testing ML transformers & models easier with streaming data. These tests might help us catch any remaining issues, and we could encourage future PRs to use these tests to prevent new Models & Transformers from having issues.
I added an `MLTest` trait which extends the `StreamTest` trait and overrides `createSparkSession`, so an ML testsuite need only extend `MLTest` to use both ML & Stream test util functions.
I only modified one testcase in `LinearRegressionSuite`, for a first-pass review.
Link to #19746
## How was this patch tested?
`MLTestSuite` added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19843 from WeichenXu123/ml_stream_test_helper.