ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Liang-Chi Hsieh	a927c764c1	[SPARK-26559][ML][PYSPARK] ML image can't work with numpy versions prior to 1.9
## What changes were proposed in this pull request?
Due to an [API change](https://github.com/numpy/numpy/pull/4257/files#diff-c39521d89f7e61d6c0c445d93b62f7dc) in 1.9, PySpark image doesn't work with numpy versions prior to 1.9. When running the image test with a numpy version prior to 1.9, we see this error:
```
test_read_images (pyspark.ml.tests.test_image.ImageReaderTest) ... ERROR
test_read_images_multiple_times (pyspark.ml.tests.test_image.ImageReaderTest2) ... ok

======================================================================
ERROR: test_read_images (pyspark.ml.tests.test_image.ImageReaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/viirya/docker_tmp/repos/spark-1/python/pyspark/ml/tests/test_image.py", line 36, in test_read_images
    self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row)
  File "/Users/viirya/docker_tmp/repos/spark-1/python/pyspark/ml/image.py", line 193, in toImage
    data = bytearray(array.astype(dtype=np.uint8).ravel().tobytes())
AttributeError: 'numpy.ndarray' object has no attribute 'tobytes'

----------------------------------------------------------------------
Ran 2 tests in 29.040s

FAILED (errors=1)
```
## How was this patch tested?
Manually tested with numpy versions before and after 1.9.
Closes #23484 from viirya/fix-pyspark-image. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-07 18:36:52 +08:00
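The `AttributeError` in SPARK-26559 comes from calling `tobytes()` on an array object that predates that method. The sketch below shows one generic way to tolerate both spellings; it is not the code from the patch. The helper name `flat_array_to_bytearray` is hypothetical, and it assumes the older object exposes `tostring()`, which is what numpy offered before 1.9.

```python
def flat_array_to_bytearray(flat):
    """Serialize a flattened uint8 array-like object into a bytearray.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Newer numpy arrays expose tobytes(); look it up without assuming it exists.
    to_bytes = getattr(flat, "tobytes", None)
    if to_bytes is None:
        # Assumption: older array objects expose tostring() instead.
        to_bytes = flat.tostring
    return bytearray(to_bytes())
```

With a real array the call would look like `flat_array_to_bytearray(array.astype(dtype=np.uint8).ravel())`, mirroring the expression in the traceback.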
Marco Gaido	001d309538	[SPARK-25765][ML] Add training cost to BisectingKMeans summary ## What changes were proposed in this pull request? The PR adds the `trainingCost` value to the `BisectingKMeansSummary`, in order to expose the information retrievable by running `computeCost` on the training dataset. This fills the gap with `KMeans` implementation. ## How was this patch tested? improved UTs Closes #22764 from mgaido91/SPARK-25765. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-01 09:18:58 -06:00
Jing Chen He	860f4497f2	[SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel ## What changes were proposed in this pull request? If the input parameter 'threshold' to the function approxSimilarityJoin is not a float, we would get an exception. The fix is to convert the 'threshold' into a float before calling the java implementation method. ## How was this patch tested? Added a new test case. Without this fix, the test will throw an exception as reported in the JIRA. With the fix, the test passes. Closes #23313 from jerryjch/SPARK-26315. Authored-by: Jing Chen He <jinghe@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-15 08:41:16 -06:00
Huaxin Gao	05cf81e6de	[SPARK-19827][R] spark.ml R API for PIC ## What changes were proposed in this pull request? Add PowerIterationCluster (PIC) in R ## How was this patch tested? Add test case Closes #23072 from huaxingao/spark-19827. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-10 18:28:13 -06:00
Huaxin Gao	20278e719e	[SPARK-24333][ML][PYTHON] Add fit with validation set to spark.ml GBT: Python API ## What changes were proposed in this pull request? Add validationIndicatorCol and validationTol to GBT Python. ## How was this patch tested? Add test in doctest to test the new API. Closes #21465 from huaxingao/spark-24333. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-12-07 13:53:35 -08:00
Liang-Chi Hsieh	169d9ad8f1	[SPARK-26133][ML][FOLLOWUP] Fix doc for OneHotEncoder ## What changes were proposed in this pull request? This fixes doc of renamed OneHotEncoder in PySpark. ## How was this patch tested? N/A Closes #23230 from viirya/remove_one_hot_encoder_followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-05 19:30:25 +08:00
Hyukjin Kwon	518a3d10c8	[SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small cleanup and deduplication in ml/mllib tests ## What changes were proposed in this pull request? This PR is a small follow-up that moves some logic and functions into a smaller, more local scope and deduplicates them. ## How was this patch tested? Manually tested. Jenkins tests as well. Closes #23200 from HyukjinKwon/followup-SPARK-26034-SPARK-26033. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-12-03 14:03:10 -08:00
Liang-Chi Hsieh	8bfea86b1c	[SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder ## What changes were proposed in this pull request? We have deprecated `OneHotEncoder` at Spark 2.3.0 and introduced `OneHotEncoderEstimator`. At 3.0.0, we remove deprecated `OneHotEncoder` and rename `OneHotEncoderEstimator` to `OneHotEncoder`. TODO: According to ML migration guide, we need to keep `OneHotEncoderEstimator` as an alias after renaming. This is not done at this patch in order to facilitate review. ## How was this patch tested? Existing tests. Closes #23100 from viirya/remove_one_hot_encoder. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-29 01:54:06 +00:00
Katrin Leinweber	c5daccb1da	[MINOR] Update all DOI links to preferred resolver ## What changes were proposed in this pull request? The DOI foundation recommends [this new resolver](https://www.doi.org/doi_handbook/3_Resolution.html#3.8). Accordingly, this PR re`sed`s all static DOI links ;-) ## How was this patch tested? It wasn't, since it seems as safe as a "[typo fix](https://spark.apache.org/contributing.html)". In case any of the files is included from other projects, and should be updated there, please let me know. Closes #23129 from katrinleinweber/resolve-DOIs-securely. Authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-25 17:43:55 -06:00
Marco Gaido	dd8c179c28	[SPARK-25867][ML] Remove KMeans computeCost ## What changes were proposed in this pull request? The PR removes the deprecated method `computeCost` of `KMeans`. ## How was this patch tested? NA Closes #22875 from mgaido91/SPARK-25867. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-22 15:45:25 -06:00
hyukjinkwon	bbbdaa82a4	[SPARK-26105][PYTHON] Clean unittest2 imports up that were added for Python 2.6 before
## What changes were proposed in this pull request?
Currently, some PySpark tests still assume they could be run on Python 2.6 by importing `unittest2`. For instance:
```python
import sys

# Fall back to the unittest2 backport on Python 2.6 or earlier.
if sys.version_info[:2] <= (2, 6):
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
        sys.exit(1)
else:
    import unittest
```
While I am here, I removed some unused imports and reordered imports per PEP 8. We officially dropped Python 2.6 support a while ago and have started to discuss dropping Python 2, so it's better to remove these imports.
## How was this patch tested?
Manual tests, and existing tests via Jenkins.
Closes #23077 from HyukjinKwon/SPARK-26105. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-19 09:22:32 +08:00
Bryan Cutler	034ae305c3	[SPARK-26033][PYTHON][TESTS] Break large ml/tests.py file into smaller files
## What changes were proposed in this pull request?
This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain. The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_base.py
│   │   ├── test_evaluation.py
│   │   ├── test_feature.py
│   │   ├── test_image.py
│   │   ├── test_linalg.py
│   │   ├── test_param.py
│   │   ├── test_persistence.py
│   │   ├── test_pipeline.py
│   │   ├── test_stat.py
│   │   ├── test_training_summary.py
│   │   ├── test_tuning.py
│   │   └── test_wrapper.py
...
├── testing
...
│   ├── mlutils.py
...
```
## How was this patch tested?
Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all passing with Python 2.7 and Python 3.6.
Closes #23063 from BryanCutler/python-test-breakup-ml-SPARK-26033. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-18 16:02:15 +08:00
hyukjinkwon	03306a6df3	[SPARK-26036][PYTHON] Break large tests.py files into smaller files
## What changes were proposed in this pull request?
This PR continues breaking large files down into smaller files. See https://github.com/apache/spark/pull/23021. It aims to follow the layout of https://github.com/numpy/numpy/tree/master/numpy. Basically, this PR proposes to break down `pyspark/tests.py` into ...:
```
pyspark
...
├── testing
...
│   └── utils.py
├── tests
│   ├── __init__.py
│   ├── test_appsubmit.py
│   ├── test_broadcast.py
│   ├── test_conf.py
│   ├── test_context.py
│   ├── test_daemon.py
│   ├── test_join.py
│   ├── test_profiler.py
│   ├── test_rdd.py
│   ├── test_readwrite.py
│   ├── test_serializers.py
│   ├── test_shuffle.py
│   ├── test_taskcontext.py
│   ├── test_util.py
│   └── test_worker.py
...
```
## How was this patch tested?
Existing tests should cover this. `cd python` and `./run-tests-with-coverage`. Manually checked that they are actually being run. Each test can (unofficially) be run via:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```
Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.
Closes #23033 from HyukjinKwon/SPARK-26036. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-15 12:30:52 +08:00
李亮	e503065fd8	[SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem
## What changes were proposed in this pull request?
Fix `fastSquaredDistance` to resolve a performance problem in the dense-dense case and, at the same time, improve the calculation accuracy.
## How was this patch tested?
Tested from several angles after applying the patch: the dense-dense case gets faster; timings for the other cases (sparse-sparse, sparse-dense) are listed below for comparison.

Calculation logic test, covering the sparse-sparse, sparse-dense, and dense-dense cases. For the sparse-sparse and sparse-dense cases, the branch paths are defined as follows:
- LOGIC1: precisionBound1 is met
- LOGIC2: neither precisionBound1 nor precisionBound2 is met
- LOGIC3: precisionBound1 is not met, but precisionBound2 is met

(As a trick, you can manually change the precision value to hit each of the branches above.)

Sparse-sparse case, time cost (milliseconds):
- LOGIC1: before patch 7786, 7970, 8086; after patch 7729, 7653, 7903
- LOGIC2: before patch 8412, 9029, 8606; after patch 8603, 8724, 9024
- LOGIC3: before patch 19365, 19146, 19351; after patch 18917, 19007, 19074

Sparse-dense case, time cost (milliseconds):
- LOGIC1: before patch 4195, 4014, 4409; after patch 4081, 3971, 4151
- LOGIC2: before patch 4968, 5579, 5080; after patch 4980, 5472, 5148
- LOGIC3: before patch 11848, 12077, 12168; after patch 11718, 11874, 11743

For the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance. Dense-dense case, time cost (milliseconds):
- before patch 7340, 7816, 7672; after patch 5752, 5800, 5753

Real-world data test. The test data is built from the files PS1, PS2, PS3, PS4, PS5, PS6 extracted from http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems. It has 13230 instances in total, with 6000 attributes per line. Sparse-sparse case, time cost (milliseconds):
- before enhancement 7670, 7704, 7652; after enhancement 7634, 7729, 7645

Closes #22893 from KyleLi1985/updatekmeanpatch. Authored-by: 李亮 <liang.li.work@outlook.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-14 07:24:13 -08:00
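The SPARK-25868 message contrasts a direct squared-distance computation (sqdist) with the other paths but does not show code. The sketch below only illustrates two generic ways to compute a squared Euclidean distance on dense vectors; it is not Spark's `fastSquaredDistance`. The function names are hypothetical and the vectors are plain Python lists.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def sqdist_direct(a, b):
    """Sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sqdist_from_norms(a, b):
    """Same quantity via ||a||^2 + ||b||^2 - 2 * (a . b).

    This form can reuse precomputed norms, but it may lose precision
    through cancellation when a and b are very close to each other.
    """
    return dot(a, a) + dot(b, b) - 2.0 * dot(a, b)
```

Both functions agree on well-separated inputs; the trade-off between them is precision versus the ability to reuse norms.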
Sean Owen	510ec77a60	[SPARK-19714][DOCS] Clarify Bucketizer handling of invalid input ## What changes were proposed in this pull request? Clarify Bucketizer handleInvalid docs. Just a resubmit of https://github.com/apache/spark/pull/17169 ## How was this patch tested? N/A Closes #23003 from srowen/SPARK-19714. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-11 09:21:40 -06:00
Sean Owen	c0d1bf0322	[MINOR] Fix typos and misspellings ## What changes were proposed in this pull request? Fix typos and misspellings, per https://github.com/apache/spark-website/pull/158#issuecomment-435790366 ## How was this patch tested? Existing tests. Closes #22950 from srowen/Typos. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-05 17:34:23 -06:00
Marco Gaido	fc10c898f4	[SPARK-25758][ML] Deprecate computeCost in BisectingKMeans ## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes #22869 from mgaido91/SPARK-25758_3.0. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-05 22:13:20 +00:00
Wenchen Fan	2fbbcd0d27	Revert "[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans" This reverts commit `c2962546d9`.	2018-10-21 09:12:29 +08:00
Marco Gaido	c2962546d9	[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans ## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes #22756 from mgaido91/SPARK-25758. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-18 10:32:25 -07:00
Sean Owen	08c76b5d39	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 (This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231) ## What changes were proposed in this pull request? Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines. ## How was this patch tested? Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure. Closes #22400 from srowen/SPARK-25238.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-13 11:19:43 +08:00
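W605 flags backslash sequences such as `\d` inside ordinary string literals, where Python only passes them through because they happen not to be recognized escapes. A small sketch of the raw-string style the commit describes follows; the pattern and the helper name `parse_major_minor` are made up for illustration and do not come from the Spark sources.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged, so the
# pattern does not depend on Python leaving unknown escapes alone.
VERSION_RE = re.compile(r"(\d+)\.(\d+)")


def parse_major_minor(text):
    """Return (major, minor) from the first 'N.M' found in text, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_major_minor("pycodestyle 2.4")` yields the pair of integers for that version.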
WeichenXu	08c02e637a	[SPARK-25345][ML] Deprecate public APIs from ImageSchema ## What changes were proposed in this pull request? Deprecate public APIs from ImageSchema. ## How was this patch tested? N/A Closes #22349 from WeichenXu123/image_api_deprecate. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2018-09-08 09:09:14 -07:00
WeichenXu	925449283d	[SPARK-22666][ML][SQL] Spark datasource for image format
## What changes were proposed in this pull request?
Implement an image schema datasource. This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior as `ImageSchema.readImage`)
- path wildcard matching (the same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (different from `ImageSchema.readImage`, but use a path such as `/path/to/dir/`)

This datasource does NOT support:
- specifying `numPartitions` (it is determined automatically by the datasource)
- sampling (you can use `df.sample` afterwards, but the sampling operator won't be pushed down to the datasource)
## How was this patch tested?
Unit tests.
## Benchmark
I benchmarked and compared the time cost of the old `ImageSchema.read` API and my image datasource.
- cluster: 4 nodes, each with 64GB memory and an 8-core CPU
- test dataset: Flickr8k_Dataset (about 8091 images)

Time cost:
- My image datasource time (automatically generate 258 partitions): 38.04s
- `ImageSchema.read` time (set 16 partitions): 68.4s
- `ImageSchema.read` time (set 258 partitions): 90.6s

Time cost when doubling the number of images (clone Flickr8k_Dataset and load twice as many images):
- My image datasource time (automatically generate 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s

So we can see that my image datasource implementation (this PR) brings some performance improvement compared with the old `ImageSchema.read` API.
Closes #22328 from WeichenXu123/image_datasource. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2018-09-05 11:59:00 -07:00
Marco Gaido	a3dccd24c2	[SPARK-10697][ML] Add lift to Association rules ## What changes were proposed in this pull request? The PR adds the lift measure to Association rules. ## How was this patch tested? existing and modified UTs Closes #22236 from mgaido91/SPARK-10697. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-01 18:07:58 -05:00
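For readers unfamiliar with the measure added in SPARK-10697: the lift of a rule X → Y is conventionally defined as confidence(X → Y) divided by support(Y). The sketch below computes it over a small in-memory list of transactions. The function name and data layout are assumptions for illustration and are unrelated to Spark's FP-growth implementation.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item lists.

    Returns None when the antecedent or the consequent never occurs.
    """
    ante, cons = set(antecedent), set(consequent)
    total = len(transactions)
    count_ante = sum(1 for t in transactions if ante <= set(t))
    count_cons = sum(1 for t in transactions if cons <= set(t))
    count_both = sum(1 for t in transactions if (ante | cons) <= set(t))
    if total == 0 or count_ante == 0 or count_cons == 0:
        return None
    confidence = count_both / float(count_ante)   # P(Y | X)
    support_cons = count_cons / float(total)      # P(Y)
    return confidence / support_cons
```

A value above 1 suggests the antecedent and consequent co-occur more often than independence would predict; a value below 1 suggests the opposite.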
Huaxin Gao	b5e1188087	[SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. ## How was this patch tested? I tested the changes on my local. Closes #22136 from huaxingao/spark-25124. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	2018-08-23 16:17:27 -07:00
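The bug class in SPARK-25124 is easy to reproduce outside Spark: a setter or getter that computes the right thing but forgets the `return`. The stand-in class below is hypothetical and is not `VectorSizeHint`. It only shows the shape of a fixed accessor pair, with the setter returning `self` so calls can be chained.

```python
class SizeHintSketch(object):
    """Hypothetical stand-in holding a single 'size' parameter."""

    def __init__(self):
        self._size = None

    def set_size(self, value):
        self._size = value
        # Without this return, chained calls would operate on None.
        return self

    def get_size(self):
        # Without this return, callers would always receive None.
        return self._size
```

With both returns in place, `SizeHintSketch().set_size(3).get_size()` hands back the stored value instead of failing on `None`.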
Marco Gaido	20fa456932	[SPARK-25090][ML] Enforce implicit type coercion in ParamGridBuilder ## What changes were proposed in this pull request? When the grid of the parameters is created in `ParamGridBuilder`, the implicit type coercion is not enforced. So using an integer in the list of parameters to set for a parameter accepting a double can cause a class cast exception. The PR proposes to enforce the type coercion when building the parameters. ## How was this patch tested? added UT Closes #22076 from mgaido91/SPARK-25090. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-13 09:11:37 +08:00
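The coercion described in SPARK-25090 can be pictured as applying a per-parameter converter to every candidate value before the grid is expanded. The following is a minimal sketch of that idea using plain dictionaries. `build_param_grid` and the `converters` mapping are hypothetical and do not mirror the `ParamGridBuilder` internals.

```python
from itertools import product


def build_param_grid(grid, converters):
    """Expand {name: [values]} into a list of parameter dicts.

    `converters` maps a parameter name to a callable (for example `float`)
    that is applied to each of that parameter's values first.
    """
    names = list(grid)
    coerced = []
    for name in names:
        convert = converters.get(name, lambda value: value)
        coerced.append([convert(value) for value in grid[name]])
    return [dict(zip(names, combo)) for combo in product(*coerced)]
```

Passing `{"regParam": [0, 1]}` with `{"regParam": float}` yields float values in every grid entry, so a downstream consumer that expects doubles never sees an integer.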
Kazuhiro Sera	8ec25cd67e	Fix typos detected by github.com/client9/misspell ## What changes were proposed in this pull request? Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell). This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know. ## How was this patch tested? ### before ``` $ misspell . \| grep -v '.js' R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition" NOTICE-binary:454:16: "containd" is a misspelling of "contained" R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition" R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition" R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence" R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred" R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output" R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment" common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent" common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent" common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred" 
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin" core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden" core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments" dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual" 
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across" dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across" dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments" docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden" docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes" docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN" docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior" examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract" examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" 
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions" python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment" python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a 
misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability" python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter" python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability" python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter" python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns" python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization" python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary" resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints" resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive" 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when" sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp" sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage" sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred" sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence" 
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing" sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with" sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring" ``` ### after ``` $ misspell . 
| grep -v '.js' common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" 
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" ``` Closes #22070 from seratch/fix-typo. Authored-by: Kazuhiro Sera <seratch@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-11 21:23:36 -05:00
Yuhao Yang	ebf33a333e	[SPARK-25011][ML] add prefix to __all__ in fpm.py ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-25011 add prefix to __all__ in fpm.py ## How was this patch tested? existing unit test. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #21981 from hhbyyh/prefixall.	2018-08-03 15:02:41 +08:00
Huaxin Gao	42dfe4f159	[SPARK-24973][PYTHON] Add numIter to Python ClusteringSummary ## What changes were proposed in this pull request? Add numIter to Python version of ClusteringSummary ## How was this patch tested? Modified existing UT test_multiclass_logistic_regression_summary Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21925 from huaxingao/spark-24973.	2018-07-31 15:23:11 -05:00
zhengruifeng	1223a201fc	[SPARK-24609][ML][DOC] PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well ## What changes were proposed in this pull request? update doc of RandomForestClassifier.featureSubsetStrategy ## How was this patch tested? local built doc rdoc: ![default](https://user-images.githubusercontent.com/7322292/42807787-4dda6362-89e4-11e8-839f-a8519b7c1f1c.png) pydoc: ![default](https://user-images.githubusercontent.com/7322292/43112817-5f1d4d88-8f2a-11e8-93ff-de90db8afdca.png) Author: zhengruifeng <ruifengz@foxmail.com> Closes #21788 from zhengruifeng/rf_doc_py_r.	2018-07-31 13:37:13 -05:00
Marco Gaido	cc4d64bb16	[SPARK-23451][ML] Deprecate KMeans.computeCost ## What changes were proposed in this pull request? Deprecate `KMeans.computeCost` which was introduced as a temp fix and now it is not needed anymore, since we introduced `ClusteringEvaluator`. ## How was this patch tested? manual test (deprecation warning displayed) Scala ``` ... scala> model.computeCost(dataset) warning: there was one deprecation warning; re-run with -deprecation for details res1: Double = 0.0 ``` Python ``` >>> import warnings >>> warnings.simplefilter('always', DeprecationWarning) ... >>> model.computeCost(df) /Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead. " instead.", DeprecationWarning) ``` Author: Marco Gaido <marcogaido91@gmail.com> Closes #20629 from mgaido91/SPARK-23451.	2018-07-20 09:18:57 -07:00
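The deprecation pattern this commit describes can be sketched in plain Python. This is a minimal illustration, not the PySpark implementation: `ToyKMeansModel`, `compute_cost`, and `call_and_collect_warnings` are hypothetical names, and the "cost" is a toy sum of squares. Only the warning text is taken from the commit message.

```python
import warnings


class ToyKMeansModel:
    """Hypothetical stand-in for a clustering model; not the PySpark class."""

    def compute_cost(self, points):
        # Emit the deprecation message quoted in the commit, then do the work.
        warnings.warn(
            "Deprecated in 2.4.0. It will be removed in 3.0.0. "
            "Use ClusteringEvaluator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Toy cost: sum of squared values (assumption, for illustration only).
        return sum(x * x for x in points)


def call_and_collect_warnings(model, points):
    """Call compute_cost and return (cost, list of deprecation messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often hidden by default, so force it on.
        warnings.simplefilter("always", DeprecationWarning)
        cost = model.compute_cost(points)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return cost, messages
```

Forcing the `always` filter mirrors the manual test in the commit, where `warnings.simplefilter('always', DeprecationWarning)` is set before calling the deprecated method.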
Sean Owen	8aceb961c3	[SPARK-24754][ML] Minhash integer overflow ## What changes were proposed in this pull request? Use longs in calculating min hash to avoid bias due to int overflow. ## How was this patch tested? Existing tests. Author: Sean Owen <srowen@gmail.com> Closes #21750 from srowen/SPARK-24754.	2018-07-14 15:59:17 -05:00
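A rough sketch of why 32-bit arithmetic can bias a min-hash function, assuming a hash of the form `((1 + elem) * a + b) % prime`. The formula, the modulus value, and all function names here are assumptions for illustration; the commit message only states that longs replace ints. Python integers do not overflow, so the 32-bit behaviour is simulated explicitly.

```python
HASH_PRIME = 2038074743  # assumed modulus, for illustration only


def to_int32(x):
    """Wrap an arbitrary Python int to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def truncating_rem(a, n):
    """Remainder that takes the sign of the dividend, as on the JVM."""
    r = abs(a) % n
    return -r if a < 0 else r


def hash_with_int32(elem, a, b):
    """Simulated 32-bit arithmetic: the product and sum wrap around."""
    product = to_int32((1 + elem) * a)
    return truncating_rem(to_int32(product + b), HASH_PRIME)


def hash_with_int64(elem, a, b):
    """Wide arithmetic: exact for the small inputs used here."""
    return ((1 + elem) * a + b) % HASH_PRIME
```

For small element indices both versions agree. Once `(1 + elem) * a` exceeds the 32-bit range the wrapped version returns a different value, so it no longer behaves like the intended modular hash.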
hyukjinkwon	044b33b2ed	[SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+ ## What changes were proposed in this pull request? This PR proposes to make PySpark's tests compatible with NumPy 1.14+. NumPy 1.14.x introduced rather radical changes to its string representation. For example, the tests below fail: ``` ******************************************************************** File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 895, in __main__.DenseMatrix.__str__ Failed example: print(dm) Expected: DenseMatrix([[ 0., 2.], [ 1., 3.]]) Got: DenseMatrix([[0., 2.], [1., 3.]]) ****************************************************************** File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 899, in __main__.DenseMatrix.__str__ Failed example: print(dm) Expected: DenseMatrix([[ 0., 1.], [ 2., 3.]]) Got: DenseMatrix([[0., 1.], [2., 3.]]) ****************************************************************** File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 939, in __main__.DenseMatrix.toArray Failed example: m.toArray() Expected: array([[ 0., 2.], [ 1., 3.]]) Got: array([[0., 2.], [1., 3.]]) ****************************************************************** File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 324, in __main__.DenseVector.dot Failed example: dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F')) Expected: array([ 5., 11.]) Got: array([ 5., 11.]) ******************************************************************** File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 567, in __main__.SparseVector.dot Failed example: a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]])) Expected: array([ 22., 22.]) Got: array([22., 22.]) ``` See [release note](https://docs.scipy.org/doc/numpy-1.14.0/release.html#compatibility-notes). ## How was this patch tested? Manually tested: ``` $ ./run-tests --python-executables=python3.6,python2.7 --modules=pyspark-ml,pyspark-mllib Running PySpark tests. 
Output is in /.../spark/python/unit-tests.log Will test against the following Python executables: ['python3.6', 'python2.7'] Will test the following Python modules: ['pyspark-ml', 'pyspark-mllib'] Starting test(python2.7): pyspark.mllib.tests Starting test(python2.7): pyspark.ml.classification Starting test(python3.6): pyspark.mllib.tests Starting test(python2.7): pyspark.ml.clustering Finished test(python2.7): pyspark.ml.clustering (54s) Starting test(python2.7): pyspark.ml.evaluation Finished test(python2.7): pyspark.ml.classification (74s) Starting test(python2.7): pyspark.ml.feature Finished test(python2.7): pyspark.ml.evaluation (27s) Starting test(python2.7): pyspark.ml.fpm Finished test(python2.7): pyspark.ml.fpm (0s) Starting test(python2.7): pyspark.ml.image Finished test(python2.7): pyspark.ml.image (17s) Starting test(python2.7): pyspark.ml.linalg.__init__ Finished test(python2.7): pyspark.ml.linalg.__init__ (1s) Starting test(python2.7): pyspark.ml.recommendation Finished test(python2.7): pyspark.ml.feature (76s) Starting test(python2.7): pyspark.ml.regression Finished test(python2.7): pyspark.ml.recommendation (69s) Starting test(python2.7): pyspark.ml.stat Finished test(python2.7): pyspark.ml.regression (45s) Starting test(python2.7): pyspark.ml.tests Finished test(python2.7): pyspark.ml.stat (28s) Starting test(python2.7): pyspark.ml.tuning Finished test(python2.7): pyspark.ml.tuning (20s) Starting test(python2.7): pyspark.mllib.classification Finished test(python2.7): pyspark.mllib.classification (31s) Starting test(python2.7): pyspark.mllib.clustering Finished test(python2.7): pyspark.mllib.tests (260s) Starting test(python2.7): pyspark.mllib.evaluation Finished test(python3.6): pyspark.mllib.tests (266s) Starting test(python2.7): pyspark.mllib.feature Finished test(python2.7): pyspark.mllib.evaluation (21s) Starting test(python2.7): pyspark.mllib.fpm Finished test(python2.7): pyspark.mllib.feature (38s) Starting test(python2.7): 
pyspark.mllib.linalg.__init__ Finished test(python2.7): pyspark.mllib.linalg.__init__ (1s) Starting test(python2.7): pyspark.mllib.linalg.distributed Finished test(python2.7): pyspark.mllib.fpm (34s) Starting test(python2.7): pyspark.mllib.random Finished test(python2.7): pyspark.mllib.clustering (64s) Starting test(python2.7): pyspark.mllib.recommendation Finished test(python2.7): pyspark.mllib.random (15s) Starting test(python2.7): pyspark.mllib.regression Finished test(python2.7): pyspark.mllib.linalg.distributed (47s) Starting test(python2.7): pyspark.mllib.stat.KernelDensity Finished test(python2.7): pyspark.mllib.stat.KernelDensity (0s) Starting test(python2.7): pyspark.mllib.stat._statistics Finished test(python2.7): pyspark.mllib.recommendation (40s) Starting test(python2.7): pyspark.mllib.tree Finished test(python2.7): pyspark.mllib.regression (38s) Starting test(python2.7): pyspark.mllib.util Finished test(python2.7): pyspark.mllib.stat._statistics (19s) Starting test(python3.6): pyspark.ml.classification Finished test(python2.7): pyspark.mllib.tree (26s) Starting test(python3.6): pyspark.ml.clustering Finished test(python2.7): pyspark.mllib.util (27s) Starting test(python3.6): pyspark.ml.evaluation Finished test(python3.6): pyspark.ml.evaluation (30s) Starting test(python3.6): pyspark.ml.feature Finished test(python2.7): pyspark.ml.tests (234s) Starting test(python3.6): pyspark.ml.fpm Finished test(python3.6): pyspark.ml.fpm (1s) Starting test(python3.6): pyspark.ml.image Finished test(python3.6): pyspark.ml.clustering (55s) Starting test(python3.6): pyspark.ml.linalg.__init__ Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Starting test(python3.6): pyspark.ml.recommendation Finished test(python3.6): pyspark.ml.classification (71s) Starting test(python3.6): pyspark.ml.regression Finished test(python3.6): pyspark.ml.image (18s) Starting test(python3.6): pyspark.ml.stat Finished test(python3.6): pyspark.ml.stat (37s) Starting test(python3.6): 
pyspark.ml.tests Finished test(python3.6): pyspark.ml.regression (59s) Starting test(python3.6): pyspark.ml.tuning Finished test(python3.6): pyspark.ml.feature (93s) Starting test(python3.6): pyspark.mllib.classification Finished test(python3.6): pyspark.ml.recommendation (83s) Starting test(python3.6): pyspark.mllib.clustering Finished test(python3.6): pyspark.ml.tuning (29s) Starting test(python3.6): pyspark.mllib.evaluation Finished test(python3.6): pyspark.mllib.evaluation (26s) Starting test(python3.6): pyspark.mllib.feature Finished test(python3.6): pyspark.mllib.classification (43s) Starting test(python3.6): pyspark.mllib.fpm Finished test(python3.6): pyspark.mllib.clustering (81s) Starting test(python3.6): pyspark.mllib.linalg.__init__ Finished test(python3.6): pyspark.mllib.linalg.__init__ (2s) Starting test(python3.6): pyspark.mllib.linalg.distributed Finished test(python3.6): pyspark.mllib.fpm (48s) Starting test(python3.6): pyspark.mllib.random Finished test(python3.6): pyspark.mllib.feature (54s) Starting test(python3.6): pyspark.mllib.recommendation Finished test(python3.6): pyspark.mllib.random (18s) Starting test(python3.6): pyspark.mllib.regression Finished test(python3.6): pyspark.mllib.linalg.distributed (55s) Starting test(python3.6): pyspark.mllib.stat.KernelDensity Finished test(python3.6): pyspark.mllib.stat.KernelDensity (1s) Starting test(python3.6): pyspark.mllib.stat._statistics Finished test(python3.6): pyspark.mllib.recommendation (51s) Starting test(python3.6): pyspark.mllib.tree Finished test(python3.6): pyspark.mllib.regression (45s) Starting test(python3.6): pyspark.mllib.util Finished test(python3.6): pyspark.mllib.stat._statistics (21s) Finished test(python3.6): pyspark.mllib.tree (27s) Finished test(python3.6): pyspark.mllib.util (27s) Finished test(python3.6): pyspark.ml.tests (264s) ``` Author: hyukjinkwon <gurwls223@apache.org> Closes #21715 from HyukjinKwon/SPARK-24740.	2018-07-07 11:39:29 +08:00
mcteo	f997be0c31	[SPARK-24698][PYTHON] Fixed typo in pyspark.ml's Identifiable class. ## What changes were proposed in this pull request? Fixed a small typo in the code that caused 20 random characters to be added to the UID, rather than 12. Author: mcteo <mc_teo@live.ie> Closes #21675 from mcteo/SPARK-24698-fix.	2018-07-05 10:05:41 +08:00
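A slice typo of this kind is easy to reproduce with the standard `uuid` module. The sketch below assumes the UID suffix is taken from `uuid.uuid4().hex` (32 hex characters) and that the bug was a missing minus sign in the slice. Both are assumptions, since the commit message does not show the code, and the function names are hypothetical.

```python
import uuid


def random_uid_buggy(prefix):
    # hex has 32 characters; [12:] drops the first 12 and keeps 20.
    return prefix + "_" + uuid.uuid4().hex[12:]


def random_uid_fixed(prefix):
    # [-12:] keeps only the last 12 characters, as intended.
    return prefix + "_" + uuid.uuid4().hex[-12:]


def suffix_length(uid):
    """Length of the random part after the last underscore."""
    return len(uid.rsplit("_", 1)[1])
```

With these definitions, `suffix_length(random_uid_buggy("KMeans"))` is 20 and `suffix_length(random_uid_fixed("KMeans"))` is 12, matching the numbers in the commit message.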
Huaxin Gao	2224861f2f	[SPARK-24439][ML][PYTHON] Add distanceMeasure to BisectingKMeans in PySpark ## What changes were proposed in this pull request? add distanceMeasure to BisectingKMeans in Python. ## How was this patch tested? added doctest and also manually tested it. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21557 from huaxingao/spark-24439.	2018-06-28 14:07:28 -07:00
Holden Karau	a95a4af764	[SPARK-23120][PYSPARK][ML] Add basic PMML export support to PySpark ## What changes were proposed in this pull request? Adds basic PMML export support for Spark ML stages to PySpark as was previously done in Scala. Includes LinearRegressionModel as the first stage to implement. ## How was this patch tested? Doctest, the main testing work for this is on the Scala side. (TODO holden add the unittest once I finish locally). Author: Holden Karau <holden@pigscanfly.ca> Closes #21172 from holdenk/SPARK-23120-add-pmml-export-support-to-pyspark.	2018-06-28 13:20:08 -07:00
bravo-zhang	524827f062	[SPARK-14712][ML] LogisticRegressionModel.toString should summarize model ## What changes were proposed in this pull request? [SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712) spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml and override repr in pyspark. ## How was this patch tested? LogisticRegressionSuite.scala Python doctest in pyspark.ml.classification.py Author: bravo-zhang <mzhang1230@gmail.com> Closes #18826 from bravo-zhang/spark-14712.	2018-06-28 12:40:39 -07:00
Lee Dongjin	5d6a53d983	[SPARK-15064][ML] Locale support in StopWordsRemover ## What changes were proposed in this pull request? Add locale support for `StopWordsRemover`. ## How was this patch tested? [Scala|Python] unit tests. Author: Lee Dongjin <dongjin@apache.org> Closes #21501 from dongjinleekr/feature/SPARK-15064.	2018-06-12 08:16:37 -07:00
Huaxin Gao	a99d284c16	[SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC ## What changes were proposed in this pull request? add spark.ml Python API for PIC ## How was this patch tested? add doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21513 from huaxingao/spark--19826.	2018-06-11 12:15:14 -07:00
hyukjinkwon	173fe450df	[SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule in ml/__init__.py and add ImageSchema into __all__ ## What changes were proposed in this pull request? This PR attaches submodules to ml's `__init__.py` module. Also, adds `ImageSchema` into `image.py` explicitly. ## How was this patch tested? Before: ```python >>> from pyspark import ml >>> ml.image Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute 'image' >>> ml.image.ImageSchema Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'module' object has no attribute 'image' ``` ```python >>> "image" in globals() False >>> from pyspark.ml import * >>> "image" in globals() False >>> image Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'image' is not defined ``` After: ```python >>> from pyspark import ml >>> ml.image <module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'> >>> ml.image.ImageSchema <pyspark.ml.image._ImageSchema object at 0x10d973b10> ``` ```python >>> "image" in globals() False >>> from pyspark.ml import * >>> "image" in globals() True >>> image <module 'pyspark.ml.image' from #'/.../spark/python/pyspark/ml/image.pyc'> ``` Author: hyukjinkwon <gurwls223@apache.org> Closes #21483 from HyukjinKwon/SPARK-24454.	2018-06-08 09:32:11 -07:00
WeichenXu	90ae98d1ac	[SPARK-24146][PYSPARK][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## What changes were proposed in this pull request? spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## How was this patch tested? doctests Author: WeichenXu <weichen.xu@databricks.com> Closes #21265 from WeichenXu123/prefix_span_py.	2018-05-31 06:53:10 -07:00
Huaxin Gao	ec6f971dc5	[SPARK-23161][PYSPARK][ML] Add missing APIs to Python GBTClassifier ## What changes were proposed in this pull request? Add featureSubsetStrategy in GBTClassifier and GBTRegressor. Also make GBTClassificationModel inherit from JavaClassificationModel instead of prediction model so it will have numClasses. ## How was this patch tested? Add tests in doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21413 from huaxingao/spark-23161.	2018-05-30 11:04:09 -07:00
Liang-Chi Hsieh	8a13c50968	[SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved separately in metadata file in Scala. We must change it for Python for Spark 2.4.0 as well in order to keep them in sync. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21153 from viirya/SPARK-24058.	2018-05-15 16:50:09 -07:00
Lu WANG	6b94420f6c	[SPARK-24231][PYSPARK][ML] Provide Python API for evaluateEachIteration for spark.ml GBTs ## What changes were proposed in this pull request? Add evaluateEachIteration for GBTClassification and GBTRegressionModel ## How was this patch tested? doctest Author: Lu WANG <lu.wang@databricks.com> Closes #21335 from ludatabricks/SPARK-14682.	2018-05-15 14:16:31 -07:00
hyukjinkwon	f7435bec6a	[SPARK-24044][PYTHON] Explicitly print out skipped tests from unittest module ## What changes were proposed in this pull request? This PR proposes to remove duplicated dependency-checking logic and also print out skipped tests from unittests. For example, as below: ``` Skipped tests in pyspark.sql.tests with pypy: test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' ... Skipped tests in pyspark.sql.tests with python3: test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' ... ``` Currently, skipped tests are not printed out in the console. I think we should print them there. ## How was this patch tested? Manually tested. Also, fortunately, Jenkins has a good environment to test the skipped output. Author: hyukjinkwon <gurwls223@apache.org> Closes #21107 from HyukjinKwon/skipped-tests-print.	2018-04-26 15:11:42 -07:00
WeichenXu	1ca3c50fef	[SPARK-21741][ML][PYSPARK] Python API for DataFrame-based multivariate summarizer ## What changes were proposed in this pull request? Python API for DataFrame-based multivariate summarizer. ## How was this patch tested? doctest added. Author: WeichenXu <weichen.xu@databricks.com> Closes #20695 from WeichenXu123/py_summarizer.	2018-04-17 10:11:08 -07:00
WeichenXu	04614820e1	[SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API ## What changes were proposed in this pull request? Add python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting. ## How was this patch tested? UT added. Author: WeichenXu <weichen.xu@databricks.com> Closes #19627 from WeichenXu123/expose-model-list-py.	2018-04-16 11:31:24 -05:00
JBauerKogentix	9d960de081	typo rawPredicition changed to rawPrediction MultilayerPerceptronClassifier had 4 occurrences Author: JBauerKogentix <37910022+JBauerKogentix@users.noreply.github.com> Closes #21030 from JBauerKogentix/patch-1.	2018-04-11 15:52:13 -07:00
Huaxin Gao	4f1e8b9bb7	[SPARK-23871][ML][PYTHON] add python api for VectorAssembler handleInvalid ## What changes were proposed in this pull request? add python api for VectorAssembler handleInvalid ## How was this patch tested? Add doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21003 from huaxingao/spark-23871.	2018-04-10 15:41:45 -07:00
WeichenXu	adb222b957	[SPARK-23751][ML][PYSPARK] Kolmogorov-Smirnov test Python API in pyspark.ml ## What changes were proposed in this pull request? Kolmogorov-Smirnov test Python API in `pyspark.ml`. Note: the API with `CDF` is a little difficult to support in Python. We can add it in a following PR. ## How was this patch tested? doctest Author: WeichenXu <weichen.xu@databricks.com> Closes #20904 from WeichenXu123/ks-test-py.	2018-04-10 11:18:14 -07:00
Huaxin Gao	e998250588	[SPARK-23828][ML][PYTHON] PySpark StringIndexerModel should have constructor from labels ## What changes were proposed in this pull request? The Scala StringIndexerModel has an alternate constructor that will create the model from an array of label strings. Add the corresponding Python API: model = StringIndexerModel.from_labels(["a", "b", "c"]) ## How was this patch tested? Add doctest and unit test. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20968 from huaxingao/spark-23828.	2018-04-06 11:51:36 -07:00
Bryan Cutler	44a9f8e6e8	[SPARK-15009][PYTHON][FOLLOWUP] Add default param checks for CountVectorizerModel ## What changes were proposed in this pull request? Adding test for default params for `CountVectorizerModel` constructed from vocabulary. This required that the param `maxDF` be added, which was done in SPARK-23615. ## How was this patch tested? Added an explicit test for CountVectorizerModel in DefaultValuesTests. Author: Bryan Cutler <cutlerb@gmail.com> Closes #20942 from BryanCutler/pyspark-CountVectorizerModel-default-param-test-SPARK-15009.	2018-04-02 09:53:37 -07:00
Kevin Yu	3e778f5a91	[SPARK-23162][PYSPARK][ML] Add r2adj into Python API in LinearRegressionSummary ## What changes were proposed in this pull request? Adding r2adj in LinearRegressionSummary for Python API. ## How was this patch tested? Added unit tests to exercise the api calls for the summary classes in tests.py. Author: Kevin Yu <qyu@us.ibm.com> Closes #20842 from kevinyu98/spark-23162.	2018-03-26 15:45:27 -07:00
Huaxin Gao	a33655348c	[SPARK-23615][ML][PYSPARK] Add maxDF Parameter to Python CountVectorizer ## What changes were proposed in this pull request? The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also. ## How was this patch tested? add test Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20777 from huaxingao/spark-23615.	2018-03-23 15:58:48 -07:00
Bryan Cutler	cb43bbe136	[SPARK-21685][PYTHON][ML] PySpark Params isSet state should not change after transform ## What changes were proposed in this pull request? Currently when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`. This incorrectly changes the state of the Param as it should still be marked as a default value only. ## How was this patch tested? Added a new test to verify that when transferring Params to Java, default params have their state preserved. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18982 from BryanCutler/pyspark-ml-param-to-java-defaults-SPARK-21685.	2018-03-23 11:42:40 -07:00
Bryan Cutler	8a72734f33	[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list ## What changes were proposed in this pull request? Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class `_CountVectorizerParams` to allow the Python model to also own the parameters. This now matches the Scala class hierarchy. ## How was this patch tested? Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.	2018-03-16 11:42:57 -07:00
Benjamin Peterson	7013eea11c	[SPARK-23522][PYTHON] always use sys.exit over builtin exit The exit() builtin is only for interactive use. Applications should use sys.exit(). ## What changes were proposed in this pull request? All usage of the builtin `exit()` function is replaced by `sys.exit()`. ## How was this patch tested? I ran `python/run-tests`. Author: Benjamin Peterson <benjamin@python.org> Closes #20682 from benjaminp/sys-exit.	2018-03-08 20:38:34 +09:00
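The distinction can be shown with a small script entry point. This is a generic sketch, not code from the patch: `parse_args_or_exit` is a hypothetical helper. The `exit()` builtin is installed by the `site` module for interactive sessions and may be absent (for example under `python -S`), whereas `sys.exit()` is always available and raises `SystemExit`.

```python
import sys


def parse_args_or_exit(argv):
    """Return the single expected argument, or exit with a non-zero status."""
    if len(argv) != 2:
        print("Usage: example <file>", file=sys.stderr)
        # sys.exit raises SystemExit, which callers and tests can catch.
        sys.exit(-1)
    return argv[1]


if __name__ == "__main__":
    print(parse_args_or_exit(sys.argv))
```

Because `sys.exit()` works by raising `SystemExit`, cleanup code in `finally` blocks still runs and a test harness can assert on the exit code.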
Yogesh Garg	7706eea6a8	[SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests The `__del__` method that explicitly detaches the object was moved from `JavaParams` to `JavaWrapper` class, this way model summaries could also be garbage collected in Java. A test case was added to make sure that relevant error messages are thrown after the objects are deleted. I ran pyspark tests against `pyspark-ml` module `./python/run-tests --python-executables=$(which python) --modules=pyspark-ml` Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com> Closes #20724 from yogeshg/java_wrapper_memory.	2018-03-05 15:53:10 -08:00
Marco Gaido	e836c27ce0	[SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API ## What changes were proposed in this pull request? The PR adds the `distanceMeasure` param to ClusteringEvaluator in the Python API. This allows the user to specify `cosine` as distance measure in addition to the default `squaredEuclidean`. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20627 from mgaido91/SPARK-23217_python.	2018-02-21 12:39:36 -06:00
Shintaro Murakami	d5ed2108d3	[SPARK-23381][CORE] Murmur3 hash generates a different value from other implementations ## What changes were proposed in this pull request? Murmur3 hash generates a different value from the original and other implementations (such as the Scala standard library and Guava) when the length of a byte array is not a multiple of 4. ## How was this patch tested? Added a unit test. Note: When we merge this PR, please give all the credits to Shintaro Murakami. Author: Shintaro Murakami <mrkm4ntrgmail.com> Author: gatorsmile <gatorsmile@gmail.com> Author: Shintaro Murakami <mrkm4ntr@gmail.com> Closes #20630 from gatorsmile/pr-20568.	2018-02-16 17:17:55 -08:00
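The effect of tail handling on a 32-bit Murmur3 hash can be sketched as follows. `murmur3_standard` follows the reference x86_32 algorithm, which packs the 1 to 3 leftover bytes into one word and XORs it in. `murmur3_bytewise_tail` is an assumed reconstruction of a variant that mixes each leftover byte as its own block. The commit message does not show the original code, so treat the second function and all names as illustrative.

```python
MASK = 0xFFFFFFFF


def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK


def mix_k1(k1):
    k1 = (k1 * 0xCC9E2D51) & MASK
    k1 = rotl32(k1, 15)
    return (k1 * 0x1B873593) & MASK


def mix_h1(h1, k1):
    h1 = rotl32(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK


def fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK
    return h1 ^ (h1 >> 16)


def hash_body(data, seed):
    """Mix all complete 4-byte little-endian blocks; return state and tail."""
    h1 = seed & MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = mix_h1(h1, mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, data[aligned:]


def murmur3_standard(data, seed=0):
    h1, tail = hash_body(data, seed)
    if tail:
        # Reference behaviour: pack the tail into one word, XOR only.
        h1 ^= mix_k1(int.from_bytes(tail, "little"))
    return fmix(h1, len(data))


def murmur3_bytewise_tail(data, seed=0):
    h1, tail = hash_body(data, seed)
    for b in tail:
        # Assumed variant: each tail byte (sign-extended) is a full block.
        signed = b - 256 if b > 127 else b
        h1 = mix_h1(h1, mix_k1(signed & MASK))
    return fmix(h1, len(data))
```

The two functions agree whenever the input length is a multiple of 4, because the tail is empty. They diverge otherwise, which is the situation the commit describes.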
Marco Gaido	0783876c81	[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans ## What changes were proposed in this pull request? SPARK-22119 introduced a new parameter for KMeans, i.e. `distanceMeasure`. The PR adds it also to the Python interface. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #20520 from mgaido91/SPARK-23344.	2018-02-10 10:46:45 -06:00
hyukjinkwon	715047b02d	[SPARK-23256][ML][PYTHON] Add columnSchema method to PySpark image reader ## What changes were proposed in this pull request? This PR proposes to add `columnSchema` in Python side too. ```python >>> from pyspark.ml.image import ImageSchema >>> ImageSchema.columnSchema.simpleString() 'struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>' ``` ## How was this patch tested? Manually tested and unittest was added in `python/pyspark/ml/tests.py`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20475 from HyukjinKwon/SPARK-23256.	2018-02-04 17:53:31 +09:00
Nick Pentreath	a8a3e9b7cf	Revert "[SPARK-22797][PYSPARK] Bucketizer support multi-column" This reverts commit `c22eaa94e8`.	2018-01-26 23:48:02 +02:00
Zheng RuiFeng	c22eaa94e8	[SPARK-22797][PYSPARK] Bucketizer support multi-column ## What changes were proposed in this pull request? Bucketizer support multi-column in the python side ## How was this patch tested? existing tests and added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19892 from zhengruifeng/20542_py.	2018-01-26 12:28:27 +02:00
Bryan Cutler	39ee2acf96	[SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala ## What changes were proposed in this pull request? This syncs the ML Python API with Scala for differences found after the 2.3 QA audit. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #20354 from BryanCutler/pyspark-ml-doc-sync-23163.	2018-01-25 01:48:11 -08:00
hyukjinkwon	45ad97df87	[SPARK-23132][PYTHON][ML] Run doctests in ml.image when testing ## What changes were proposed in this pull request? This PR proposes to actually run the doctests in `ml/image.py`. ## How was this patch tested? doctests in `python/pyspark/ml/image.py`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20294 from HyukjinKwon/trigger-image.	2018-01-18 07:30:54 +09:00
WeichenXu	a7d98d53ce	[SPARK-23008][ML][FOLLOW-UP] mark OneHotEncoder python API deprecated ## What changes were proposed in this pull request? mark OneHotEncoder python API deprecated ## How was this patch tested? N/A Author: WeichenXu <weichen.xu@databricks.com> Closes #20241 from WeichenXu123/mark_ohe_deprecated.	2018-01-12 11:27:02 +02:00
WeichenXu	b5042d75c2	[SPARK-23008][ML] OneHotEncoderEstimator python API ## What changes were proposed in this pull request? OneHotEncoderEstimator python API. ## How was this patch tested? doctest Author: WeichenXu <weichen.xu@databricks.com> Closes #20209 from WeichenXu123/ohe_py.	2018-01-11 16:20:30 -08:00
sethah	70bcc9d5ae	[SPARK-22993][ML] Clarify HasCheckpointInterval param doc ## What changes were proposed in this pull request? Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context. ## How was this patch tested? No tests necessary, just a doc update. Author: sethah <shendrickson@cloudera.com> Closes #20188 from sethah/als_checkpoint_doc.	2018-01-09 23:32:47 -08:00
Sean Owen	c284c4e1f6	[MINOR] Fix a bunch of typos	2018-01-02 07:10:19 +09:00
Nick Pentreath	028ee40165	[SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat numeric columns as categorical Previously, `FeatureHasher` always treats numeric type columns as numbers and never as categorical features. It is quite common to have categorical features represented as numbers or codes in data sources. In order to hash these features as categorical, users must first explicitly convert them to strings which is cumbersome. Add a new param `categoricalCols` which specifies the numeric columns that should be treated as categorical features. ## How was this patch tested? New unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #19991 from MLnick/hasher-num-cat.	2017-12-31 14:51:38 +02:00
Bago Amirbekian	816963043a	[SPARK-22734][ML][PYSPARK] Added Python API for VectorSizeHint. ## What changes were proposed in this pull request? Python API for VectorSizeHint Transformer. ## How was this patch tested? doc-tests. Author: Bago Amirbekian <bago@databricks.com> Closes #20112 from MrBago/vectorSizeHint-PythonAPI.	2017-12-29 19:45:14 -08:00
Bago Amirbekian	30fcdc0380	[SPARK-22922][ML][PYSPARK] Pyspark portion of the fit-multiple API ## What changes were proposed in this pull request? Adding fitMultiple API to `Estimator` with default implementation. Also update the ml.tuning meta-estimators to use this API. ## How was this patch tested? Unit tests. Author: Bago Amirbekian <bago@databricks.com> Closes #20058 from MrBago/python-fitMultiple.	2017-12-29 16:31:25 -08:00
Yanbo Liang	fb0562f346	[SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss. ## What changes were proposed in this pull request? Expose Python API for _LinearRegression_ with _huber_ loss. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19994 from yanboliang/spark-22810.	2017-12-20 17:51:42 -08:00
Bago Amirbekian	0c8fca4608	[SPARK-22811][PYSPARK][ML] Fix pyspark.ml.tests failure when Hive is not available. ## What changes were proposed in this pull request? pyspark.ml.tests is missing a py4j import. I've added the import and fixed the test that uses it. This test was only failing when testing without Hive. ## How was this patch tested? Existing tests. Author: Bago Amirbekian <bago@databricks.com> Closes #19997 from MrBago/fix-ImageReaderTest2.	2017-12-16 10:57:35 +09:00
hyukjinkwon	aa4cf2b19e	[SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages ## What changes were proposed in this pull request? Calling `ImageSchema.readImages` multiple times as below in PySpark shell: ```python from pyspark.ml.image import ImageSchema data_path = 'data/mllib/images/kittens' _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect() _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect() ``` throws an error as below: ``` ... org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------ java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details. ... at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) ... at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) ... at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180) ... 
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:100) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:88) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293) at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79) at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70) at 
org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:574) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:593) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348) at org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253) ... Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 121 more Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /.../spark/metastore_db. ... Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages dropImageFailures, float(sampleRatio), seed) File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' ``` It seems we had better stick to `SparkSession.builder.getOrCreate()` like: `51620e288b/python/pyspark/sql/streaming.py (L329)` `dc5d34d8dc/python/pyspark/sql/column.py (L541)` `33d43bf1b6/python/pyspark/sql/readwriter.py (L105)` ## How was this patch tested? 
This was tested as below in PySpark shell: ```python from pyspark.ml.image import ImageSchema data_path = 'data/mllib/images/kittens' _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect() _ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect() ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #19845 from HyukjinKwon/SPARK-22651.	2017-12-02 11:55:43 +09:00
hyukjinkwon	92cfbeeb5c	[SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and fix image test failure in Python 3.6.0 / NumPy 1.13.3 ## What changes were proposed in this pull request? The image test seems to fail in Python 3.6.0 / NumPy 1.13.3. I manually tested as below: ``` ====================================================================== ERROR: test_read_images (pyspark.ml.tests.ImageReaderTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/.../spark/python/pyspark/ml/tests.py", line 1831, in test_read_images self.assertEqual(ImageSchema.toImage(array, origin=first_row[0]), first_row) File "/.../spark/python/pyspark/ml/image.py", line 149, in toImage data = bytearray(array.astype(dtype=np.uint8).ravel()) TypeError: only integer scalar arrays can be converted to a scalar index ---------------------------------------------------------------------- Ran 1 test in 7.606s ``` To be clear, I think the error seems to come from NumPy - `75b2d5d427/numpy/core/src/multiarray/number.c (L947)` For a smaller scope: ```python >>> import numpy as np >>> bytearray(np.array([1]).astype(dtype=np.uint8)) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: only integer scalar arrays can be converted to a scalar index ``` In Python 2.7 / NumPy 1.13.1, it prints: ``` bytearray(b'\x01') ``` So, here, I simply worked around it by converting it to bytes as below: ```python >>> bytearray(np.array([1]).astype(dtype=np.uint8).tobytes()) bytearray(b'\x01') ``` Also, while looking into it again, I realised a few arguments could be quite confusing, for example, `Row` that needs some specific attributes and `numpy.ndarray`. I added a few type checks and some tests accordingly. So, it shows an error message as below: ``` TypeError: array argument should be numpy.ndarray; however, it got [<class 'str'>]. ``` ## How was this patch tested? Manually tested with `./python/run-tests`. 
And also: ``` PYSPARK_PYTHON=python3 SPARK_TESTING=1 bin/pyspark pyspark.ml.tests ImageReaderTest ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #19835 from HyukjinKwon/SPARK-21866-followup.	2017-11-30 10:26:55 +09:00
Ilya Matiach	1edb3175d8	[SPARK-21866][ML][PYSPARK] Adding spark image reader ## What changes were proposed in this pull request? Adding spark image reader, an implementation of schema for representing images in spark DataFrames The code is taken from the spark package located here: (https://github.com/Microsoft/spark-images) Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866) Please see mailing list for SPIP vote and approval information: (http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html) # Background and motivation As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers. This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions. This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines. 
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead. ## How was this patch tested? Unit tests in scala ImageSchemaSuite, unit tests in python Author: Ilya Matiach <ilmat@microsoft.com> Author: hyukjinkwon <gurwls223@gmail.com> Closes #19439 from imatiach-msft/ilmat/spark-images.	2017-11-22 15:45:45 -08:00
WeichenXu	2d868d9398	[SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API ## What changes were proposed in this pull request? Add Python API for VectorIndexerModel support for handling unseen categories via handleInvalid. ## How was this patch tested? doctest added. Author: WeichenXu <weichen.xu@databricks.com> Closes #19753 from WeichenXu123/vector_indexer_invalid_py.	2017-11-21 10:53:53 -08:00
WeichenXu	20eb95e5e9	[SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark. ## How was this patch tested? Test updated. Author: WeichenXu <weichen.xu@databricks.com> Closes #19122 from WeichenXu123/par-ml-tuning-py.	2017-10-27 15:19:27 -07:00
hyukjinkwon	d9798c834f	[SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs ## What changes were proposed in this pull request? This PR proposes to mark the existing warnings as `DeprecationWarning` and print out warnings for deprecated functions. This could actually be useful for Spark app developers. I use (old) PyCharm and this IDE can detect this specific `DeprecationWarning` in some cases: Before <img src="https://user-images.githubusercontent.com/6477701/31762664-df68d9f8-b4f6-11e7-8773-f0468f70a2cc.png" height="45" /> After <img src="https://user-images.githubusercontent.com/6477701/31762662-de4d6868-b4f6-11e7-98dc-3c8446a0c28a.png" height="70" /> For console usage, `DeprecationWarning` is usually disabled (see https://docs.python.org/2/library/warnings.html#warning-categories and https://docs.python.org/3/library/warnings.html#warning-categories): ``` >>> import warnings >>> filter(lambda f: f[2] == DeprecationWarning, warnings.filters) [('ignore', <_sre.SRE_Pattern object at 0x10ba58c00>, <type 'exceptions.DeprecationWarning'>, <_sre.SRE_Pattern object at 0x10bb04138>, 0), ('ignore', None, <type 'exceptions.DeprecationWarning'>, None, 0)] ``` so, it won't actually mess up the terminal much unless it is intended. If it is intentionally enabled, it shows as below: ``` >>> import warnings >>> warnings.simplefilter('always', DeprecationWarning) >>> >>> from pyspark.sql import functions >>> functions.approxCountDistinct("a") .../spark/python/pyspark/sql/functions.py:232: DeprecationWarning: Deprecated in 2.1, use approx_count_distinct instead. "Deprecated in 2.1, use approx_count_distinct instead.", DeprecationWarning) ... ``` These instances were found by: ``` cd python/pyspark grep -r "Deprecated" . grep -r "deprecated" . grep -r "deprecate" . ``` ## How was this patch tested? Manually tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19535 from HyukjinKwon/deprecated-warning.	2017-10-24 12:44:47 +09:00
Nick Pentreath	98057583dd	[SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods). The subset methods take a `DataFrame` as input, containing ids in the column specified by the param `userCol` or `itemCol`. The model will generate recommendations for each _unique_ id in this input dataframe. ## How was this patch tested? New unit tests in `ALSSuite` and Python doctests in `ALS`. Ran updated examples locally. Author: Nick Pentreath <nickp@za.ibm.com> Closes #18748 from MLnick/als-recommend-df.	2017-10-09 10:42:33 +02:00
Marco Gaido	5ac96854cc	[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator ## What changes were proposed in this pull request? Added Python interface for ClusteringEvaluator ## How was this patch tested? Manual test, e.g. the example Python code in the comments. cc yanboliang Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19204 from mgaido91/SPARK-21981.	2017-09-22 13:12:33 +08:00
Yanbo Liang	2f962422a2	[MINOR][ML] Remove unnecessary default value setting for evaluators. ## What changes were proposed in this pull request? Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19262 from yanboliang/evaluation.	2017-09-19 22:22:35 +08:00
Yanbo Liang	c76153cc7d	[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19220 from yanboliang/SPARK-18608.	2017-09-14 14:09:44 +08:00
Ming Jiang	8d8641f122	[SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## What changes were proposed in this pull request? Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## How was this patch tested? Added unit test Author: Ming Jiang <mjiang@fanatics.com> Author: Ming Jiang <jmwdpk@gmail.com> Author: jmwdpk <jmwdpk@gmail.com> Closes #19185 from jmwdpk/SPARK-21854.	2017-09-14 13:53:28 +08:00
Ajay Saini	720c94fe77	[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark ## What changes were proposed in this pull request? Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism. I am taking over PR #18281 because the original author is busy but we need to merge this PR soon. After this has been merged, we can close #18281. ## How was this patch tested? Test suite added. Author: Ajay Saini <ajays725@gmail.com> Author: WeichenXu <weichen.xu@databricks.com> Closes #19110 from WeichenXu123/spark-21027.	2017-09-12 10:02:27 -07:00
Chunsheng Ji	4bab8f5996	[SPARK-21856] Add probability and rawPrediction to MLPC for Python. Probability and rawPrediction have been added to MultilayerPerceptronClassifier for Python. Added a unit test. Author: Chunsheng Ji <chunsheng.ji@gmail.com> Closes #19172 from chunshengji/SPARK-21856.	2017-09-11 16:52:48 +08:00
Xin Ren	31c74fec24	[SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API https://issues.apache.org/jira/browse/SPARK-19866 ## What changes were proposed in this pull request? Add Python API for findSynonymsArray matching Scala API. ## How was this patch tested? Manual test `./python/run-tests --python-executables=python2.7 --modules=pyspark-ml` Author: Xin Ren <iamshrek@126.com> Author: Xin Ren <renxin.ubc@gmail.com> Author: Xin Ren <keypointt@users.noreply.github.com> Closes #17451 from keypointt/SPARK-19866.	2017-09-08 12:09:00 -07:00
hyukjinkwon	8598d03a00	[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe ## What changes were proposed in this pull request? This PR proposes to support unicode strings in Param methods in ML and in other missed functions in DataFrame. For example, this causes a `TypeError` in Python 2.x when param is a unicode string: ```python >>> from pyspark.ml.classification import LogisticRegression >>> lr = LogisticRegression() >>> lr.hasParam("threshold") True >>> lr.hasParam(u"threshold") Traceback (most recent call last): ... raise TypeError("hasParam(): paramName must be a string") TypeError: hasParam(): paramName must be a string ``` This PR is based on https://github.com/apache/spark/pull/13036 ## How was this patch tested? Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`. Author: hyukjinkwon <gurwls223@gmail.com> Author: sethah <seth.hendrickson16@gmail.com> Closes #17096 from HyukjinKwon/SPARK-15243.	2017-09-08 11:57:33 -07:00
Weichen Xu	d6b30edd49	[SPARK-12664][ML] Expose probability in mlp model ## What changes were proposed in this pull request? Modify MLP model to inherit `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data. ## How was this patch tested? Test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.	2017-08-22 21:16:34 -07:00
Bryan Cutler	41bb1ddc63	[SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator ## What changes were proposed in this pull request? Added call to copy values of Params from Estimator to Model after fit in PySpark ML. This will copy values for any params that are also defined in the Model. Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves. ## How was this patch tested? Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.	2017-08-22 17:40:50 -07:00
Nick Pentreath	988b84d7ed	[SPARK-21468][PYSPARK][ML] Python API for FeatureHasher Add Python API for `FeatureHasher` transformer. ## How was this patch tested? New doc test. Author: Nick Pentreath <nickp@za.ibm.com> Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.	2017-08-21 14:35:38 +02:00
Ajay Saini	35db3b9fe3	[SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages ## What changes were proposed in this pull request? Implemented a Python-only persistence framework for pipelines containing stages that cannot be saved using Java. ## How was this patch tested? Created a custom Python-only UnaryTransformer, included it in a Pipeline, and saved/loaded the pipeline. The loaded pipeline was compared against the original using _compare_pipelines() in tests.py. Author: Ajay Saini <ajays725@gmail.com> Closes #18888 from ajaysaini725/PythonPipelines.	2017-08-11 23:57:08 -07:00
WeichenXu	b35660dd0e	[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #18797 from WeichenXu123/update-breeze.	2017-08-09 14:44:10 +08:00
Yanbo Liang	f763d8464b	[SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation. ## What changes were proposed in this pull request? PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```. ## How was this patch tested? ``` from pyspark.ml.regression import GeneralizedLinearRegression dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt") glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3) model = glr.fit(dataset) model.summary ``` Before this PR: ![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png) After this PR: ![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #18870 from yanboliang/spark-19270.	2017-08-08 08:43:58 +08:00
Ajay Saini	fdcee028af	[SPARK-21542][ML][PYTHON] Python persistence helper functions ## What changes were proposed in this pull request? Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters. ## How was this patch tested? Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure it is the same. This test was both done in the Python REPL and implemented in the unit tests. Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master. Author: Ajay Saini <ajays725@gmail.com> Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.	2017-08-07 17:03:20 -07:00
Ajay Saini	1347b2a697	[SPARK-21633][ML][PYTHON] UnaryTransformer in Python ## What changes were proposed in this pull request? Implemented UnaryTransformer in Python. ## How was this patch tested? This patch was tested by creating a MockUnaryTransformer class in the unit tests that extends UnaryTransformer and testing that the transform function produced correct output. Author: Ajay Saini <ajays725@gmail.com> Closes #18746 from ajaysaini725/AddPythonUnaryTransformer.	2017-08-04 01:01:32 -07:00
zero323	845c039ceb	[SPARK-20601][ML] Python API for Constrained Logistic Regression ## What changes were proposed in this pull request? Python API for Constrained Logistic Regression based on #17922 , thanks for the original contribution from zero323 . ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Author: Yanbo Liang <ybliang8@gmail.com> Closes #18759 from yanboliang/SPARK-20601.	2017-08-02 18:10:26 +08:00
Zheng RuiFeng	253a07e43a	[SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold ## What changes were proposed in this pull request? GBTs inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #18612 from zhengruifeng/override_HasXXX.	2017-08-01 21:34:26 +08:00
Yan Facai (颜发才)	a5a3189974	[SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? Add `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? + [x] add a unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.	2017-07-28 10:10:35 +08:00
Ajay Saini	7047f49f45	[SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest ## What changes were proposed in this pull request? Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidationSplit persistence to pyspark. ## How was this patch tested? Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms. Author: Ajay Saini <ajays725@gmail.com> Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.	2017-07-17 10:07:32 -07:00
Yanbo Liang	69e5282d3c	[SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. ## What changes were proposed in this pull request? ```RFormula``` should handle invalid values for both the features and label columns. #18496 only handles invalid values in the features column. This PR adds handling of invalid values for the label column, plus test cases. ## How was this patch tested? Add test cases. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18613 from yanboliang/spark-20307.	2017-07-15 20:56:38 +08:00
Zheng RuiFeng	d2d2a5de18	[SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## What changes were proposed in this pull request? 1, HasHandleInvalid supports override 2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid ## How was this patch tested? existing tests [JIRA](https://issues.apache.org/jira/browse/SPARK-18619) Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.	2017-07-12 22:09:03 +08:00
actuaryzhang	4852b7d447	[SPARK-21310][ML][PYSPARK] Expose offset in PySpark ## What changes were proposed in this pull request? Add offset to PySpark in GLM as in #16699. ## How was this patch tested? Python test Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18534 from actuaryzhang/pythonOffset.	2017-07-05 18:41:00 +08:00
Yanbo Liang	c19680be1c	[SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data ## What changes were proposed in this pull request? This PR is to maintain API parity with changes made in SPARK-17498 to support a new option 'keep' in StringIndexer to handle unseen labels or NULL values with PySpark. Note: This is an updated version of #17237; the primary author of this PR is VinceShieh. ## How was this patch tested? Unit tests. Author: VinceShieh <vincent.xie@intel.com> Author: Yanbo Liang <ybliang8@gmail.com> Closes #18453 from yanboliang/spark-19852.	2017-07-02 16:17:03 +08:00
Ruifeng Zheng	e0b047eafe	[SPARK-18518][ML] HasSolver supports override ## What changes were proposed in this pull request? 1, make param support non-final with `finalFields` option 2, generate `HasSolver` with `finalFields = false` 3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver` ## How was this patch tested? existing tests Author: Ruifeng Zheng <ruifengz@foxmail.com> Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16028 from zhengruifeng/param_non_final.	2017-07-01 15:37:41 +08:00
Joseph K. Bradley	cc67bd5732	[SPARK-20929][ML] LinearSVC should use its own threshold param ## What changes were proposed in this pull request? LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <joseph@databricks.com> Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.	2017-06-19 23:04:17 -07:00
actuaryzhang	ff5676b01f	[SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula ## What changes were proposed in this pull request? PySpark supports stringIndexerOrderType in RFormula as in #17967. ## How was this patch tested? docstring test Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18122 from actuaryzhang/PythonRFormula.	2017-05-31 01:02:19 +08:00
Yan Facai (颜发才)	139da116f1	[SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.	2017-05-25 21:40:39 +08:00
Yanbo Liang	913a6bfe4b	[SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18089 from yanboliang/spark-19281.	2017-05-25 20:15:15 +08:00
zero323	1816eb3bef	[SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.	2017-05-24 19:57:44 +08:00
Peng	9afcf127d3	[SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng <peng.meng@intel.com> Closes #18068 from mpjlu/moreTest.	2017-05-24 19:54:17 +08:00
Bago Amirbekian	9434280cfd	[SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators Changes: pyspark.ml Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass through lists of param maps to the underlying estimators so that those estimators can handle parallelization when appropriate (eg distributed hyper parameter tuning). Testing: Existing unit tests. Author: Bago Amirbekian <bago@databricks.com> Closes #18077 from MrBago/delegate_params.	2017-05-23 20:56:01 -07:00
Peng	cfca01136b	[SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng <peng.meng@intel.com> Closes #18062 from mpjlu/spark-20764.	2017-05-22 22:42:37 +08:00
Wayne Zhang	0f2f56c37b	[SPARK-20736][PYTHON] PySpark StringIndexer supports StringOrderType ## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in #17879. Author: Wayne Zhang <actuaryzhang@uber.com> Closes #17978 from actuaryzhang/PythonStringIndexer.	2017-05-21 16:51:55 -07:00
Yanbo Liang	dbe81633a7	[SPARK-20501][ML] ML 2.2 QA: New Scala APIs, docs ## What changes were proposed in this pull request? Review new Scala APIs introduced in 2.2. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17934 from yanboliang/spark-20501.	2017-05-15 21:21:54 -07:00
Yanbo Liang	d4022d4951	[SPARK-20707][ML] ML deprecated APIs should be removed in major release. ## What changes were proposed in this pull request? Before 2.2, MLlib removed APIs that had been deprecated in the previous feature/minor release. From Spark 2.2 on, deprecated APIs are removed only in a major release, so the corresponding annotations need to change to tell users that these APIs will be removed in 3.0. This PR also fixes a bug in the ML docs: the original docs did not show the deprecation annotations in the ```MLWriter``` and ```MLReader``` related classes. Before: ![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png) After: ![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png) ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17946 from yanboliang/spark-20707.	2017-05-16 10:08:23 +08:00
Yanbo Liang	0698e6c88c	[SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit `b8733e0ad9`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17944 from yanboliang/spark-20606-revert.	2017-05-11 14:48:13 +08:00
zero323	804949c6bf	[SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17891 from zero323/SPARK-20631.	2017-05-10 16:57:52 +08:00
Yanbo Liang	b8733e0ad9	[SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17867 from yanboliang/spark-20606.	2017-05-09 17:30:37 +08:00
Nick Pentreath	e300a5a145	[SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUsers,Items Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems` ## How was this patch tested? New doc tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17622 from MLnick/SPARK-20300-pyspark-recall.	2017-05-02 10:49:13 +02:00
Yanbo Liang	dbb06c689c	[MINOR][ML] Fix some PySpark & SparkR flaky tests ## What changes were proposed in this pull request? Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. I don’t think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue in #17746 when we upgraded breeze to 0.13.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17757 from yanboliang/flaky-test.	2017-04-26 21:34:18 +08:00
Yanbo Liang	67eef47acf	[SPARK-20449][ML] Upgrade breeze version to 0.13.1 ## What changes were proposed in this pull request? Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17746 from yanboliang/spark-20449.	2017-04-25 17:10:41 +00:00
Liang-Chi Hsieh	1a52a62377	[SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation ## What changes were proposed in this pull request? The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it. ## How was this patch tested? Python unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17494 from viirya/correlation-python-api.	2017-04-07 11:00:10 +02:00
Liang-Chi Hsieh	12206058e8	[SPARK-20214][ML] Make sure converted csc matrix has sorted indices ## What changes were proposed in this pull request? `_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that: from scipy.sparse import lil_matrix lil = lil_matrix((4, 1)) lil[1, 0] = 1 lil[3, 0] = 2 _convert_to_vector(lil.todok()) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector return SparseVector(l.shape[0], csc.indices, csc.data) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__ % (self.indices[i], self.indices[i + 1])) TypeError: Indices 3 and 1 are not strictly increasing A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices: >>> from scipy.sparse import lil_matrix >>> lil = lil_matrix((4, 1)) >>> lil[1, 0] = 1 >>> lil[3, 0] = 2 >>> dok = lil.todok() >>> csc = dok.tocsc() >>> csc.has_sorted_indices 0 >>> csc.indices array([3, 1], dtype=int32) I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17532 from viirya/make-sure-sorted-indices.	2017-04-05 17:46:44 -07:00
Bago Amirbekian	a5c87707ea	[SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest ## What changes were proposed in this pull request? A pyspark wrapper for spark.ml.stat.ChiSquareTest ## How was this patch tested? unit tests doctests Author: Bago Amirbekian <bago@databricks.com> Closes #17421 from MrBago/chiSquareTestWrapper.	2017-03-28 19:19:16 -07:00
zero323	0bc8847aa2	[SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth ## What changes were proposed in this pull request? - Add `HasSupport` and `HasConfidence` `Params`. - Add new module `pyspark.ml.fpm`. - Add `FPGrowth` / `FPGrowthModel` wrappers. - Provide tests for new features. ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17218 from zero323/SPARK-19281.	2017-03-26 16:49:27 -07:00
Nick Pentreath	d9f4ce6943	[SPARK-15040][ML][PYSPARK] Add Imputer to PySpark Add Python wrapper for `Imputer` feature transformer. ## How was this patch tested? New doc tests and tweak to PySpark ML `tests.py` Author: Nick Pentreath <nickp@za.ibm.com> Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.	2017-03-24 08:01:15 -07:00
Yanbo Liang	81303f7ca7	[SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution. ## What changes were proposed in this pull request? PySpark ```GeneralizedLinearRegression``` supports tweedie distribution. ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17146 from yanboliang/spark-19806.	2017-03-08 02:09:36 -08:00
Bryan Cutler	44281ca81d	[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.	2017-03-03 16:43:45 -08:00
Mark Grover	d2a879762a	[SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast ## What changes were proposed in this pull request? Updates the doc string to match up with the code i.e. say dropLast instead of includeFirst ## How was this patch tested? Not much, since it's a doc-like change. Will run unit tests via Jenkins job. Author: Mark Grover <mark@apache.org> Closes #17127 from markgrover/spark_19734.	2017-03-01 22:57:34 -08:00
Yun Ni	3bd8ddf7c3	[MINOR][ML] Fix comments in LSH Examples and Python API ## What changes were proposed in this pull request? Remove `org.apache.spark.examples.` in Add slash in one of the python doc. ## How was this patch tested? Run examples using the commands in the comments. Author: Yun Ni <yunn@uber.com> Closes #17104 from Yunni/yunn_minor.	2017-03-01 22:55:13 -08:00
Nick Pentreath	b405466513	[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation. The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible. The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)). ## How was this patch tested? New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics. Author: Nick Pentreath <nickp@za.ibm.com> Closes #12896 from MLnick/SPARK-14489-als-nan.	2017-02-28 16:17:35 +02:00
Bryan Cutler	2f69e3f60f	[SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation ## What changes were proposed in this pull request? Fixed the PySpark Params.copy method to behave like the Scala implementation. The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map. ## How was this patch tested? Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.	2017-02-23 18:05:58 -08:00
Yun Ni	08c1972a06	[SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing ## What changes were proposed in this pull request? This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang's PR #15768, with conflicts resolved and updates made for API changes on the Scala side. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH. ## How was this patch tested? API and examples are tested using spark-submit: `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py` `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py` User guide changes are generated and manually inspected: `SKIP_API=1 jekyll build` Author: Yun Ni <yunn@uber.com> Author: Yanbo Liang <ybliang8@gmail.com> Author: Yunni <Euler57721@gmail.com> Closes #16715 from Yunni/spark-18080.	2017-02-15 16:26:05 -08:00
VinceShieh	6eca21ba88	[SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark ## What changes were proposed in this pull request? This PR is to document the changes on QuantileDiscretizer in pyspark for PR: https://github.com/apache/spark/pull/15428 ## How was this patch tested? No test needed Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #16922 from VinceShieh/spark-19590.	2017-02-15 10:12:07 -08:00
zero323	5e7cd3322b	[SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util ## What changes were proposed in this pull request? Add missing `warnings` import. ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #16846 from zero323/SPARK-19506.	2017-02-13 09:26:49 -08:00
zero323	fab0d62a71	[SPARK-19467][ML][PYTHON] Remove cyclic imports from pyspark.ml.pipeline ## What changes were proposed in this pull request? Remove cyclic imports between `pyspark.ml.pipeline` and `pyspark.ml`. ## How was this patch tested? Existing unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #16814 from zero323/SPARK-19467.	2017-02-06 18:12:20 -08:00
Zheng RuiFeng	317fa75081	[SPARK-19421][ML][PYSPARK] Remove numClasses and numFeatures methods in LinearSVC ## What changes were proposed in this pull request? Methods `numClasses` and `numFeatures` in LinearSVCModel are already usable by inheriting `JavaClassificationModel` we should not explicitly add them. ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16727 from zhengruifeng/nits_in_linearSVC.	2017-02-05 19:06:51 -08:00
Joseph K. Bradley	1d5d2a9d09	[SPARK-19389][ML][PYTHON][DOC] Minor doc fixes for ML Python Params and LinearSVC ## What changes were proposed in this pull request? * Removed Since tags in Python Params since they are inherited by other classes * Fixed doc links for LinearSVC ## How was this patch tested? * doc tests * generating docs locally and checking manually Author: Joseph K. Bradley <joseph@databricks.com> Closes #16723 from jkbradley/pyparam-fix-doc.	2017-02-02 11:58:46 -08:00
Bryan Cutler	57d70d26c8	[SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays ## What changes were proposed in this pull request? Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed. ## How was this patch tested? Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model. Author: Bryan Cutler <cutlerb@gmail.com> Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.	2017-01-31 15:42:36 -08:00
wm624@hotmail.com	bb1a1fe05e	[SPARK-19336][ML][PYSPARK] LinearSVC Python API ## What changes were proposed in this pull request? Add Python API for the newly added LinearSVC algorithm. ## How was this patch tested? Add new doc string test. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16694 from wangmiao1981/ser.	2017-01-27 16:03:53 -08:00
Zheng RuiFeng	8ccca9170f	[SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary ## What changes were proposed in this pull request? add loglikelihood in GMM.summary ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #12064 from zhengruifeng/gmm_metric.	2017-01-19 03:46:37 -08:00
Peng, Meng	32286ba68a	[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change ## What changes were proposed in this pull request? Add FDR test case in ml/feature/ChiSqSelectorSuite. Improve some comments in the code. This is a follow-up pr for #15212. ## How was this patch tested? ut Author: Peng, Meng <peng.meng@intel.com> Closes #16434 from mpjlu/fdr_fwe_update.	2017-01-10 13:09:58 +00:00
Yanbo Liang	3ef6d98a80	[SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml ## What changes were proposed in this pull request? Copy the `GaussianMixture` implementation from mllib to ml, so that we can add new features to it. I left mllib `GaussianMixture` untouched rather than making it wrap the ml implementation as some other algorithms do, for the following reasons: - mllib `GaussianMixture` allows k == 1, but ml does not. - mllib `GaussianMixture` supports setting an initial model, but ml does not currently support this. (We will definitely add this feature for ml in the future) We could work around these issues to make mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean. Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we only need to store the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduces the shuffled data size by about 50% and accelerates job execution. Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png) ## How was this patch tested? Existing tests and added new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15413 from yanboliang/spark-17847.	2017-01-09 21:38:46 -08:00
Niranjan Padmanabhan	a1e40b1f5d	[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo ## What changes were proposed in this pull request? There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words. ## How was this patch tested? N/A since only docs or comments were updated. Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com> Closes #16455 from neurons/np.structure_streaming_doc.	2017-01-04 15:07:29 +00:00
Peng	79ff853631	[SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? ut will be added soon Author: Peng <peng.meng@intel.com> Author: Peng, Meng <peng.meng@intel.com> Closes #15212 from mpjlu/fdr_fwe.	2016-12-28 00:49:36 -08:00
krishnakalyan3	c802ad8718	[SPARK-18628][ML] Update Scala param and Python param to have quotes ## What changes were proposed in this pull request? Updated Scala param and Python param to have quotes around the options making it easier for users to read. ## How was this patch tested? Manually checked the docstrings Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16242 from krishnakalyan3/doc-string.	2016-12-11 09:28:16 +00:00
Sandeep Singh	78bb7f8071	[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper ## What changes were proposed in this pull request? In `JavaWrapper`'s destructor, make the Java Gateway dereference the object, using `SparkContext._active_spark_context._gateway.detach`. Fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`. ## How was this patch tested? ```python import random, string from pyspark.ml.feature import StringIndexer l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters df = spark.createDataFrame(l, ['string']) for i in range(50): indexer = StringIndexer(inputCol='string', outputCol='index') indexer.fit(df) ``` * Before: a strong reference to StringIndexer was kept, causing GC issues, and the job halted midway. After: garbage collection works as the object is dereferenced, and the computation completes * Mem footprint tested using profiler * Added a parameter copy related test which was failing before. Author: Sandeep Singh <sandeep@techaddict.me> Author: jkbradley <joseph.kurata.bradley@gmail.com> Closes #15843 from techaddict/SPARK-18274.	2016-12-01 13:22:40 -08:00
Sandeep Singh	fe854f2e4f	[SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer ## What changes were proposed in this pull request? added the new handleInvalid param for these transformers to Python to maintain API parity. ## How was this patch tested? existing tests testing is done with new doctests Author: Sandeep Singh <sandeep@techaddict.me> Closes #15817 from techaddict/SPARK-18366.	2016-11-30 11:33:15 +02:00
Jeff Zhang	4c82ca86d9	[SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark ## What changes were proposed in this pull request? Add python api for KMeansSummary ## How was this patch tested? unit test added Author: Jeff Zhang <zjffdu@apache.org> Closes #13557 from zjffdu/SPARK-15819.	2016-11-29 20:51:27 -08:00
Yuhao	9b670bcaec	[SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit ## What changes were proposed in this pull request? make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs. Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #15972 from hhbyyh/experimental21.	2016-11-29 18:46:59 -08:00
Yanbo Liang	c4a7eef0ce	[SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove deprecated methods for ML. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15913 from yanboliang/spark-18481.	2016-11-26 05:28:41 -08:00
hyukjinkwon	933a6548d4	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation ## What changes were proposed in this pull request? It seems in Python, there are - `Note:` - `NOTE:` - `Note that` - `.. note::` This PR proposes to fix those to `.. note::` to be consistent. Before <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png"> <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png"> After <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png"> <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png"> ## How was this patch tested? The notes were found via ```bash grep -r "Note: " . grep -r "NOTE: " . grep -r "Note that " . ``` And then fixed one by one comparing with API documentation. After that, manually tested via `make html` under `./python/docs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15947 from HyukjinKwon/SPARK-18447.	2016-11-22 11:40:18 +00:00
sethah	e811fbf9ed	[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <seth.hendrickson16@gmail.com> Closes #15777 from sethah/pyspark_cluster_summaries.	2016-11-21 05:36:49 -08:00
Felix Cheung	55964c15a7	[SPARK-18239][SPARKR] Gradient Boosted Tree for R ## What changes were proposed in this pull request? Gradient Boosted Tree in R. With a few minor improvements to RandomForest in R. Since this is relatively isolated I'd like to target this for branch-2.1 ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15746 from felixcheung/rgbt.	2016-11-08 16:00:45 -08:00
Zheng RuiFeng	9dc9f9a5dd	[SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier ## What changes were proposed in this pull request? Add missing 'subsamplingRate' of pyspark GBTClassifier ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15692 from zhengruifeng/gbt_subsamplingRate.	2016-11-03 07:45:20 -07:00
Joseph K. Bradley	91c33a0ca5	[SPARK-18088][ML] Various ChiSqSelector cleanups ## What changes were proposed in this pull request? - Renamed kbest to numTopFeatures - Renamed alpha to fpr - Added missing Since annotations - Doc cleanups ## How was this patch tested? Added new standardized unit tests for spark.ml. Improved existing unit test coverage a bit. Author: Joseph K. Bradley <joseph@databricks.com> Closes #15647 from jkbradley/chisqselector-follow-ups.	2016-11-01 17:00:00 -07:00
Felix Cheung	7c37869292	[SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest regression and classification ## What changes were proposed in this pull request? Add subsamplingRate to randomForestClassifier and varianceCol to randomForestRegressor in Python. ## How was this patch tested? manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15638 from felixcheung/pyrandomforest.	2016-10-30 16:21:37 -07:00
VinceShieh	0b076d4cb6	[SPARK-17219][ML] enhanced NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR is an enhancement of the PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2. NaN is a special type of value which is commonly seen as invalid. But we find that there are certain cases where NaN values are also valuable and thus need special handling. We give users 3 options for dealing with NaN values: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleNaN to "keep", "skip", or "error" (default) respectively. '''Before: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) '''After: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) .setHandleNaN("keep") ## How was this patch tested? Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Author: Vincent Xie <vincent.xie@intel.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #15428 from VinceShieh/spark-17219_followup.	2016-10-27 11:52:15 -07:00
Peng	c8b612decb	[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference ## What changes were proposed in this pull request? The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic (the chi-square value), choosing the features with the largest chi-square values. But the degrees of freedom (df) of the chi-square values differ in Statistics.chiSqTest(RDD), and chi-square values with different df cannot be compared to select features. So we change statistic to pValue for SelectKBest and SelectPercentile ## How was this patch tested? change existing test Author: Peng <peng.meng@intel.com> Closes #15444 from mpjlu/chisqure-bug.	2016-10-14 12:48:57 +01:00
Yanbo Liang	1db8feab8c	[SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load ## What changes were proposed in this pull request? Since ```ml.evaluation``` has supported save/load at Scala side, supporting it at Python side is very straightforward and easy. ## How was this patch tested? Add python doctest. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13194 from yanboliang/spark-15402.	2016-10-14 04:17:03 -07:00
Yanbo Liang	44cbb61b34	[SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel. ## What changes were proposed in this pull request? Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15430 from yanboliang/spark-15957-python.	2016-10-13 19:44:24 -07:00
WeichenXu	0d4a695279	[SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter ## What changes were proposed in this pull request? update python api for NaiveBayes: add weight col parameter. ## How was this patch tested? doctests added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15406 from WeichenXu123/nb_python_update.	2016-10-12 19:52:57 -07:00
Zheng RuiFeng	c17f971839	[SPARK-17744][ML] Parity check between the ml and mllib test suites for NB ## What changes were proposed in this pull request? 1. Parity check and add missing test suites for ml's NB. 2. Remove some unused imports. ## How was this patch tested? manual tests in spark-shell Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15312 from zhengruifeng/nb_test_parity.	2016-10-04 06:54:48 -07:00
zero323	d8399b600c	[SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract ## What changes were proposed in this pull request? Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior. Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`. ## How was this patch tested? PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved. Author: zero323 <zero323@users.noreply.github.com> Closes #15144 from zero323/SPARK-17587.	2016-10-03 17:57:54 -07:00
Jason White	1f31bdaef6	[SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch ## What changes were proposed in this pull request? This PR removes a patch on ListConverter from https://github.com/apache/spark/pull/5570, as it is no longer necessary. The underlying issue in Py4J https://github.com/bartdag/py4j/issues/160 was patched in `224b94b666` and is present in 0.10.3, the version currently in use in Spark. ## How was this patch tested? The original test added in https://github.com/apache/spark/pull/5570 remains. Author: Jason White <jason.white@shopify.com> Closes #15254 from JasonMWhite/remove_listconverter_patch.	2016-10-03 14:12:03 -07:00
Sean Owen	b88cb63da3	[SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. ## What changes were proposed in this pull request? Partial revert of #15277 to instead sort and store input to model rather than require sorted input ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15299 from srowen/SPARK-17704.2.	2016-10-01 16:10:39 -04:00
WeichenXu	7f16affa26	[SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression ## What changes were proposed in this pull request? Add Python API for multinomial logistic regression. - add `family` param in python api. - expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel` - add python-side testcase for multinomial logistic regression - update python doc. ## How was this patch tested? existing and added doc tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14852 from WeichenXu123/add_MLOR_python.	2016-09-27 00:00:21 -07:00
Yanbo Liang	ac65139be9	[SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API. ## What changes were proposed in this pull request? #14597 modified ```ChiSqSelector``` to support the ```fpr``` selector type; however, it left some issues that need to be addressed: * We should allow users to set the selector type explicitly rather than switching it implicitly through different setter functions, since the result then depends on the order of the setter calls. For example, if users set both ```numTopFeatures``` and ```percentile```, a ```kbest``` or ```percentile``` model is trained depending on which was set last. This confuses users, so we should allow them to set the selector type explicitly. We handle similar issues in other places of the ML code base, such as ```GeneralizedLinearRegression``` and ```LogisticRegression```. * Meanwhile, if more than one parameter besides ```alpha``` can be set for the ```fpr``` model, we cannot handle it elegantly in the existing framework. Similar issues exist for the ```kbest``` and ```percentile``` models. Setting the selector type explicitly also solves this issue. * If users are allowed to set the selector type explicitly, we should handle param interactions: for example, if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify them that the parameter ```alpha``` will have no effect. We should handle complex parameter interaction checks in ```transformSchema```. (FYI #11620) * We should use lower case for the selector type names to follow the MLlib convention. * Add ML Python API. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15214 from yanboliang/spark-17017.	2016-09-26 09:45:33 +01:00
Sean Owen	248916f558	[SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at most one 0 ## What changes were proposed in this pull request? Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0 ## How was this patch tested? Jenkins tests plus new test cases Author: Sean Owen <sowen@cloudera.com> Closes #15149 from srowen/SPARK-17057.	2016-09-24 08:15:55 +01:00
WeichenXu	72d9fba26c	[SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression ## What changes were proposed in this pull request? Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR. ## How was this patch tested? Existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.	2016-09-22 04:35:54 -07:00
VinceShieh	57dc326bd0	[SPARK-17219][ML] Add NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14858 from VinceShieh/spark-17219.	2016-09-21 10:20:57 +01:00
Yanbo Liang	883c763184	[SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2. ## What changes were proposed in this pull request? #14956 reduced the default k-means|| init steps from 5 to 2 only for the spark.mllib package; we should make the same change for spark.ml and PySpark. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15050 from yanboliang/spark-17389.	2016-09-11 13:47:13 +01:00
Yanbo Liang	39d538dddf	[MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel. ## What changes were proposed in this pull request? ```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it. ## How was this patch tested? Doc change. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14967 from yanboliang/mlp-weights.	2016-09-06 03:30:37 -07:00
Yanbo Liang	6b8cb1fe52	[SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level configurable. ## What changes were proposed in this pull request? [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes the tree aggregation level in LiR/LoR configurable; this PR makes PySpark support this feature. ## How was this patch tested? Since ```aggregationDepth``` is an expert param, I prefer not to test it in the doctest, which also serves as an example. Here is the offline test result: ![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14766 from yanboliang/spark-17197.	2016-08-25 02:26:33 -07:00
Holden Karau	b264cbb16f	[SPARK-15113][PYSPARK][ML] Add missing num features num classes ## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Experimental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.	2016-08-22 12:21:22 +02:00
Bryan Cutler	39f328ba35	[SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used without stages ## What changes were proposed in this pull request? When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as it attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform; however, the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also minor cleanup of related code. ## How was this patch tested? Added new unit tests to verify an empty Pipeline acts as an identity transformer Author: Bryan Cutler <cutlerb@gmail.com> Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.	2016-08-19 23:46:36 -07:00
Jeff Zhang	072acf5e14	[SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector. ## What changes were proposed in this pull request? 1. In scala, add negative low bound checking and put all the low/upper bound checking in one place 2. In python, add low/upper bound checking of indices. ## How was this patch tested? unit test added Author: Jeff Zhang <zjffdu@apache.org> Closes #14555 from zjffdu/SPARK-16965.	2016-08-19 12:38:15 +01:00
Nick Lavers	5377fc6236	[SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace JIRA issue link: https://issues.apache.org/jira/browse/SPARK-16961 Changed one line of Utils.randomizeInPlace to allow elements to stay in place. Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution. Author: Nick Lavers <nick.lavers@videoamp.com> Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.	2016-08-19 10:11:59 +01:00
Yanbo Liang	ccc6dc0f4b	[MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark ## What changes were proposed in this pull request? Fix the typo ```TreeEnsembleModels``` in PySpark; it should be ```TreeEnsembleModel```, consistent with Scala. What's more, the class represents a single tree ensemble model, so ```TreeEnsembleModel``` is the more reasonable name. The class is not meant for public use, so the rename is not a breaking change. ## How was this patch tested? No new tests, should pass existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14454 from yanboliang/TreeEnsembleModel.	2016-08-11 22:39:19 -07:00
=^_^=	639df046a2	[SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics ## What changes were proposed in this pull request? avgMetrics was summed, not averaged, across folds Author: =^_^= <maxmoroz@gmail.com> Closes #14456 from pkch/pkch-patch-1.	2016-08-03 04:18:28 -07:00
krishnakalyan3	7e8279fde1	[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc ## What changes were proposed in this pull request? Updated ML pipeline Cross Validation Scaladoc & PyDoc. ## How was this patch tested? Documentation update Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #13894 from krishnakalyan3/kfold-cv.	2016-07-27 15:37:38 +02:00
WeichenXu	ad3708e783	[SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param default from 1e-4 to 1e-6 so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer. ## How was this patch tested? Existing Test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14286 from WeichenXu123/update_ann_tol.	2016-07-25 20:00:37 +01:00
WeichenXu	37bed97de5	[PYSPARK] add picklable SparseMatrix in pyspark.ml.common ## What changes were proposed in this pull request? add a `SparseMatrix` class which supports pickling. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14265 from WeichenXu123/picklable_py.	2016-07-24 02:29:08 -07:00
Yanbo Liang	670891496a	[SPARK-16494][ML] Upgrade breeze version to 0.12 ## What changes were proposed in this pull request? breeze 0.12 was released more than half a year ago, and it brings lots of new features, performance improvements and bug fixes. One of the biggest features is ```LBFGS-B```, an implementation of ```LBFGS``` with box constraints that is much faster for some special cases. We would like to implement the Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer to the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14150 from yanboliang/spark-16494.	2016-07-19 12:31:04 +01:00
Joseph K. Bradley	5ffd5d3838	[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide ## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * Reviewers: please check this carefully * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * Reviewers: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * Reviewers: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.	2016-07-15 13:38:23 -07:00
Joseph K. Bradley	01f09b1612	[SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml Annotate Estimator-Model pairs of classes and companion objects the same way. For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit.	2016-07-13 12:33:39 -07:00
Joseph K. Bradley	fdde7d0aa0	[SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls ## What changes were proposed in this pull request? Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark. This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X ## How was this patch tested? Existing unit tests. Manual testing in an environment where this was an issue. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14023 from jkbradley/SPARK-16348.	2016-07-05 17:00:24 -07:00
Nick Pentreath	18faa588ca	[SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg [SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.	2016-06-22 10:05:25 -07:00
Holden Karau	d281b0bafe	[SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs ## What changes were proposed in this pull request? Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable mathjax for PyDoc. ## How was this patch tested? Built docs locally & PySpark SQL tests Author: Holden Karau <holden@us.ibm.com> Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.	2016-06-22 11:54:49 +02:00
Bryan Cutler	b76e355376	[SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None ## What changes were proposed in this pull request? Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero. ## How was this patch tested? Ran PySpark tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.	2016-06-21 11:43:25 -07:00
Nick Pentreath	37494a18e8	[SPARK-10258][DOC][ML] Add @Since annotations to ml.feature This PR adds missing `Since` annotations to `ml.feature` package. Closes #8505. ## How was this patch tested? Existing tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13641 from MLnick/add-since-annotations.	2016-06-21 00:39:47 -07:00
Bryan Cutler	a42bf55532	[SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.	2016-06-20 16:28:11 -07:00
Liang-Chi Hsieh	baa3e633e1	[SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python ## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.	2016-06-13 19:59:53 -07:00
Bryan Cutler	7d7a0a5e07	[SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API ## What changes were proposed in this pull request? Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review. ## How was this patch tested? run pyspark-ml tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.	2016-06-10 11:27:30 -07:00
WeichenXu	cdd7f5a57a	[SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter ## What changes were proposed in this pull request? Word2vec python add maxsentence parameter. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.	2016-06-10 12:26:53 +01:00
Jeff Zhang	e594b49283	[SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property ## What changes were proposed in this pull request? add method idf to IDF in pyspark ## How was this patch tested? add unit test Author: Jeff Zhang <zjffdu@apache.org> Closes #13540 from zjffdu/SPARK-15788.	2016-06-09 09:54:38 -07:00
Yanbo Liang	a95252823e	[SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples ## What changes were proposed in this pull request? Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken. ```python pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.' ``` We should use ```accuracy``` to replace ```precision``` in these examples. ## How was this patch tested? Offline tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13519 from yanboliang/spark-15771.	2016-06-06 09:36:34 +01:00
Zheng RuiFeng	fd8af39713	[MINOR] Fix Typos 'an -> a' ## What changes were proposed in this pull request? `an -> a` Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13515 from zhengruifeng/an_a.	2016-06-06 09:35:47 +01:00
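
Note on 5377fc6236 (SPARK-16961): the commit message says one line of Utils.randomizeInPlace was changed "to allow elements to stay in place". The sketch below is a plain-Python illustration of that kind of off-by-one, not the Spark code itself. It assumes the bug was an exclusive upper bound on the swap index; the function names and the fixed-point count are hypothetical helpers introduced only for this illustration.

```python
import random


def randomize_in_place_biased(arr, rand):
    """Assumed pre-fix behaviour: j is drawn from [0, i), so arr[i] always moves."""
    for i in range(len(arr) - 1, 0, -1):
        j = rand.randrange(i)  # upper bound excludes i itself
        arr[i], arr[j] = arr[j], arr[i]
    return arr


def randomize_in_place(arr, rand):
    """Fisher-Yates shuffle: j is drawn from [0, i], so arr[i] may stay where it is."""
    for i in range(len(arr) - 1, 0, -1):
        j = rand.randrange(i + 1)  # upper bound includes i
        arr[i], arr[j] = arr[j], arr[i]
    return arr


def count_fixed_points(shuffle, n=5, trials=2000, seed=0):
    """Count how often an element ends up at its original index across many shuffles."""
    rand = random.Random(seed)
    total = 0
    for _ in range(trials):
        out = shuffle(list(range(n)), rand)
        total += sum(1 for idx, value in enumerate(out) if idx == value)
    return total


if __name__ == "__main__":
    print("biased:", count_fixed_points(randomize_in_place_biased))
    print("fixed: ", count_fixed_points(randomize_in_place))
```

Under this assumption the biased variant can only produce permutations in which every element moves, so its fixed-point count stays at zero, while the corrected variant leaves elements in place at a nonzero rate. A chi-squared test against the uniform distribution, as the commit's unit test describes, would detect the same skew.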

562 commits