### What changes were proposed in this pull request?
Fix: PySpark ML Validator params in estimatorParamMaps may be lost after saving and reloading.
When saving the validator's estimatorParamMaps, check all nested stages in the tuned estimator to get the correct param parent.
Two typical cases to manually test:
~~~python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())
tvs.save(tvsPath)  # tvsPath: a placeholder path for saving
loadedTvs = TrainValidationSplit.load(tvsPath)
# check that `loadedTvs.getEstimatorParamMaps()` is restored correctly
~~~
~~~python
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)
tvs.save(tvsPath)  # tvsPath: a placeholder path for saving
loadedTvs = TrainValidationSplit.load(tvsPath)
# check that `loadedTvs.getEstimatorParamMaps()` is restored correctly
~~~
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors.
### Why are the changes needed?
The performance gain is mainly related to the nnz (number of non-zero values) of a block, not to its number of rows.
### Does this PR introduce _any_ user-facing change?
Yes, param `blockSize` -> `maxBlockSizeInMB` in master.
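A hedged sketch of setting the size-based param from PySpark (assuming it is exposed as `maxBlockSizeInMB` with a matching setter):
```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10)
# Stack input vectors into matrix blocks of at most ~1 MB each;
# 0.0 (the default) lets the algorithm pick a size automatically.
lr.setMaxBlockSizeInMB(1.0)
```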
### How was this patch tested?
updated testsuites
Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 into https://github.com/apache/spark/pull/30009,
but I merged it into master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR proposes migration of `pyspark.ml` to NumPy documentation style.
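For illustration, a minimal sketch of the target NumPy documentation style (a generic function, not taken from this PR):
```python
def transform(dataset):
    """
    Transforms the input dataset.

    Parameters
    ----------
    dataset : :py:class:`pyspark.sql.DataFrame`
        Input dataset.

    Returns
    -------
    :py:class:`pyspark.sql.DataFrame`
        Transformed dataset.
    """
```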
### Why are the changes needed?
To improve documentation style.
### Does this PR introduce _any_ user-facing change?
Yes, this changes both rendered HTML docs and console representation (SPARK-33243).
### How was this patch tested?
`dev/lint-python` and manual inspection.
Closes #30285 from zero323/SPARK-33251.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds type annotations directly to the Spark source.
This can affect how development tools interact with PySpark for users who haven't used `pyspark-stubs`.
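For illustration, a hedged sketch of what the annotations enable; a type checker such as mypy can now flag mistakes like this before runtime (exact diagnostics depend on the stub definitions):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

df.show(truncate="yes")  # flagged by mypy: `truncate` expects a bool or an int
```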
### How was this patch tested?
- [x] MyPy tests of the PySpark source
```
mypy --no-incremental --config python/mypy.ini python/pyspark
```
- [x] MyPy tests of Spark examples
```
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```
- [x] Existing Flake8 linter
- [x] Existing unit tests
Tested against:
- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`
Closes #29591 from zero323/SPARK-32681.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Replace `__metaclass__` fields with `metaclass` keyword in the class statements.
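A minimal illustrative sketch of the change (generic classes, not the exact diff):
```python
import inspect
from abc import ABCMeta, abstractmethod

class OldStyle(object):
    __metaclass__ = ABCMeta  # Python 2 syntax; silently ignored by Python 3

    @abstractmethod
    def fit(self):
        raise NotImplementedError

class NewStyle(metaclass=ABCMeta):  # Python 3 syntax; actually takes effect
    @abstractmethod
    def fit(self):
        raise NotImplementedError

print(inspect.isabstract(OldStyle))  # False on Python 3
print(inspect.isabstract(NewStyle))  # True
```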
### Why are the changes needed?
`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.
```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```
and classes that are supposed to be abstract are not:
```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```
### Does this PR introduce _any_ user-facing change?
Yes (classes which were no longer abstract or singletons in Python 3 are now), though the visible changes should be considered a bug fix.
### How was this patch tested?
Existing tests.
Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Disallow the use of unused imports:
- They unnecessarily increase the memory footprint of the application.
- This change also moves imports that are only required for the docstring examples from file scope into the examples themselves. This keeps the files clean and gives more complete examples, since they also include the imports :)
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```
After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```
### What changes were proposed in this pull request?
Removing unused imports from the Python files to keep everything nice and tidy.
### Why are the changes needed?
Cleaning up the imports that aren't used, and suppressing the lint warnings for imports that are kept as references for other modules, preserving backward compatibility.
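A hedged sketch of the suppression mechanism for deliberate re-exports (assuming `noqa` comments are used, as is typical for Flake8):
```python
# e.g. in pyspark/__init__.py: re-exported for backward compatibility,
# so the "imported but unused" warning is suppressed explicitly
from pyspark.sql import SQLContext, HiveContext, Row  # noqa: F401
```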
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Adding the rule to the existing Flake8 checks.
Closes #29121 from Fokko/SPARK-32319.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to drop Python 2.7, 3.4 and 3.5.
Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparisons and `__future__` imports. It also removes Python 2 dedicated code such as `ArrayConstructor` in Spark.
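For illustration, the kind of widely known workaround being removed (an illustrative pattern, not the exact diff):
```python
import sys

if sys.version >= '3':
    basestring = str  # Python 3 has no basestring; alias it so old checks work
    unicode = str

def check(name):
    # On Python 2 this accepted both str and unicode; the aliases above
    # made the same code work on Python 3. All of this can now go away.
    if not isinstance(name, basestring):
        raise TypeError("name must be a string")
```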
### Why are the changes needed?
1. Drop support for EOL Python versions.
2. Reduce maintenance overhead and remove a bit of legacy code and hacks for Python 2.
3. PyPy2 has a critical bug that causes a flaky test (SPARK-28358), given my testing and investigation.
4. Users can use Python type hints with Pandas UDFs without worrying about the Python version.
5. Users can leverage the latest cloudpickle (https://github.com/apache/spark/pull/28950). With Python 3.8+ it can also leverage the C pickle implementation.
### Does this PR introduce _any_ user-facing change?
Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.
### How was this patch tested?
Manually tested and also tested in Jenkins.
Closes #28957 from HyukjinKwon/SPARK-32138.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930
Remove ML/MLLIB DeveloperApi annotations.
### Why are the changes needed?
The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.
### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.
### How was this patch tested?
existing tests
Closes #27859 from huaxingao/spark-30930.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize``` in ALS, so users can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
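A brief hedged usage sketch of the new Python setter (`blockSize` controls how input data is stacked into blocks):
```python
from pyspark.ml.recommendation import ALS

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
als.setBlockSize(2048)     # stack input data into blocks of 2048 rows
print(als.getBlockSize())  # 2048
```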
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Revert
#27360, #27396, #27374, #27389
### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
Performance tests of LogisticRegression (https://github.com/apache/spark/pull/27374) on sparse datasets show that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we revert them as well to make sure there is no regression.
### Does this PR introduce any user-facing change?
Removes the newly added param `blockSize`.
### How was this patch tested?
reverted testsuites
Closes #27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Use blocks instead of vectors.
2. Use Level-2 BLAS for binary, and Level-3 BLAS for multinomial.
### Why are the changes needed?
1. Less RAM to persist training data (saves ~40%).
2. Faster than the existing implementation (40% to 92% faster); see the sketch below.
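To make the idea concrete, a small NumPy sketch of what blockification buys (illustrative only; Spark's actual implementation works on the JVM side with BLAS):
```python
import numpy as np

block = np.random.rand(1024, 10)      # 1024 instances stacked into one matrix

# Binary case: one Level-2 BLAS call (matrix-vector, gemv)
coef = np.random.rand(10)
margins = block @ coef                # replaces 1024 separate dot products

# Multinomial case: one Level-3 BLAS call (matrix-matrix, gemm)
coef_matrix = np.random.rand(10, 5)   # 5 classes
margins_multi = block @ coef_matrix
```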
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27374 from zhengruifeng/blockify_lor.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Stack input vectors into blocks (like ALS/MLP).
2. Add a new param `blockSize`.
3. Add a new class `InstanceBlock`.
4. Standardize the input outside of the optimization procedure.
### Why are the changes needed?
1. Reduce RAM needed to persist the training dataset (saves ~40% in tests).
2. Use Level-2 BLAS routines (12% to 28% faster, without native BLAS).
### Does this PR introduce any user-facing change?
a new param `blockSize`
### How was this patch tested?
existing and updated testsuites
Closes #27360 from zhengruifeng/blockify_svc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
Modify Param._copyValues to check that valid Param objects are supplied as extras.
### What changes were proposed in this pull request?
Estimator.fit() and Model.transform() accept a dictionary of extra parameters whose values are used to overwrite those supplied at initialization or by default. Additionally, the ParamGridBuilder.addGrid accepts a parameter and list of values. The keys are presumed to be valid Param objects. This change adds a check that only Param objects are supplied as keys.
### Why are the changes needed?
Param objects are created by and bound to an instance of Params (Estimator, Model, or Transformer). They may be obtained from their parent as attributes, or by name through getParam.
The documentation does not state that keys must be valid Param objects, nor describe how one may be obtained. The current behavior is to silently ignore keys which are not valid Param objects.
### Does this PR introduce any user-facing change?
If the user does not pass in a Param object as required for keys in `extra` for Estimator.fit() and Model.transform(), and `param` for ParamGridBuilder.addGrid, an error will be raised indicating it is an invalid object.
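A hedged sketch of the difference (`train_df` is an assumed training DataFrame; the exact error type and message may differ):
```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
# OK: the key is a Param object bound to `lr`
model = lr.fit(train_df, {lr.maxIter: 10})
# Previously ignored silently; now raises because the key is a plain
# string rather than a Param object
model = lr.fit(train_df, {"maxIter": 10})
```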
### How was this patch tested?
Added method test_copy_param_extras_check to test_param.py. Tested with Python 3.7
Closes #26527 from JohnHBauer/paramExtra.
Authored-by: John Bauer <john.h.bauer@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
1. Add a shared param `relativeError`.
2. Make `Imputer`/`RobustScaler`/`QuantileDiscretizer` extend `HasRelativeError`.
### Why are the changes needed?
It makes sense to expose `relativeError` to end users, since it controls both the precision and the memory overhead.
`QuantileDiscretizer` already has this param, while the other algorithms do not yet.
### Does this PR introduce any user-facing change?
Yes, a new param is added in `Imputer`/`RobustScaler`.
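A hedged usage sketch (assuming `relativeError` is exposed as a constructor keyword like other params):
```python
from pyspark.ml.feature import Imputer

# A smaller relativeError gives a more precise approximate median at the
# cost of more memory during aggregation
imputer = Imputer(strategy="median", inputCols=["a"], outputCols=["a_imputed"],
                  relativeError=0.01)
```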
### How was this patch tested?
Existing testsuites.
Closes #26305 from zhengruifeng/add_relative_err.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in `_shared_params_code_gen.py`.
### Why are the changes needed?
To keep parity between Scala and Python.
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes #26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change PySpark ML ```Params._clear``` to ```Params.clear```.
### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user-supplied param.
### Does this PR introduce any user-facing change?
Yes. PySpark ML ```Params._clear``` -> ```Params.clear```.
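A short sketch of the renamed API (default value shown for illustration):
```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=5)
lr.clear(lr.maxIter)    # public after this change (previously lr._clear)
print(lr.getMaxIter())  # falls back to the default value (100)
```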
### How was this patch tested?
Add test.
Closes #26130 from huaxingao/spark-29464.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Support multi-column Bucketizer on the Python side.
### Why are the changes needed?
Bucketizer should support multiple columns, like on the Scala side.
### Does this PR introduce any user-facing change?
Yes, this PR adds a new Python API; see the sketch below.
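A hedged sketch of the multi-column API (assuming the Python params mirror the Scala side: `splitsArray`, `inputCols`, `outputCols`):
```python
from pyspark.ml.feature import Bucketizer

inf = float("inf")
bucketizer = Bucketizer(
    splitsArray=[[-inf, 0.5, inf], [-inf, 0.0, inf]],  # one splits list per column
    inputCols=["v1", "v2"],
    outputCols=["v1_bucket", "v2_bucket"],
)
```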
### How was this patch tested?
added testsuites
Closes #25801 from zhengruifeng/20542_py.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add HasNumFeatures on the Scala side, with `1<<18` as the default value.
### Why are the changes needed?
HasNumFeatures is already added on the Python side; it is reasonable to keep them in sync.
I didn't find other similar places.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing testsuites
Closes #25671 from zhengruifeng/add_HasNumFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
## What changes were proposed in this pull request?
Leave ```shared.py``` untouched. Move Python ```DecisionTreeParams``` to ```regression.py```
## How was this patch tested?
Use existing tests
Closes #25406 from huaxingao/spark-28243.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams
## How was this patch tested?
Use existing tests.
Closes #25046 from huaxingao/spark-28243.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add `validationIndicatorCol` and `validationTol` to GBT Python.
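A hedged usage sketch of the new params:
```python
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(
    maxIter=50,
    validationIndicatorCol="isVal",  # boolean column marking validation rows
    validationTol=0.01,              # threshold for early stopping on validation error
)
```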
## How was this patch tested?
Add test in doctest to test the new API.
Closes #21465 from huaxingao/spark-24333.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Add `distanceMeasure` to BisectingKMeans in Python.
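A brief hedged sketch:
```python
from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans(k=3, distanceMeasure="cosine")  # default is "euclidean"
```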
## How was this patch tested?
added doctest and also manually tested it.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21557 from huaxingao/spark-24439.
## What changes were proposed in this pull request?
Add python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting.
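A hedged usage sketch (`lr`, `grid`, `evaluator`, and `train_df` are assumed to be defined as usual):
```python
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,
                    collectSubModels=True)
cvModel = cv.fit(train_df)
print(cvModel.subModels)  # one list of fitted sub-models per fold
```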
## How was this patch tested?
UT added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes #19627 from WeichenXu123/expose-model-list-py.
## What changes were proposed in this pull request?
Support multi-column Bucketizer on the Python side.
## How was this patch tested?
existing tests and added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #19892 from zhengruifeng/20542_py.
## What changes were proposed in this pull request?
OneHotEncoderEstimator Python API.
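A brief hedged sketch (`df` is an assumed DataFrame with an indexed categorical column):
```python
from pyspark.ml.feature import OneHotEncoderEstimator

encoder = OneHotEncoderEstimator(inputCols=["category_idx"],
                                 outputCols=["category_vec"])
model = encoder.fit(df)  # unlike the old OneHotEncoder, this is an Estimator
encoded = model.transform(df)
```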
## How was this patch tested?
doctest
Author: WeichenXu <weichen.xu@databricks.com>
Closes #20209 from WeichenXu123/ohe_py.
## What changes were proposed in this pull request?
Add a note to the `HasCheckpointInterval` parameter doc that clarifies that this setting is ignored when no checkpoint directory has been set on the spark context.
## How was this patch tested?
No tests necessary, just a doc update.
Author: sethah <shendrickson@cloudera.com>
Closes #20188 from sethah/als_checkpoint_doc.
## What changes were proposed in this pull request?
Expose Python API for _LinearRegression_ with _huber_ loss.
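A brief hedged usage sketch:
```python
from pyspark.ml.regression import LinearRegression

# epsilon controls the switch-over point between squared and absolute loss
lr = LinearRegression(loss="huber", epsilon=1.35)
```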
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #19994 from yanboliang/spark-22810.
## What changes were proposed in this pull request?
Added tunable parallelism to the PySpark implementation of one-vs-rest classification. Added a parallelism parameter to the Scala implementation of one-vs-rest, along with functionality for using the parameter to tune the level of parallelism.
I am taking over PR #18281 because the original author is busy, but we need to merge this PR soon.
After this is merged, we can close #18281.
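A brief hedged usage sketch:
```python
from pyspark.ml.classification import LogisticRegression, OneVsRest

# Train up to 4 per-class binary classifiers concurrently
ovr = OneVsRest(classifier=LogisticRegression(), parallelism=4)
```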
## How was this patch tested?
Test suite added.
Author: Ajay Saini <ajays725@gmail.com>
Author: WeichenXu <weichen.xu@databricks.com>
Closes #19110 from WeichenXu123/spark-21027.
## What changes were proposed in this pull request?
This PR proposes to support unicode in Param methods in ML, and in other functions missed in DataFrame.
For example, this causes a `TypeError` in Python 2.x when the param is a unicode string:
```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
...
raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```
This PR is based on https://github.com/apache/spark/pull/13036
## How was this patch tested?
Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: sethah <seth.hendrickson16@gmail.com>
Closes #17096 from HyukjinKwon/SPARK-15243.
## What changes were proposed in this pull request?
Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.
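A hedged sketch of the kind of Python-only persistence these enable (`IdentityTransformer` is a hypothetical class):
```python
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class IdentityTransformer(Transformer, DefaultParamsWritable, DefaultParamsReadable):
    """A hypothetical Python-only transformer; its params persist as JSON."""
    def _transform(self, dataset):
        return dataset

t = IdentityTransformer()
t.save("/tmp/identity_transformer")
loaded = IdentityTransformer.load("/tmp/identity_transformer")
```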
## How was this patch tested?
Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure it is the same. This test was done both in the Python REPL and in the unit tests.
Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.
Author: Ajay Saini <ajays725@gmail.com>
Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
## What changes were proposed in this pull request?
Python API for constrained logistic regression, based on #17922; thanks to zero323 for the original contribution.
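A hedged usage sketch of box constraints (a binary problem with 3 features; param names assumed to mirror the Scala side):
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

# Constrain all coefficients to be non-negative and bound the intercept
lr = LogisticRegression(
    lowerBoundsOnCoefficients=Matrices.dense(1, 3, [0.0, 0.0, 0.0]),
    upperBoundsOnIntercepts=Vectors.dense(1.0),
)
```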
## How was this patch tested?
Unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18759 from yanboliang/SPARK-20601.
## What changes were proposed in this pull request?
Fixed the PySpark Params.copy method to behave like the Scala implementation. The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
## How was this patch tested?
Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.
## What changes were proposed in this pull request?
* Removed Since tags in Python Params since they are inherited by other classes
* Fixed doc links for LinearSVC
## How was this patch tested?
* doc tests
* generating docs locally and checking manually
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #16723 from jkbradley/pyparam-fix-doc.
## What changes were proposed in this pull request?
Add Python API for the newly added LinearSVC algorithm.
## How was this patch tested?
Add new doc string test.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16694 from wangmiao1981/ser.
## What changes were proposed in this pull request?
Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0.
## How was this patch tested?
Jenkins tests plus new test cases
Author: Sean Owen <sowen@cloudera.com>
Closes #15149 from srowen/SPARK-17057.
## What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix types in the new ML pipeline-based APIs.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #12627 from dbtsai/SPARK-14615-NewML.
## What changes were proposed in this pull request?
1. Add argument checks for `tol` and `stepSize` to keep in line with `SharedParamsCodeGen.scala`.
2. Fix one typo.
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12996 from zhengruifeng/py_args_checking.
## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For example,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweep, and we cleaned up wherever possible.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12749 from yanboliang/spark-14971.
## What changes were proposed in this pull request?
Removed expectedType arg from PySpark Param __init__, as suggested by the JIRA.
## How was this patch tested?
Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix.
Author: Jason Lee <cjlee@us.ibm.com>
Closes #12581 from jasoncl/SPARK-14768.
## What changes were proposed in this pull request?
Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.
This PR uses unicode everywhere in Python.
## How was this patch tested?
Updated persistence unit test to check uid type
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12368 from jkbradley/python-uid-unicode.
## What changes were proposed in this pull request?
Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.
## How was this patch tested?
Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #11939 from sethah/SPARK-14104.
## What changes were proposed in this pull request?
A new column, varianceCol, has been added to DecisionTreeRegressor in the ML Scala code.
This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.
## How was this patch tested?
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (12s)
Finished test(python2.7): pyspark.ml.clustering (18s)
Finished test(python2.7): pyspark.ml.classification (30s)
Finished test(python2.7): pyspark.ml.recommendation (28s)
Finished test(python2.7): pyspark.ml.feature (43s)
Finished test(python2.7): pyspark.ml.regression (31s)
Finished test(python2.7): pyspark.ml.tuning (19s)
Finished test(python2.7): pyspark.ml.tests (34s)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12116 from wangmiao1981/fix_api.
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests
Copied developer comments from the spark.mllib implementation to the spark.ml one.
Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type.
**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit(). Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #11855 from jkbradley/remove-mllib-tree-impl.
## What changes were proposed in this pull request?
This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
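A hedged sketch of declaring a param with a factory converter (a hypothetical standalone param):
```python
from pyspark.ml.param import Param, Params, TypeConverters

# Values set for this param are coerced to int when possible; otherwise a
# coherent TypeError is raised at set time instead of a Py4J cast failure
maxIter = Param(Params._dummy(), "maxIter", "max number of iterations (>= 0)",
                typeConverter=TypeConverters.toInt)
```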
## How was this patch tested?
Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #11663 from sethah/SPARK-13068-tc.
## What changes were proposed in this pull request?
Added a check in pyspark.ml.param.Params.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance. This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
## How was this patch tested?
Added a test case with a class that has a property that raises an error when invoked, and then called `Params.params` to verify that the property is not invoked but that another `Param` in the class is still found. Also ran the pyspark-ml tests before the fix to trigger the error, and again after the fix to verify that the error was resolved and the method worked properly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.