ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Phillip Henry	397b843890	[SPARK-34415][ML] Randomization in hyperparameter optimization ### What changes were proposed in this pull request? Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here: http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html All code is entirely my own work and I license the work to the project under the project’s open source license. ### Why are the changes needed? Randomization can be a more effective technique than a grid search since min/max points can fall between the grid points and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts. Alice Zheng has an accessible description of how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python. ### Does this PR introduce _any_ user-facing change? A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined. ### How was this patch tested? Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added. `ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface. `RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed. Closes #31535 from PhillHenry/ParamRandomBuilder. Authored-by: Phillip Henry <PhillHenry@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-27 08:34:39 -06:00
Fokko Driesprong	e4d1c10760	[SPARK-32320][PYSPARK] Remove mutable default arguments This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ ``` fokkodriesprongFan spark % grep -R "={}" python | grep def python/pyspark/resource/profile.py: def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}): python/pyspark/sql/functions.py:def from_json(col, schema, options={}): python/pyspark/sql/functions.py:def to_json(col, options={}): python/pyspark/sql/functions.py:def schema_of_json(json, options={}): python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}): python/pyspark/sql/functions.py:def to_csv(col, options={}): python/pyspark/sql/functions.py:def from_csv(col, schema, options={}): python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}): ``` ``` fokkodriesprongFan spark % grep -R "=\[\]" python | grep def python/pyspark/ml/tuning.py: def __init__(self, bestModel, avgMetrics=[], subModels=None): python/pyspark/ml/tuning.py: def __init__(self, bestModel, validationMetrics=[], subModels=None): ``` ### What changes were proposed in this pull request? Removing the mutable default arguments. ### Why are the changes needed? Removing the mutable default arguments, and changing the signature to `Optional[...]`. ### Does this PR introduce _any_ user-facing change? No 👍 ### How was this patch tested? Using the Flake8 bugbear code analysis plugin. Closes #29122 from Fokko/SPARK-32320. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 09:35:36 +08:00
Weichen Xu	7e759b2d95	[SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator ### What changes were proposed in this pull request? Make the CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimators/models. ### Why are the changes needed? Currently, PySpark supports third-party libraries that define Python backend estimators/evaluators, i.e., estimators that inherit from `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators but cannot save or load them, because the CrossValidator and TrainValidateSplit writer implementations use JavaMLWriter, which requires converting the nested estimator and evaluator into Java instances. OneVsRest saving/loading currently supports only Java backend classifiers due to a similar issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30471 from WeichenXu123/support_pyio_tuning. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-12-04 08:35:50 +08:00
zero323	31a16fbb40	[SPARK-32714][PYTHON] Initial pyspark-stubs port ### What changes were proposed in this pull request? This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. This PR adds type annotations directly to Spark source. This can impact interaction with development tools for users who haven't used `pyspark-stubs`. ### How was this patch tested? - [x] MyPy tests of the PySpark source ``` mypy --no-incremental --config python/mypy.ini python/pyspark ``` - [x] MyPy tests of Spark examples ``` MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` - [x] Existing Flake8 linter - [x] Existing unit tests Tested against: - `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894` - `mypy==0.782` Closes #29591 from zero323/SPARK-32681. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-24 14:15:36 +09:00
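
The SPARK-34415 message argues that an optimum lying between grid points can never be found by a grid search, whereas random sampling is not restricted in that way. A minimal plain-Python sketch of that argument follows. It does not use Spark or the `ParamRandomBuilder` API; the `objective` function, its minimum at 0.37, and the helper names are all hypothetical and chosen only for illustration.

```python
import random


def objective(x):
    # Hypothetical loss whose minimum (x = 0.37) sits between grid points.
    return (x - 0.37) ** 2


def grid_candidates(low, high, n):
    # n evenly spaced values, including both endpoints.
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def random_candidates(low, high, n, seed):
    # n values drawn uniformly from [low, high]; seeded for repeatability.
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


grid = grid_candidates(0.0, 1.0, 5)  # 0.0, 0.25, 0.5, 0.75, 1.0
rand = random_candidates(0.0, 1.0, 5, seed=42)

# The grid can only ever return one of its fixed points; the random
# candidates may land anywhere in the range, so how close they get
# depends on the number of attempts.
best_grid = min(grid, key=objective)
best_rand = min(rand, key=objective)
```

With the same budget of five evaluations, the grid search can do no better than its nearest fixed point, while the random search's result varies with the seed and improves in expectation as more samples are drawn.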

4 commits
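
The pitfall that SPARK-32320 removes can be shown in a few lines. This is a minimal sketch with hypothetical function names, not code from the PySpark source: a mutable default is created once, when the function is defined, and is then shared by every call that omits the argument. Using `None` as the default and building the object inside the function avoids the sharing.

```python
def append_shared(item, bucket=[]):
    # The default list is built once at definition time, so every call
    # that omits `bucket` mutates the same list object.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # None acts as a sentinel; a new list is created on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = append_shared(1)
shared_second = append_shared(2)  # same object as shared_first

fresh_first = append_fresh(1)
fresh_second = append_fresh(2)    # independent list
```

After these calls, `shared_first` and `shared_second` refer to one list containing both items, while `fresh_first` and `fresh_second` each hold only their own item.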