### What changes were proposed in this pull request?
revert changes related to ANN
### Why are the changes needed?
using the new `softmax` may cause flaky failure
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
reverted testsuite
Closes#33049 from zhengruifeng/revert_softmax_ann.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
softmax support offset and step, then we can use it in ANN and NB
### Why are the changes needed?
to simplify impl
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite
Closes#32991 from zhengruifeng/softmax_support_offset_step.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.
This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.
### Why are the changes needed?
To mimic the JVM behaviour about thread lifecycle
### Does this PR introduce _any_ user-facing change?
Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object.
This is a small difference that wouldn't likely affect end users.
### How was this patch tested?
Existing tests should cover this.
Closes#32962 from HyukjinKwon/SPARK-35498-SPARK-35303.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to use a proper built-in exceptions instead of the plain `Exception` in Python.
While I am here, I fixed another minor issue at `DataFrams.schema` together:
```diff
- except AttributeError as e:
- raise Exception(
- "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+ raise ValueError(
+ "Unable to parse datatype from schema. %s" % e) from e
```
Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.
### Why are the changes needed?
For users to expect the proper exceptions.
### Does this PR introduce _any_ user-facing change?
Yeah, the exception classes became different but should be compatible because previous exception was plain `Exception` which other exceptions inherit.
### How was this patch tested?
Existing unittests should cover,
Closes#31238Closes#32650 from HyukjinKwon/SPARK-32194.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add required imports to Pyspark ML examples in CrossValidator, TrainValidationSplit
### Why are the changes needed?
The examples pass doctests because of previous imports, but as they appear in Pyspark documentation, are incomplete. The additional imports are required to make the example work.
### Does this PR introduce _any_ user-facing change?
No, docs only change.
### How was this patch tested?
Existing tests.
Closes#32554 from srowen/TuningImports.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR removes the check of `summary.logLikelihood` in ml/clustering.py - this GMM test is quite flaky. It fails easily e.g., if:
- change number of partitions;
- just change the way to compute the sum of weights;
- change the underlying BLAS impl
Also uses more permissive precision on `Word2Vec` test case.
### Why are the changes needed?
To recover the build and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test cases.
Closes#32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas
### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.
[1] https://docs.python.org/3/library/exceptions.html#TypeError
Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.
### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.
### How was this patch tested?
Existing test case and UT
Closes#32368 from Yikun/SPARK-35176.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector
In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
The documentation is ambiguous & can lead to the false conclusion that instances of
pyspark.mllib.linalg.DenseVector will be returned.
Conversion from ml versions to mllib versions can easly be achieved with
mlutils.convertVectorColumnsToML helper.
### What changes were proposed in this pull request?
Make documentation more explicit
### Why are the changes needed?
The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No test were run as only the documentation was changed
Closes#32255 from jlafaye/master.
Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, remove existing agg, and use a new agg supporting virtual centering
2, add related testsuites
### Why are the changes needed?
centering vectors should accelerate convergence, and generate solution more close to R
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
updated testsuites and added testsuites
Closes#32124 from zhengruifeng/svc_agg_refactor.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#32245 from harupy/SPARK-35142.
Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
This PR replaces the non-ASCII characters to ASCII characters when possible in PySpark documentation
### Why are the changes needed?
To avoid unnecessarily using other non-ASCII characters which could lead to the issue such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782
### Does this PR introduce _any_ user-facing change?
Virtually no.
### How was this patch tested?
Found via (Mac OS):
```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```
Closes#32048 from HyukjinKwon/minor-fix.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Document `mode` as a supported Imputer strategy in Pyspark docs.
### Why are the changes needed?
Support was added in 3.1, and documented in Scala, but some Python docs were missed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#31883 from srowen/ImputerModeDocs.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix a call to setParams in the Linear Regression docs example in Pyspark to avoid a TypeError.
### Why are the changes needed?
The example is slightly wrong and we should not show an error in the docs.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Existing tests
Closes#31760 from srowen/SPARK-34642.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.
Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
Closes#31535 from PhillHenry/ParamRandomBuilder.
Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This follows up #31160 to update score function in the document.
### Why are the changes needed?
Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me the sklearn's naming. It is good to have it but I think it is nice if we have formal score function name with sklearn's ones.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No, only doc change.
Closes#31531 from viirya/SPARK-34080-minor.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1, use Guavaording instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector
### Why are the changes needed?
this pr is about 30% faster than existing impl
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
existing testsuites
Closes#31276 from zhengruifeng/w2v_findSynonyms_opt.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
update the ParamValidators of `maxDepth`
### Why are the changes needed?
current impl of tree models only support maxDepth<=30
### Does this PR introduce _any_ user-facing change?
If `maxDepth`>30, fail quickly
### How was this patch tested?
existing testsuites
Closes#31163 from zhengruifeng/param_maxDepth_upbound.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR:
- Adds as small hierarchy of warnings to be used in PySpark applications. These extend built-in classes and top level `PySparkWarning`.
- Replaces `DeprecationWarnings` (intended for developers) with PySpark specific subclasses of `FutureWarning` (intended for end users).
### Why are the changes needed?
- To be more precise and add users additional control (in addition to standard module level filters) over PySpark warnings handling.
- Correct semantics (at the moment we use `DeprecationWarning` in user-facing API, but it is intended "for warnings about deprecated features when those warnings are intended for other Python developers").
### Does this PR introduce _any_ user-facing change?
Yes. Code can raise different type of warning than before.
### How was this patch tested?
Existing tests.
Closes#30985 from zero323/SPARK-33730.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or
numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
### How was this patch tested?
Add Unit test
Closes#31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model
### Why are the changes needed?
Currently, pyspark support third-party library to define python backend estimator/evaluator, i.e., estimator that inherit `Estimator` instead of `JavaEstimator`, and only can be used in pyspark.
CrossValidator and TrainValidateSplit support tuning these python backend estimator,
but cannot support saving/load, becase CrossValidator and TrainValidateSplit writer implementation is use JavaMLWriter, which require to convert nested estimator and evaluator into java instance.
OneVsRest saving/load now only support java backend classifier due to similar issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#30471 from WeichenXu123/support_pyio_tuning.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
This is to update `labelsArray`'s since tag.
### Why are the changes needed?
The original change was backported to branch-3.0 for 3.0.2 version. So it is better to update the since tag to reflect the fact.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A. Just tag change.
Closes#30582 from viirya/SPARK-33636-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.
### Why are the changes needed?
`labelsArray` is for multi-column case for `StringIndexer`. We should provide this accessor at PySpark side too.
### Does this PR introduce _any_ user-facing change?
Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.
### How was this patch tested?
Unit test.
Closes#30579 from viirya/SPARK-22798-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.
Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100]) \
.addGrid(lr.maxIter, [100, 200]) \
.build()
tvs = TrainValidationSplit(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=MulticlassClassificationEvaluator())
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~
~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#30539 from WeichenXu123/fix_tuning_param_maps_io.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.
- https://docs.python.org/3/library/unittest.html#deprecated-aliases
### Why are the changes needed?
For consistency and eventual removal of deprecated APIs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#30557 from BryanCutler/replace-deprecated-apis-in-tests.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add array_to_vector function for dataframe column
### Why are the changes needed?
Utility function for array to vector conversion.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
scala unit test & doctest.
Closes#30498 from WeichenXu123/array_to_vec.
Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes#30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
impl a new strategy `mode`: replace missing using the most frequent value along each column.
### Why are the changes needed?
it is highly scalable, and had been a function in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time.
### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added
### How was this patch tested?
updated testsuites
Closes#30397 from zhengruifeng/imputer_max_freq.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors;
### Why are the changes needed?
the performance gain is mainly related to the nnz of block.
### Does this PR introduce _any_ user-facing change?
yes, param blockSize -> blockSizeInMB in master
### How was this patch tested?
updated testsuites
Closes#30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Changes
pyspark.sql.dataframe.DataFrame
to
:py:class:`pyspark.sql.DataFrame`
### Why are the changes needed?
Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).
### Does this PR introduce _any_ user-facing change?
User will see shorter reference with a link.
### How was this patch tested?
`dev/lint-python` and manual check of the rendered docs.
Closes#30313 from zero323/SPARK-33251-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
This PR proposes migration of `pyspark.ml` to NumPy documentation style.
### Why are the changes needed?
To improve documentation style.
### Does this PR introduce _any_ user-facing change?
Yes, this changes both rendered HTML docs and console representation (SPARK-33243).
### How was this patch tested?
`dev/lint-python` and manual inspection.
Closes#30285 from zero323/SPARK-33251.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Increase tolerance for two tests that fail in some environments and fail in others (flaky? Pass/fail is constant within the same environment)
### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
predictions[0]
Expected:
Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This path changes a test target. Just executed the tests to verify they pass.
Closes#30104 from AlessandroPatti/apatti/rounding-errors.
Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR:
- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.
### Why are the changes needed?
Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests and additional MyPy checks:
```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```
Closes#29879 from zero323/SPARK-33002.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds type annotations directly to Spark source.
This can impact interaction with development tools for users, which haven't used `pyspark-stubs`.
### How was this patch tested?
- [x] MyPy tests of the PySpark source
```
mypy --no-incremental --config python/mypy.ini python/pyspark
```
- [x] MyPy tests of Spark examples
```
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```
- [x] Existing Flake8 linter
- [x] Existing unit tests
Tested against:
- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`
Closes#29591 from zero323/SPARK-32681.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
revert blockify gmm
### Why are the changes needed?
WeichenXu123 and I thought we should use memory size instead of number of rows to blockify instance; then if a buffer's size is large and determined by number of rows, we should discard it.
In GMM, we found that the pre-allocated memory maybe too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discuss and thought it is better to revert blockify GMM.
### Does this PR introduce _any_ user-facing change?
blockSize added in master branch will be removed
### How was this patch tested?
existing testsuites
Closes#29782 from zhengruifeng/unblockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR adjusts signatures of methods decorated with `keyword_only` to indicate using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).
__Note__:
For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579
### Why are the changes needed?
Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that.
In practice we could probably capture `locals` and drop `keyword_only` completel, i.e:
```python
keyword_only
def __init__(self, *, featuresCol="features"):
...
kwargs = self._input_kwargs
self.setParams(**kwargs)
```
could be replaced with
```python
def __init__(self, *, featuresCol="features"):
kwargs = locals()
del kwargs["self"]
...
self.setParams(**kwargs)
```
### Does this PR introduce _any_ user-facing change?
Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments.
For example with ` LinearSVC` will change from
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
self,
featuresCol='features',
labelCol='label',
predictionCol='prediction',
maxIter=100,
regParam=0.0,
tol=1e-06,
rawPredictionCol='rawPrediction',
fitIntercept=True,
standardization=True,
threshold=0.0,
weightCol=None,
aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2):
File: /path/to/python/pyspark/ml/classification.py
Type: function
```
to
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__ (
self,
*,
featuresCol='features',
labelCol='label',
predictionCol='prediction',
maxIter=100,
regParam=0.0,
tol=1e-06,
rawPredictionCol='rawPrediction',
fitIntercept=True,
standardization=True,
threshold=0.0,
weightCol=None,
aggregationDepth=2,
blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2, blockSize=1):
File: ~/Workspace/spark/python/pyspark/ml/classification.py
Type: function
```
### How was this patch tested?
Existing tests.
Closes#29799 from zero323/SPARK-32933.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Replace `__metaclass__` fields with `metaclass` keyword in the class statements.
### Why are the changes needed?
`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.
```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```
and classes, which suppose to be abstract, are not
```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```
### Does this PR introduce _any_ user-facing change?
Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix.
### How was this patch tested?
Existing tests.
Closes#29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
https://issues.apache.org/jira/browse/SPARK-32719
### What changes were proposed in this pull request?
Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard).
### Why are the changes needed?
To make sure that the quality of the Python code is up to standard.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit-tests and Flake8 static analysis
Closes#29563 from Fokko/fd-add-check-missing-imports.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.
### Why are the changes needed?
`copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.
Closes#29553 from Louiszr/fix-cv-copy.
Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods.
### Why are the changes needed?
Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`.
It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls.
Closes#29445 from Louiszr/master.
Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result.
`InheritableThread` and `KMeansSummary` should be into corresponding RST files for documentation.
### Why are the changes needed?
It seems like an oversight to not include this as all similar "summary" classes are.
`InheritableThread` should also be documented.
### Does this PR introduce _any_ user-facing change?
I don't believe there are functional changes. It should make this public class appear in docs.
### How was this patch tested?
Existing tests / N/A.
Closes#29470 from srowen/KMeansSummary.
Lead-authored-by: Sean Owen <srowen@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Disallow the use of unused imports:
- Unnecessary increases the memory footprint of the application
- Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :)
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused
python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused
python/pyspark/heapq3.py:868:5: F401 '_heapq.*' imported but unused
python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused
python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused
python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused
python/pyspark/rdd.py:21:1: F401 're' imported but unused
python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused
python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused
python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.*' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused
python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.*' imported but unused
python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.*' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused
python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused
python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.*' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused
python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.*' imported but unused
python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.*' imported but unused
python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused
python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused
python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.*' imported but unused
python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused
python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.*' imported but unused
python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.*' imported but unused
python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused
python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.*' imported but unused
python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.*' imported but unused
python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.*' imported but unused
python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.*' imported but unused
python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.*' imported but unused
python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused
python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.*' imported but unused
python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused
python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused
python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.*' imported but unused
python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.*' imported but unused
python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused
python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.*' imported but unused
python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.*' imported but unused
python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused
python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused
python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.*' imported but unused
python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.*' imported but unused
python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused
python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused
python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.*' imported but unused
python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.*' imported but unused
python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.*' imported but unused
python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.*' imported but unused
python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused
python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused
python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused
python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused
python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused
python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused
python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused
python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.*' imported but unused
python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.*' imported but unused
python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused
python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.*' imported but unused
python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.*' imported but unused
python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.*' imported but unused
python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused
python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused
python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.*' imported but unused
python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.*' imported but unused
python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.*' imported but unused
python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.*' imported but unused
python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.*' imported but unused
python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused
python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.*' imported but unused
python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.*' imported but unused
python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.*' imported but unused
python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused
python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused
python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused
python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.*' imported but unused
python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused
python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused
python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused
python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused
python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.*' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.*' imported but unused
python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.*' imported but unused
python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.*' imported but unused
python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.*' imported but unused
python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.*' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.*' imported but unused
python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused
python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.*' imported but unused
python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused
python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.*' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.*' imported but unused
python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused
python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused
python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused
python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused
python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused
```
After:
```
fokkodriesprongFan spark % flake8 python | grep -i "imported but unused"
fokkodriesprongFan spark %
```
### What changes were proposed in this pull request?
Removing unused imports from the Python files to keep everything nice and tidy.
### Why are the changes needed?
Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Adding the rule to the existing Flake8 checks.
Closes#29121 from Fokko/SPARK-32319.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change casting of map and struct values to strings by using the `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs by `[]` in casting to strings. Otherwise, if this is `false`, which is the default, maps and structs are wrapped by `{}`.
### Why are the changes needed?
- To distinguish structs/maps from arrays.
- To make `show`'s output consistent with Hive and conversions to Hive strings.
- To display dataframe content in the same form by `spark-sql` and `show`
- To be consistent with the `*.sql` tests
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suite `CastSuite`.
Closes#29308 from MaxGekk/show-struct-map.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
set params default values in trait Params for feature and tuning in both Scala and Python.
### Why are the changes needed?
Make ML has the same default param values between estimator and its corresponding transformer, and also between Scala and Python.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing and modified tests
Closes#29153 from huaxingao/default2.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>