### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/26809 added `Dataset.tail` API. It should be good to have it in PySpark API as well.
### Why are the changes needed?
To support consistent APIs.
### Does this PR introduce any user-facing change?
No. It adds a new API.
### How was this patch tested?
Manually tested and doctest was added.
Closes#27251 from HyukjinKwon/SPARK-30539.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds:
- `pyspark.ml.regression.JavaRegressor`
- `pyspark.ml.regression.JavaRegressionModel`
classes and replaces `JavaPredictor` and `JavaPredictionModel` in
- `LinearRegression` / `LinearRegressionModel`
- `DecisionTreeRegressor` / `DecisionTreeRegressionModel` (just addition as `JavaPredictionModel` hasn't been used)
- `RandomForestRegressor` / `RandomForestRegressionModel` (just addition as `JavaPredictionModel` hasn't been used)
- `GBTRegressor` / `GBTRegressionModel` (just addition as `JavaPredictionModel` hasn't been used)
- `AFTSurvivalRegression` / `AFTSurvivalRegressionModel`
- `GeneralizedLinearRegression` / `GeneralizedLinearRegressionModel`
- `FMRegressor` / `FMRegressionModel`
### Why are the changes needed?
- Internal PySpark consistency.
- Feature parity with Scala.
- Intermediate step towards implementing [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212)
### Does this PR introduce any user-facing change?
It adds new base classes, so it will affect `mro`. Otherwise interfaces should stay intact.
### How was this patch tested?
Existing tests.
Closes#27241 from zero323/SPARK-30533.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Another followup of 4398dfa709
I missed two more tests added:
```
======================================================================
ERROR [0.133s]: test_to_pandas_from_mixed_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 617, in test_to_pandas_from_mixed_dataframe
self.assertTrue(np.all(pdf_with_only_nulls.dtypes == pdf_with_some_nulls.dtypes))
AssertionError: False is not true
======================================================================
ERROR [0.061s]: test_to_pandas_from_null_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 588, in test_to_pandas_from_null_dataframe
self.assertEqual(types[0], np.float64)
AssertionError: dtype('O') != <class 'numpy.float64'>
----------------------------------------------------------------------
```
### Why are the changes needed?
To make the test independent of default values of configuration.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested and Jenkins should test.
Closes#27250 from HyukjinKwon/SPARK-29188-followup2.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to explicitly disable Arrow execution for the test of toPandas empty types. If `spark.sql.execution.arrow.pyspark.enabled` is enabled by default, this test alone fails as below:
```
======================================================================
ERROR [0.205s]: test_to_pandas_from_empty_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../pyspark/sql/tests/test_dataframe.py", line 568, in test_to_pandas_from_empty_dataframe
self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df))
AssertionError: False is not true
----------------------------------------------------------------------
```
it should be best to explicitly disable for the test that only works when it's disabled.
### Why are the changes needed?
To make the test independent of default values of configuration.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested and Jenkins should test.
Closes#27247 from HyukjinKwon/SPARK-29188-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to remove the SQL config `spark.sql.execution.pandas.respectSessionTimeZone` which has been deprecated since Spark 2.3.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
by running python tests, https://spark.apache.org/docs/latest/building-spark.html#pyspark-tests-with-maven-or-sbtCloses#27218 from MaxGekk/remove-respectSessionTimeZone.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test
### Why are the changes needed?
In ```OneVsRest```, if the classifier doesn't support instance weight, ```OneVsRest``` weightCol will be ignored, so unit test has tested one classifier(```LogisticRegression```) that support instance weight, and one classifier (```DecisionTreeClassifier```) that doesn't support instance weight. Since ```DecisionTreeClassifier``` now supports instance weight, we need to change it to the classifier that doesn't have weight support.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test
Closes#27204 from huaxingao/spark-ovr-minor.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add setInputCol/setOutputCol in OHEModel
### Why are the changes needed?
setInputCol/setOutputCol should be in OHEModel too.
### Does this PR introduce any user-facing change?
Yes.
```OHEModel.setInputCol```
```OHEModel.setOutputCol```
### How was this patch tested?
Manually tested.
Closes#27228 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/27109. It should match the parameter lists in `createDataFrame`.
### Why are the changes needed?
To pass parameters supposed to pass.
### Does this PR introduce any user-facing change?
No (it's only in master)
### How was this patch tested?
Manually tested and existing tests should cover.
Closes#27225 from HyukjinKwon/SPARK-30434-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removal of following `Param` fields:
- `factorSize`
- `fitLinear`
- `miniBatchFraction`
- `initStd`
- `solver`
from `FMClassifier` and `FMRegressor`
### Why are the changes needed?
This `Param` members are already provided by `_FactorizationMachinesParams`
0f3d744c3f/python/pyspark/ml/regression.py (L2303-L2318)
which is mixed into `FMRegressor`:
0f3d744c3f/python/pyspark/ml/regression.py (L2350)
and `FMClassifier`:
0f3d744c3f/python/pyspark/ml/classification.py (L2793)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual testing.
Closes#27205 from zero323/SPARK-30378-FOLLOWUP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`.
### Why are the changes needed?
Currently both `Params` don't preserve `weightCol` `Params` when data is saved / loaded:
```python
from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel
from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features"))
ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
ovrm = ovr.fit(df)
ovr.getWeightCol()
## 'w'
ovrm.getWeightCol()
## 'w'
ovr.write().overwrite().save("/tmp/ovr")
ovr_ = OneVsRest.load("/tmp/ovr")
ovr_.getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...)
ovrm.write().overwrite().save("/tmp/ovrm")
ovrm_ = OneVsRestModel.load("/tmp/ovrm")
ovrm_ .getWeightCol()
## KeyError
## ...
## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ...
```
### Does this PR introduce any user-facing change?
After this PR is merged, loaded objects will have `weightCol` `Param` set.
### How was this patch tested?
- Manual testing.
- Extension of existing persistence tests.
Closes#27190 from zero323/SPARK-30504.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights
2, make RF supports weights
### Why are the changes needed?
`weightCol` is already exposed, while RF has not support weights.
### Does this PR introduce any user-facing change?
Yes, new setters
### How was this patch tested?
added testsuites
Closes#27097 from zhengruifeng/rf_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some parity issues between python and scala
### Why are the changes needed?
keep parity between python and scala
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
existing tests
Closes#27196 from huaxingao/spark-30498.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Fix all the failed tests when enable AQE.
### Why are the changes needed?
Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests
Closes#26813 from JkSelf/enableAQEDefault.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
add weight support in BisectingKMeans
### Why are the changes needed?
BisectingKMeans should support instance weighting
### Does this PR introduce any user-facing change?
Yes. BisectingKMeans.setWeight
### How was this patch tested?
Unit test
Closes#27035 from huaxingao/spark_30351.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make Regressors extend abstract class Regressor:
```AFTSurvivalRegression extends Estimator => extends Regressor```
```DecisionTreeRegressor extends Predictor => extends Regressor```
```FMRegressor extends Predictor => extends Regressor```
```GBTRegressor extends Predictor => extends Regressor```
```RandomForestRegressor extends Predictor => extends Regressor```
We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType.
### Why are the changes needed?
Make class hierarchy consistent for all Regressors
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#27168 from huaxingao/spark-30377.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol` methods.
### Why are the changes needed?
Aforementioned methods shouldn't by included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application.
### Does this PR introduce any user-facing change?
Not beyond scope of SPARK-29093].
### How was this patch tested?
Existing tests.
CC huaxingao zhengruifeng
Closes#27181 from zero323/SPARK-30493.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR proposes to increase the memory in `WorkerMemoryTest.test_memory_limit` in order to make the test pass with PyPy.
The test is currently failed only in PyPy as below in some PRs unexpectedly:
```
Current mem limits: 18446744073709551615 of max 18446744073709551615
Setting mem limits to 1048576 of max 1048576
RPython traceback:
File "pypy_module_pypyjit_interp_jit.c", line 289, in portal_5
File "pypy_interpreter_pyopcode.c", line 3468, in handle_bytecode__AccessDirect_None
File "pypy_interpreter_pyopcode.c", line 5558, in dispatch_bytecode__AccessDirect_None
out of memory: couldn't allocate the next arena
ERROR
```
It seems related to how PyPy allocates the memory and GC works PyPy-specifically. There seems nothing wrong in this configuration implementation itself in PySpark side.
I roughly tested in higher PyPy versions on Ubuntu (PyPy v7.3.0) and this test seems passing fine so I suspect this might be an issue in old PyPy behaviours.
The change only increases the limit so it would not affect actual memory allocations. It just needs to test if the limit is properly set in worker sides. For clarification, the memory is unlimited in the machine if not set.
### Why are the changes needed?
To make the tests pass and unblock other PRs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually and Jenkins should test it out.
Closes#27186 from HyukjinKwon/SPARK-30480.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removing the sorting of PySpark SQL Row fields that were previously sorted by name alphabetically for Python versions 3.6 and above. Field order will now match that as entered. Rows will be used like tuples and are applied to schema by position. For Python versions < 3.6, the order of kwargs is not guaranteed and therefore will be sorted automatically as in previous versions of Spark.
### Why are the changes needed?
This caused inconsistent behavior in that local Rows could be applied to a schema by matching names, but once serialized the Row could only be used by position and the fields were possibly in a different order.
### Does this PR introduce any user-facing change?
Yes, Row fields are no longer sorted alphabetically but will be in the order entered. For Python < 3.6 `kwargs` can not guarantee the order as entered, so `Row`s will be automatically sorted.
An environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" can be set that will override construction of `Row` to maintain compatibility with Spark 2.x.
### How was this patch tested?
Existing tests are run with PYSPARK_ROW_FIELD_SORTING_ENABLED=true and added new test with unsorted fields for Python 3.6+
Closes#26496 from BryanCutler/pyspark-remove-Row-sorting-SPARK-29748.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This patch increases the memory limit in the test 'test_memory_limit' from 1m to 8m.
Credit to srowen and HyukjinKwon to provide the idea of suspicion and guide how to fix.
### Why are the changes needed?
We observed consistent Pyspark test failures on multiple PRs (#26955, #26201, #27064) which block the PR builds whenever the test is included.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Jenkins builds passed in WIP PR (#27159)
Closes#27162 from HeartSaVioR/SPARK-30480.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add ```predict``` and ```numFeatures``` in Python ```IsotonicRegressionModel```
### Why are the changes needed?
```IsotonicRegressionModel``` doesn't extend ```JavaPredictionModel```, so it doesn't get ```predict``` and ```numFeatures``` from the super class.
### Does this PR introduce any user-facing change?
Yes. Python version of
```
IsotonicRegressionModel.predict
IsotonicRegressionModel.numFeatures
```
### How was this patch tested?
doctest
Closes#27122 from huaxingao/spark-30452.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds a note that we're not adding "pandas compatible" aliases anymore.
### Why are the changes needed?
We added "pandas compatible" aliases as of https://github.com/apache/spark/pull/5544 and https://github.com/apache/spark/pull/6066 . There are too many differences and I don't think it makes sense to add such aliases anymore at this moment.
I was even considering deprecating them out but decided to take a more conservative approache by just documenting it.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover.
Closes#27142 from HyukjinKwon/SPARK-30464.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move pandas related functionalities into pandas package. Namely:
```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py # Conversion between pandas <> PySpark DataFrames
├── functions.py # pandas_udf
├── group_ops.py # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py # Type utils between pandas <> PyArrow
└── utils.py # Version requirement checks
```
In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under `pandas` sub-package, I had to use a mix-in approach which Scala side uses often by `trait`, and also pandas itself uses this approach (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think it's like Scala's self typed trait. See the structure below:
```python
class PandasMapOpsMixin(object):
def mapInPandas(self, ...):
...
return ...
# other Pandas <> PySpark APIs
```
```python
class DataFrame(PandasMapOpsMixin):
# other DataFrame APIs equivalent to Scala side.
```
Yes, This is a big PR but they are mostly just moving around except one case `createDataFrame` which I had to split the methods.
### Why are the changes needed?
There are pandas functionalities here and there and I myself gets lost where it was. Also, when you have to make a change commonly for all of pandas related features, it's almost impossible now.
Also, after this change, `DataFrame` and `SparkSession` become more consistent with Scala side since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover. Also, I manually built the PySpark API documentation and checked.
Closes#27109 from HyukjinKwon/pandas-refactoring.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds a note first and last can be non-deterministic in SQL function docs as well.
This is already documented in `functions.scala`.
### Why are the changes needed?
Some people look reading SQL docs only.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Jenkins will test.
Closes#27099 from HyukjinKwon/SPARK-30335.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds a note that UserDefinedFunction's constructor is private.
### Why are the changes needed?
To match with Scala side. Scala side does not have it at all.
### Does this PR introduce any user-facing change?
Doc only changes but it declares UserDefinedFunction's constructor is private explicitly.
### How was this patch tested?
Jenkins
Closes#27101 from HyukjinKwon/SPARK-30430.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
df.select(vector_to_array(col("features"))
```
### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in JVM. However, it requires PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure python project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes#26910 from WeichenXu123/vector_to_array.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```
### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap```
### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now
### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.
Closes#26838 from huaxingao/spark-30144.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
expose predictRaw and predictProbability on Python side
### Why are the changes needed?
to keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes. Expose python ```predictRaw``` and ```predictProbability```
### How was this patch tested?
doctest
Closes#27082 from huaxingao/spark-30358.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add getter/setter in Python FM
### Why are the changes needed?
to be consistent with other algorithms
### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel
### How was this patch tested?
doctest
Closes#27044 from huaxingao/spark-30378.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/26780
In https://github.com/apache/spark/pull/26780, a new Avro data source option `actualSchema` is introduced for setting the original Avro schema in function `from_avro`, while the expected schema is supposed to be set in the parameter `jsonFormatSchema` of `from_avro`.
However, there is another Avro data source option `avroSchema`. It is used for setting the expected schema in readiong and writing.
This PR is to use the option `avroSchema` option for reading Avro data with an evolved schema and remove the new one `actualSchema`
### Why are the changes needed?
Unify and simplify the Avro data source options.
### Does this PR introduce any user-facing change?
Yes.
To deserialize Avro data with an evolved schema, before changes:
```
from_avro('col, expectedSchema, ("actualSchema" -> actualSchema))
```
After changes:
```
from_avro('col, actualSchema, ("avroSchema" -> expectedSchema))
```
The second parameter is always the actual Avro schema after changes.
### How was this patch tested?
Update the existing tests in https://github.com/apache/spark/pull/26780Closes#27045 from gengliangwang/renameAvroOption.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added testsuits
Closes#26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a corresponding class MultivariateGaussian containing a vector and a matrix on the py side, so gaussian can be used on the py side.
### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```
### How was this patch tested?
doctest
Closes#27020 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
compute the medians/ranges more distributedly
### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from executors,
since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`).
In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and existing impl fail due to OOM.
After this PR, it will sucessfuly fit the model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#26803 from zhengruifeng/robust_high_dim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate).
Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation.
- `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))
**Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete.
### Why are the changes needed?
To document available options in sources properly.
### Does this PR introduce any user-facing change?
In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.
### How was this patch tested?
Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only:
```bash
$ ls -al tmp
...
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc
```
```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```
```
+-----+
|value|
+-----+
| ac|
| cc|
+-----+
```
Closes#26958 from HyukjinKwon/doc-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79
2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1
### Why are the changes needed?
Shouldn't change master.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-masterCloses#26915 from wangyum/revert-master.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
expose gaussian in PySpark
### Why are the changes needed?
A ```GaussianMixtureModel``` contains two parts of coefficients: ```weights``` & ```gaussians```. However, ```gaussians``` is not exposed on Python side.
### Does this PR introduce any user-facing change?
Yes. ```GaussianMixtureModel.gaussians``` is exposed in PySpark.
### How was this patch tested?
add doctest
Closes#26882 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info.
### Why are the changes needed?
Documentation improvement
### Does this PR introduce any user-facing change?
No (doc-only change).
### How was this patch tested?
Manually tested by documentation build.
Closes#26896 from bboutkov/pyspark_doc_fix.
Authored-by: Boris Boutkov <boris.boutkov@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR mainly targets:
1. Expose only explain(mode: String) in Scala side
2. Clean up related codes
- Hide `ExplainMode` under private `execution` package. No particular reason but just because `ExplainUtils` exists there
- Use `case object` + `trait` pattern in `ExplainMode` to look after `ParseMode`.
- Move `Dataset.toExplainString` to `QueryExecution.explainString` to look after `QueryExecution.simpleString`, and deduplicate the codes at `ExplainCommand`.
- Use `ExplainMode` in `ExplainCommand` too.
- Add `explainString` to `PythonSQLUtils` to avoid unexpected test failure of PySpark during refactoring Scala codes side.
### Why are the changes needed?
To minimised exposed APIs, deduplicate, and clean up.
### Does this PR introduce any user-facing change?
`Dataset.explain(mode: ExplainMode)` will be removed (which only exists in master).
### How was this patch tested?
Manually tested and existing tests should cover.
Closes#26898 from HyukjinKwon/SPARK-30200-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr is a followup of #26861 to address minor comments from viirya.
### Why are the changes needed?
For better error messages.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested.
Closes#26886 from maropu/SPARK-30231-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
- Reverts commit 1f94bf4 and d6be46e
- Switches python to python3 in Docker release image.
### Why are the changes needed?
`dev/make-distribution.sh` and `python/setup.py` are use python3.
https://github.com/apache/spark/pull/26844/files#diff-ba2c046d92a1d2b5b417788bfb5cb5f8L236https://github.com/apache/spark/pull/26330/files#diff-8cf6167d58ce775a08acafcfe6f40966
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
```
yumwangubuntu-3513086:~/spark$ dev/create-release/do-release-docker.sh -n -d /home/yumwang/spark-release
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]: 3.0.0-preview2
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [master]:
ASF user [yumwang]:
Full name [Yuming Wang]:
GPG key [yumwangapache.org]: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
================
Release details:
BRANCH: master
VERSION: 3.0.0-preview2
TAG: v3.0.0-preview2-rc1
NEXT: 3.0.1-SNAPSHOT
ASF USER: yumwang
GPG KEY: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
FULL NAME: Yuming Wang
E-MAIL: yumwangapache.org
================
Is this info correct [y/n]? y
GPG passphrase:
========================
= Building spark-rm image with tag latest...
Command: docker build -t spark-rm:latest --build-arg UID=110302528 /home/yumwang/spark/dev/create-release/spark-rm
Log file: docker-build.log
Building v3.0.0-preview2-rc1; output will be at /home/yumwang/spark-release/output
gpg: directory '/home/spark-rm/.gnupg' created
gpg: keybox '/home/spark-rm/.gnupg/pubring.kbx' created
gpg: /home/spark-rm/.gnupg/trustdb.gpg: trustdb created
gpg: key 6E1B4122F6A3A338: public key "Yuming Wang <yumwangapache.org>" imported
gpg: key 6E1B4122F6A3A338: secret key imported
gpg: Total number processed: 1
gpg: imported: 1
gpg: secret keys read: 1
gpg: secret keys imported: 1
========================
= Creating release tag v3.0.0-preview2-rc1...
Command: /opt/spark-rm/release-tag.sh
Log file: tag.log
It may take some time for the tag to be synchronized to github.
Press enter when you've verified that the new tag (v3.0.0-preview2-rc1) is available.
========================
= Building Spark...
Command: /opt/spark-rm/release-build.sh package
Log file: build.log
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
========================
= Publishing release
Command: /opt/spark-rm/release-build.sh publish-release
Log file: publish.log
```
Generated doc:
![image](https://user-images.githubusercontent.com/5399861/70693075-a7723100-1cf7-11ea-9f88-9356a02349a1.png)
Closes#26848 from wangyum/SPARK-30216.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr intends to support explain modes implemented in #26829 for PySpark.
### Why are the changes needed?
For better debugging info. in PySpark dataframes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes#26861 from maropu/ExplainModeInPython.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
An empty Spark DataFrame converted to a Pandas DataFrame wouldn't have the right column types. Several type mappings were missing.
### Why are the changes needed?
Empty Spark DataFrames can be used to write unit tests, and verified by converting them to Pandas first. But this can fail when the column types are wrong.
### Does this PR introduce any user-facing change?
Yes; the error reported in the JIRA issue should not happen anymore.
### How was this patch tested?
Through unit tests in `pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe`
Closes#26747 from dlindelof/SPARK-29188.
Authored-by: David <dlindelof@expediagroup.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Follow up of https://github.com/apache/spark/pull/24405
### What changes were proposed in this pull request?
The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized.
The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before.
### Why are the changes needed?
Consider the following example.
```
// schema ID: 1
val schema1 = """
{
"type": "record",
"name": "MySchema",
"fields": [
{"name": "col1", "type": "int"},
{"name": "col2", "type": "string"}
]
}
"""
// schema ID: 2
val schema2 = """
{
"type": "record",
"name": "MySchema",
"fields": [
{"name": "col1", "type": "int"},
{"name": "col2", "type": "string"},
{"name": "col3", "type": "string", "default": ""}
]
}
"""
```
The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value.
Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. **We want to combine them into one dataframe** for storing or further processing.
With the current `from_avro` function we can only decode each of them with the corresponding schema:
```
scalaval df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scalaval decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]
scalaval df2= ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scalaval decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```
but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way:
```
scalaimport scala.collection.JavaConverters._
scalaval decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```
so that both dataframes have the same schemas and can be merged.
### Does this PR introduce any user-facing change?
This PR allows users to pass a new configuration but it doesn't affect current code.
### How was this patch tested?
A new unit test was added.
Closes#26780 from Fokko/SPARK-27506.
Lead-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR aims to remove deprecation warnings by importing ABCs from `collections.abc` instead of `collections`.
- https://github.com/python/cpython/pull/10596
### Why are the changes needed?
This will remove deprecation warnings in Python 3.7 and 3.8.
```
$ python -V
Python 3.7.5
$ python python/pyspark/resultiterable.py
python/pyspark/resultiterable.py:23: DeprecationWarning:
Using or importing the ABCs from 'collections' instead of from 'collections.abc'
is deprecated since Python 3.3,and in 3.9 it will stop working
class ResultIterable(collections.Iterable):
```
### Does this PR introduce any user-facing change?
No, this doesn't introduce user-facing change
### How was this patch tested?
Manually because this is about deprecation warning messages.
Closes#26835 from tirkarthi/spark-30205-fix-abc-warnings.
Authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>