Commit graph

2672 commits

Author SHA1 Message Date
Peter Toth ab8a9a0ceb [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite
### What changes were proposed in this pull request?

pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently equal and the second instance is replaced to the short code of the first one during serialization.

### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:

```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
        def u1(e):
            return e[0]
        return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
```
This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113

After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0|  1|
+---+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new UT + manual tests.

Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-07 19:12:42 -06:00
Sean Owen 2f30cdebb1 [SPARK-34642][DOCS][ML] Fix TypeError in Pyspark Linear Regression docs
### What changes were proposed in this pull request?

Fix a call to setParams in the Linear Regression docs example in Pyspark to avoid a TypeError.

### Why are the changes needed?

The example is slightly wrong and we should not show an error in the docs.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Existing tests

Closes #31760 from srowen/SPARK-34642.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-06 07:32:01 -08:00
Takuya UESHIN 331d459ee7 [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests
### What changes were proposed in this pull request?

Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values.

### Why are the changes needed?

The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` is always returning `v + 1` regardless of its type. The return type of the UDF is 'double', so if the input is int, the result will be `null`.

```py
>>> df = spark.range(10).toDF('id') \
...             .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
...             .withColumn("v", explode(col('vs'))) \
...             .drop('vs') \
...             .withColumn('w', lit(1.0))
>>> udf('double')
... def plus_one(v):
...   assert isinstance(v, (int, float))
...   return v + 1
...
>>> pandas_udf('double', PandasUDFType.GROUPED_AGG)
... def sum_udf(v):
...   return v.sum()
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
|        null|    2900.0|
+------------+----------+
```

This is meaningless and should be:

```py
>>> udf('double')
... def plus_one(v):
...   assert isinstance(v, (int, float))
...   return float(v + 1)
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
|         1.0|     245.0|
|         2.0|     255.0|
|         3.0|     265.0|
|         4.0|     275.0|
|         5.0|     285.0|
|         6.0|     295.0|
|         7.0|     305.0|
|         8.0|     315.0|
|         9.0|     325.0|
|        10.0|     335.0|
+------------+----------+
```

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Fixed the test.

Closes #31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-04 10:03:54 +09:00
HyukjinKwon 800590035c [SPARK-34604][PYTHON][TESTS] Use eventually in TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse
### What changes were proposed in this pull request?

`TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse` can be flaky and fails sometimes:

```
======================================================================
ERROR [1.798s]: test_task_context_correct_with_python_worker_reuse (pyspark.tests.test_taskcontext.TaskContextTestsWithWorkerReuse)
...
test_task_context_correct_with_python_worker_reuse
    self.assertTrue(pid in worker_pids)
AssertionError: False is not true

----------------------------------------------------------------------
```

I suspect that the Python worker was killed for whatever reason and new attempt created a new Python worker.

This PR fixes the flakiness simply by retrying the test case.

### Why are the changes needed?

To make the tests more robust.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested it by controlling the conditions manually in the test codes.

Closes #31723 from HyukjinKwon/SPARK-34604.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-04 08:40:48 +09:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - only adds new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with  its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
HyukjinKwon b5470ae294 [MINOR][DOCS] Replace http to https when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes:
- Change http to https for better security
- Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/)

### Why are the changes needed?

For better security, and to use official link.

### Does this PR introduce _any_ user-facing change?

Yes, It exposes more secure and correct links to the PySpark end users in PySpark documentation.

### How was this patch tested?

I manually checked if each link works

Closes #31616 from HyukjinKwon/minor-https.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:18:47 +09:00
“attilapiros” bdcad33d8b [SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler
### What changes were proposed in this pull request?

Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.

Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".

After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files

This process for version update is tested for details please check the testing section.

### Why are the changes needed?

Using different Jekyll versions can generate different output documents.
This PR standardize the process.

### Does this PR introduce _any_ user-facing change?

No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case  there should be no difference at all as the same Jekyll version is specified in the Gemfile.

### How was this patch tested?

#### Testing document generation

Doc generation step was triggered via  the docker release:

```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```

The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```

#### Testing Jekyll (or other gem) update

First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle

# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```

Testing Jekyll version before the update:

```
$ bundle exec jekyll --version
jekyll 4.1.0
```

Imitating Jekyll update coming from git by reverting my local changes:

```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"

$ git checkout Gemfile.lock
Updated 1 path from the index
```

Run the install:

```
$ bundle install
...
```

Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```

Closes #31559 from attilapiros/pin-jekyll-version.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 12:17:57 +09:00
Max Gekk 5957bc18a1 [SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs from the `legacy` namespace by:
1. Renaming of the existing rebase configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Add the legacy configs as alternatives
3. Deprecate the legacy rebase configs.

### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datatime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered as legacy configs.

### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.

### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.

scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```

Closes #31576 from MaxGekk/rebase-confs-alternatives.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-17 14:04:47 +00:00
Eric Lemmon e3b6e4ad43 [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs
### What changes were proposed in this pull request?
Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well.

### Why are the changes needed?
No docs were generated for `pyspark.sql.conf.RuntimeConfig`.

### Does this PR introduce _any_ user-facing change?
Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch.

### How was this patch tested?
First built the Python docs:
```bash
cd $SPARK_HOME/docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```
Then verified all pages and links:
1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page:
http://localhost:4000/api/python/reference/index.html
![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png)

2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page:
http://localhost:4000/api/python/reference/pyspark.sql.html#configuration
![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)**

3. RuntimeConfig page displayed:
http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html
![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png)

4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page:
http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html
![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png)

Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable.

Authored-by: Eric Lemmon <eric@lemmon.cc>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-13 09:32:55 -06:00
HyukjinKwon 92a83463c9 [SPARK-34408][PYTHON] Refactor spark.udf.register to share the same path to generate UDF instance
### What changes were proposed in this pull request?

This PR proposes to use `_create_udf` where we need to create `UserDefinedFunction` to maintain codes easier.

### Why are the changes needed?

For the better readability of codes and maintenance.

### Does this PR introduce _any_ user-facing change?

No, refactoring.

### How was this patch tested?

Ran the existing unittests. CI in this PR should test it out too.

Closes #31537 from HyukjinKwon/SPARK-34408.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-11 10:57:02 +09:00
David Li 9b875ceada [SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas
### What changes were proposed in this pull request?

Creating a Pandas dataframe via Apache Arrow currently can use twice as much memory as the final result, because during the conversion, both Pandas and Arrow retain a copy of the data. Arrow has a "self-destruct" mode now (Arrow >= 0.16) to avoid this, by freeing each column after conversion. This PR integrates support for this in toPandas, handling a couple of edge cases:

self_destruct has no effect unless the memory is allocated appropriately, which is handled in the Arrow serializer here. Essentially, the issue is that self_destruct frees memory column-wise, but Arrow record batches are oriented row-wise:

```
Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ...
Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ...
```

In this scenario, Arrow will drop references to all of column 0's chunks, but no memory will actually be freed, as the chunks were just slices of an underlying allocation. The PR copies each column into its own allocation so that memory is instead arranged as so:

```
Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk 0, ...
Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk 1, ...
```

The optimization is disabled by default, and can be enabled with the Spark SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" set to "true". We can't always apply this optimization because it's more likely to generate a dataframe with immutable buffers, which Pandas doesn't always handle well, and because it is slower overall (since it only converts one column at a time instead of in parallel).

### Why are the changes needed?

This lets us load larger datasets - in particular, with N bytes of memory, before we could never load a dataset bigger than N/2 bytes; now the overhead is more like N/1.25 or so.

### Does this PR introduce _any_ user-facing change?

Yes - it adds a new SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled"

### How was this patch tested?

See the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html) - it was tested with Python memory_profiler. Unit tests added to check memory within certain bounds and correctness with the option enabled.

Closes #29818 from lidavidm/spark-32953.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2021-02-10 09:58:46 -08:00
Liang-Chi Hsieh 1fbd576410 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document
### What changes were proposed in this pull request?

This follows up #31160 to update score function in the document.

### Why are the changes needed?

Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me the sklearn's naming. It is good to have it but I think it is nice if we have formal score function name with sklearn's ones.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, only doc change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-10 09:24:25 +09:00
Max Gekk a85490659f [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In the PR, I propose new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`

Both options influence on loading ancient dates and timestamps column values from parquet files. The `datetimeRebaseMode` option impacts on loading values of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, `int96RebaseMode` impacts on loading of `INT96` timestamps.

The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead` namely;
- `"LEGACY"`, when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`, dates/timestamps are read AS IS from parquet files.
- `"EXCEPTION"`, when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars.

### Why are the changes needed?
1. New options will allow to load parquet files from at least two sources in different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, it is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead`  influences on both reads.

2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31489 from MaxGekk/parquet-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 13:28:40 +00:00
gengjiaan 2c243c93d9 [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly
### What changes were proposed in this pull request?
The current implement of some DDL not unify the output and not pass the output properly to physical command.
Such as: The `ShowTables` output attributes `namespace`, but `ShowTablesCommand` output attributes `database`.

As the query plan, this PR pass the output attributes from `ShowTables` to `ShowTablesCommand`, `ShowTableExtended ` to `ShowTablesCommand`.

Take `show tables` and `show table extended like 'tbl'` as example.
The output before this PR:
`show tables`
|database|tableName|isTemporary|
-- | -- | --
| default|      tbl|      false|

If catalog is v2 session catalog, the output before this PR:
|namespace|tableName|
-- | --
| default|      tbl

`show table extended like 'tbl'`
|database|tableName|isTemporary|         information|
-- | -- | -- | --
| default|      tbl|      false|Database: default...|

The output after this PR:
`show tables`
|namespace|tableName|isTemporary|
-- | -- | --
|  default|      tbl|      false|

`show table extended like 'tbl'`
|namespace|tableName|isTemporary|         information|
-- | -- | -- | --
|  default|      tbl|      false|Database: default...|

### Why are the changes needed?
This PR have benefits as follows:
First, Unify schema for the output of SHOW TABLES.
Second, pass the output attributes could keep the expr ID unchanged, so that avoid bugs when we apply more operators above the command output dataframe.

### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replace `database` by `namespace`.

### How was this patch tested?
Jenkins test.

Closes #31245 from beliefer/SPARK-34157.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 08:39:58 +00:00
Xinrong Meng 747ad1809b [PYTHON][MINOR] Fix docstring of DataFrame.join
### What changes were proposed in this pull request?
Fix docstring of PySpark `DataFrame.join`.

### Why are the changes needed?
For a better view of PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No (only documentation changes).

### How was this patch tested?
Manual test.

From
![image](https://user-images.githubusercontent.com/47337188/106977730-c14ab080-670f-11eb-8df8-5aea90902104.png)

To
![image](https://user-images.githubusercontent.com/47337188/106977834-ed663180-670f-11eb-9c5e-d09be26e0ca8.png)

Closes #31463 from xinrong-databricks/fixDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-06 09:08:49 -06:00
yi.wu e9362c2571 [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas
### What changes were proposed in this pull request?

Resolve duplicate attributes for `FlatMapCoGroupsInPandas`.

### Why are the changes needed?

When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example,

```scala
df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()
```
error:

```scala
...
Conflicting attributes: column#163321L,value#163322L
;;
’Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
:     +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
      +- LogicalRDD [column#163312L, value#163313L], false
...
```

### Does this PR introduce _any_ user-facing change?

yes, the query like the above example won't fail.

### How was this patch tested?

Adde unit tests.

Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:25:32 +09:00
David Toneian d99d0d27be [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of dev/lint-python
This changeset is published into the public domain.

### What changes were proposed in this pull request?

Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed.

### Why are the changes needed?
In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed.

### Does this PR introduce _any_ user-facing change?

Slight improvements in documentation, and in standard output of `dev/lint-python`.

### How was this patch tested?

Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change.

Closes #31401 from DavidToneian/SPARK-34300.

Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:30:50 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and add new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
Ruifeng Zheng 2c4e4f8412 [SPARK-34189][ML] w2v findSynonyms optimization
### What changes were proposed in this pull request?
1, use Guavaording instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector

### Why are the changes needed?
this pr is about 30% faster than existing impl

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
existing testsuites

Closes #31276 from zhengruifeng/w2v_findSynonyms_opt.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-01-27 10:08:53 +08:00
Takuya UESHIN 43fdd1271e [SPARK-33489][PYSPARK] Add NullType support for Arrow executions
### What changes were proposed in this pull request?

Adds `NullType` support for Arrow executions.

### Why are the changes needed?

As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled.

### Does this PR introduce _any_ user-facing change?

Yes, if a user has a DataFrame including `NullType`, it will be able to convert with Arrow enabled.

### How was this patch tested?

Added tests.

Closes #31285 from ueshin/issues/SPARK-33489/arrow_nulltype.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-25 11:34:47 +09:00
pgrz 121eb0130e [SPARK-34191][PYTHON][SQL] Add typing for udf overload
### What changes were proposed in this pull request?
Added typing for keyword-only single argument udf overload.

### Why are the changes needed?

The intended use case is:
```
udf(returnType="string")
def f(x): ...
```

### Does this PR introduce _any_ user-facing change?

Yes - a new typing for udf is considered valid.

### How was this patch tested?

Existing tests.

Closes #31282 from pgrz/patch-1.

Authored-by: pgrz <grzegorski.piotr@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-22 21:19:20 +09:00
itholic 28131a7794 [SPARK-34190][DOCS] Supplement the description for Python Package Management
### What changes were proposed in this pull request?

This PR supplements the contents in the "Python Package Management".

If there is no Python installed in the local for all nodes when using `venv-pack`, job would fail as below.

```python
>>> from pyspark.sql.functions import pandas_udf
>>> pandas_udf('double')
... def pandas_plus_one(v: pd.Series) -> pd.Series:
...     return v + 1
...
>>> spark.range(10).select(pandas_plus_one("id")).show()
...
Cannot run program "./environment/bin/python": error=2, No such file or directory
...
```

This is because the Python in the [packed environment via `venv-pack` has a symbolic link](https://github.com/jcrist/venv-pack/issues/5) that connects Python to the local one.

To avoid this confusion, it seems better to have an additional explanation for this.

### Why are the changes needed?

To provide more detailed information to users so that they don’t get confused

### Does this PR introduce _any_ user-facing change?

Yes, this PR fixes the part of "Python Package Management"  in the "User Guide" documents.

### How was this patch tested?

Manually built the doc.

![Screen Shot 2021-01-21 at 7 10 38 PM](https://user-images.githubusercontent.com/44108233/105336258-5e8bec00-5c1c-11eb-870c-86acfc77c082.png)

Closes #31280 from itholic/SPARK-34190.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-21 22:15:42 +09:00
HyukjinKwon 0130a3813a [SPARK-33730][PYTHON][FOLLOW-UP] Consider the case when the current frame is None
### What changes were proposed in this pull request?

This PR proposes to consider the case when [`inspect.currentframe()`](https://docs.python.org/3/library/inspect.html#inspect.currentframe) returns `None` because the underlyining Python implementation does not support frame.

### Why are the changes needed?

To be safer and potentially for the official support of other Python implementations in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via:

When frame is available:

```
vi tmp.py
```

```python
from inspect import *
lineno = getframeinfo(currentframe()).lineno + 1 if currentframe() is not None else 0
print(warnings.formatwarning(
    "Failed to set memory limit: {0}".format(Exception("argh!")),
    ResourceWarning,
    __file__,
    lineno),
    file=sys.stderr)
```

```
python tmp.py
```

```
/.../tmp.py:3: ResourceWarning: Failed to set memory limit: argh!
  print(warnings.formatwarning(
```

When frame is not available:

```
vi tmp.py
```

```python
from inspect import *
lineno = getframeinfo(currentframe()).lineno + 1 if None is not None else 0
print(warnings.formatwarning(
    "Failed to set memory limit: {0}".format(Exception("argh!")),
    ResourceWarning,
    __file__,
    lineno),
    file=sys.stderr)
```

```
python tmp.py
```

```
/.../tmp.py:0: ResourceWarning: Failed to set memory limit: argh!
```

Closes #31239 from HyukjinKwon/SPARK-33730-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-19 15:30:42 +09:00
Ruifeng Zheng d8cbef1abf [SPARK-34093][ML] param maxDepth should check upper bound
### What changes were proposed in this pull request?
update the ParamValidators of `maxDepth`

### Why are the changes needed?
current impl of tree models only support maxDepth<=30

### Does this PR introduce _any_ user-facing change?
If `maxDepth`>30, fail quickly

### How was this patch tested?
existing testsuites

Closes #31163 from zhengruifeng/param_maxDepth_upbound.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-18 11:36:10 -06:00
zero323 098f2268e4 [SPARK-33730][PYTHON] Standardize warning types
### What changes were proposed in this pull request?

This PR:

- Adds as small  hierarchy of warnings to be used in PySpark applications. These extend built-in classes and top level `PySparkWarning`.
- Replaces `DeprecationWarnings` (intended for developers) with PySpark specific subclasses of `FutureWarning` (intended for end users).

### Why are the changes needed?

- To be more precise and add users additional control (in addition to standard module level filters) over PySpark warnings handling.
- Correct semantics (at the moment we use `DeprecationWarning` in user-facing API, but it is intended "for warnings about deprecated features when those warnings are intended for other Python developers").

### Does this PR introduce _any_ user-facing change?

Yes. Code can raise different type of warning than before.

### How was this patch tested?

Existing tests.

Closes #30985 from zero323/SPARK-33730.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 09:32:55 +09:00
Huaxin Gao f3548837c6 [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector

### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.

### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures",  numTopFeatures=100)
```

Or

numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures",  numTopFeatures=100)
```

### How was this patch tested?
Add Unit test

Closes #31160 from huaxingao/UnivariateSelector.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-01-16 11:09:23 +08:00
ulysses-you 92e5cfd58d [SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql
### What changes were proposed in this pull request?

This PR aims to strip auto-generated cast. The main logic is:
1. Add tag if Cast is specified by user.
2. Wrap `PrettyAttribute` in usePrettyExpression.

### Why are the changes needed?

Make sql consistent with dsl. Here is an inconsistent example before this PR:

```
-- output field name: FLOOR(1)
spark.emptyDataFrame.select(floor(lit(1)))

-- output field name: FLOOR(CAST(1 AS DOUBLE))
spark.sql("select floor(1)")
```

Note that, we don't remove the `Cast` so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, we use `PrettyAttribute` replace `Cast` to give a better sql string.

### Does this PR introduce _any_ user-facing change?

Yes, the default field name may change.

### How was this patch tested?

Add test and pass exists test.

Closes #31034 from ulysses-you/SPARK-33989.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-14 15:27:14 +00:00
Takuya UESHIN ad8e40e2ab [SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function
### What changes were proposed in this pull request?

This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function.

### Why are the changes needed?

The original PRs are missing tests with column-based arguments instead of literals.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests and existing tests.

Closes #31159 from ueshin/issues/SPARK-32338/slice_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 09:56:38 +09:00
HyukjinKwon aa388cf3d0 [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well because the guide such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well because PySpark can get affected by it.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs. However, it's not released out yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:28:31 +09:00
HyukjinKwon ff284fb6ac [SPARK-30681][PYTHON][FOLLOW-UP] Keep the name similar with Scala side in higher order functions
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side.

Note that there are a bit of inconsistency already e.g.) `col`, `e`, `expr` and `column`. This part I did not change but other names like `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` looks unnecessary.

### Why are the changes needed?

To make the usage similar with Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Closes #31062 from HyukjinKwon/SPARK-30681.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 18:46:20 +09:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name. I see here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid the environment variables is unexpectedly conflicted.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable but it's not released yet.

### How was this patch tested?

Existing unittests will test.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
HyukjinKwon d6322bf70c [SPARK-33983][PYTHON] Update cloudpickle to v1.6.0
### What changes were proposed in this pull request?

This PR proposes to upgrade cloudpickle from 1.5.0 to 1.6.0.
It virtually contains one fix:

4510be850d

From a cursory look, this isn't a regression, and not even properly supported in Python:

```python
>>> import pickle
>>> pickle.dumps({}.keys())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot pickle 'dict_keys' object
```

So it seems fine not to backport.

### Why are the changes needed?

To leverage bug fixes from the cloudpickle upstream.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub actions build will test it out.

Closes #31007 from HyukjinKwon/cloudpickle-upgrade.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:36:31 -08:00
HyukjinKwon 6b86aa0b52 [SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1
### What changes were proposed in this pull request?

This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes #31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:23:38 -08:00
Dongjoon Hyun 271c4f6e00 [SPARK-33978][SQL] Support ZSTD compression in ORC data source
### What changes were proposed in this pull request?

This PR aims to support ZSTD compression in ORC data source.

### Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost.
- https://issues.apache.org/jira/browse/ORC-363

**BEFORE**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
```

**AFTER**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
```

```bash
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
```

### Does this PR introduce _any_ user-facing change?

Yes, this is a new feature.

### How was this patch tested?

Pass the newly added test case.

Closes #31002 from dongjoon-hyun/SPARK-33978.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 00:54:47 -08:00
Gabor Somogyi 678294ddc2 [SPARK-33824][PYTHON][DOCS][FOLLOW-UP] Clarify about PYSPARK_DRIVER_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON
### What changes were proposed in this pull request?

This PR proposes to clarify:
- `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernates.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific for a Spark application.

### Why are the changes needed?

To clarify what's required and not.

### Does this PR introduce _any_ user-facing change?

Yes, this is a user-facing doc change.

### How was this patch tested?

Manually tested.

Note that this credits to gaborgsomogyi who actually tested and raised a doubt about this offline to me.
I also manually tested all again to double check.

Closes #30938 from HyukjinKwon/SPARK-33824-followup.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 09:52:42 +09:00
Yuanjian Li 86c1cfc579 [SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API
### What changes were proposed in this pull request?
Follow up work for #30521, document the following behaviors in the API doc:

- Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table.
- Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created.

### Why are the changes needed?
We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Document only.

Closes #30885 from xuanyuanking/SPARK-33659.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-24 12:44:37 +09:00
Kyle Krueger 0bf3828ac4 [MINOR] update dstream.py with more accurate exceptions
### What changes were proposed in this pull request?

Reopened from https://github.com/apache/spark/pull/27525.
The exception messages for dstream.py when using windows were improved to be specific about what sliding duration is important.

### Why are the changes needed?

The batch interval of dstreams are improperly named as sliding windows. The term sliding window is also used to reference the new window of a dstream collected over a window of rdds in a parent dstream. We should probably fix the naming convention of sliding window used in the dstream class, but for now more this more explicit exception message may reduce confusion.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

It wasn't since this is only a change of the exception message

Closes #30871 from kykrueger/kykrueger-patch-1.

Authored-by: Kyle Krueger <kyle.s.krueger@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 14:17:09 -08:00
HyukjinKwon 4106731fdd [SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable
### What changes were proposed in this pull request?

This PR proposes to:

- Make doctests simpler to show the usage (since we're not running them now).
- Use the test utils to drop the tables if exists.

### Why are the changes needed?

Better docs and code readability.

### Does this PR introduce _any_ user-facing change?

No, dev-only. It includes some doc changes in unreleased branches.

### How was this patch tested?

Manually tested.

```bash
cd python
./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests"
```

Closes #30873 from HyukjinKwon/SPARK-33836.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2020-12-22 06:27:27 +09:00
HyukjinKwon 38bbccab75 [SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory for each PySpark test job
### What changes were proposed in this pull request?

This PR proposes to have its own metastore directory to avoid potential conflict in catalog operations.

### Why are the changes needed?

To make PySpark tests less flaky.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested by trying some sleeps in https://github.com/apache/spark/pull/30873.

Closes #30875 from HyukjinKwon/SPARK-33869.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 11:11:25 -08:00
Jungtaek Lim 8d4d433191 [SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable
### What changes were proposed in this pull request?

This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only way to read and write with table in Structured Streaming.

### Why are the changes needed?

Please refer SPARK-32885 and SPARK-32896 for rationalizations of these public APIs. This PR only exposes them to PySpark.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark users will be able to read and write with table in Structured Streaming query.

### How was this patch tested?

Manually tested.

> v1 table

>> create table A and ingest to the table A

```
spark.sql("""
create table table_pyspark_parquet (
    value long,
    `timestamp` timestamp
) USING parquet
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5')
query.lastProgress
query.stop()
```

>> read table A and ingest to the table B which doesn't exist

```
df2 = spark.readStream.table('table_pyspark_parquet')
query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2')
query2.lastProgress
query2.stop()
```

>> select tables

```
spark.sql("DESCRIBE TABLE table_pyspark_parquet").show()
spark.sql("SELECT * FROM table_pyspark_parquet").show()

spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show()
spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show()
```

> v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well)

>> create table A and ingest to the table A

```
spark.sql("""
create table iceberg_catalog.default.table_pyspark_v2table (
    value long,
    `timestamp` timestamp
) USING iceberg
""")
df = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query = df.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1')
query.lastProgress
query.stop()
```

>> ingest to the non-exist table B

```
df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2')
query2.lastProgress
query2.stop()
```

>> ingest to the non-exist table C partitioned by `value % 10`

```
df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load()
df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition')
query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3')
query3.lastProgress
query3.stop()
```

>> select tables

```
spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show()

spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show()

spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show()
```

Closes #30835 from HeartSaVioR/SPARK-33836.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-21 19:42:59 +09:00
HyukjinKwon 6315118676 [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page
### What changes were proposed in this pull request?

This PR proposes to restructure and refine the Python dependency management page.
I lately wrote a blog post which will be published soon, and decided contribute some of the contents back to PySpark documentation.
FWIW, it has been reviewed by some tech writers and engineers.

I built the site for making the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It's doc change but only in unreleased bracnhs for now.

### How was this patch tested?

I manually built the docs as:

```bash
cd python/docs
make clean html
open
```

Closes #30822 from HyukjinKwon/SPARK-33824.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-18 10:03:07 +09:00
HyukjinKwon e2cdfcebd9 [SPARK-32447][CORE][PYTHON][FOLLOW-UP] Fix other occurrences of 'python' to 'python3'
### What changes were proposed in this pull request?

This PR proposes to change python to python3 in several places missed.

### Why are the changes needed?

To use Python 3 by default safely.

### Does this PR introduce _any_ user-facing change?

Yes, it will uses `python3` as its default Python interpreter.

### How was this patch tested?

It was tested together in https://github.com/apache/spark/pull/30735. The test cases there will verify this change together.

Closes #30750 from HyukjinKwon/SPARK-32447.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-13 10:41:47 +09:00
Fokko Driesprong e4d1c10760 [SPARK-32320][PYSPARK] Remove mutable default arguments
This is bad practice, and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/

```
fokkodriesprongFan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
```

```
fokkodriesprongFan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
```

### What changes were proposed in this pull request?

Removing the mutable default arguments.

### Why are the changes needed?

Removing the mutable default arguments, and changing the signature to `Optional[...]`.

### Does this PR introduce _any_ user-facing change?

No 👍

### How was this patch tested?

Using the Flake8 bugbear code analysis plugin.

Closes #29122 from Fokko/SPARK-32320.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-08 09:35:36 +08:00
HyukjinKwon 5250841537
[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style
### What changes were proposed in this pull request?

This PR adds few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style at SPARK-32085.

Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work.

### Why are the changes needed?

To tell developers that PySpark now follows NumPy documentation style.

### Does this PR introduce _any_ user-facing change?

No, it's a change in unreleased branches yet.

### How was this patch tested?

Manually tested via `make clean html` under `python/docs`:

![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)

Closes #30622 from HyukjinKwon/SPARK-33256.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:22:24 -08:00
Dongjoon Hyun de9818f043
[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
Weichen Xu 7e759b2d95 [SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator
### What changes were proposed in this pull request?
make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/model

### Why are the changes needed?
Currently, pyspark support third-party library to define python backend estimator/evaluator, i.e., estimator that inherit `Estimator` instead of `JavaEstimator`, and only can be used in pyspark.

CrossValidator and TrainValidateSplit support tuning these python backend estimator,
but cannot support saving/load, becase CrossValidator and TrainValidateSplit writer implementation is use JavaMLWriter, which require to convert nested estimator and evaluator into java instance.

OneVsRest saving/load now only support java backend classifier due to similar issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30471 from WeichenXu123/support_pyio_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2020-12-04 08:35:50 +08:00
Gabor Somogyi bd711863fd [SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on driver side
### What changes were proposed in this pull request?
`spark.buffer.size` not applied in driver from pyspark. In this PR I've fixed this issue.

### Why are the changes needed?
Apply the mentioned config on driver side.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests + manually.

Added the following code temporarily:
```
def local_connect_and_auth(port, auth_secret):
...
            sock.connect(sa)
            print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition
            sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
...
```

Test:
```
#Compile Spark

echo "spark.buffer.size 10000" >> conf/spark-defaults.conf

$ ./bin/pyspark
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
Spark context Web UI available at http://192.168.0.189:4040
Spark context available as 'sc' (master = local[*], app id = local-1606999094506).
SparkSession available as 'spark'.
>>> sc.setLogLevel("TRACE")
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
...
SPARK_BUFFER_SIZE: 10000
...
[[0], [2], [3], [4], [6]]
>>>
```

Closes #30592 from gaborgsomogyi/SPARK-33629.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 01:37:44 +09:00
Liang-Chi Hsieh 3b2ff16ee6 [SPARK-33636][PYTHON][ML][FOLLOWUP] Update since tag of labelsArray in StringIndexer
### What changes were proposed in this pull request?

This is to update `labelsArray`'s since tag.

### Why are the changes needed?

The original change was backported to branch-3.0 for 3.0.2 version. So it is better to update the since tag to reflect the fact.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A. Just tag change.

Closes #30582 from viirya/SPARK-33636-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 14:34:44 +09:00
Liang-Chi Hsieh 0880989755 [SPARK-22798][PYTHON][ML][FOLLOWUP] Add labelsArray to PySpark StringIndexer
### What changes were proposed in this pull request?

This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.

### Why are the changes needed?

`labelsArray` is for multi-column case for `StringIndexer`. We should provide this accessor at PySpark side too.

### Does this PR introduce _any_ user-facing change?

Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.

### How was this patch tested?

Unit test.

Closes #30579 from viirya/SPARK-22798-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-03 10:57:14 +09:00