ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takuya UESHIN	5440ea84ee	[SPARK-32312][DOC][FOLLOWUP] Fix the minimum version of PyArrow in the installation guide ### What changes were proposed in this pull request? Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide. ### Why are the changes needed? The minimum version of PyArrow was upgraded to `1.0.0`. ### Does this PR introduce _any_ user-facing change? Users see the correct minimum version in the installation guide. ### How was this patch tested? N/A Closes #29829 from ueshin/issues/SPARK-32312/doc. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-22 11:04:14 +09:00
itholic	9c653c957f	[SPARK-32189][DOCS][PYTHON] Development - Setting up IDEs ### What changes were proposed in this pull request? This PR proposes to document the way of setting up IDEs ![스크린샷 2020-09-21 오전 10 43 12](https://user-images.githubusercontent.com/44108233/93727715-5c2a6e80-fbf7-11ea-821b-555723b00bc8.png) ![스크린샷 2020-09-21 오전 10 43 45](https://user-images.githubusercontent.com/44108233/93727716-5f255f00-fbf7-11ea-9c6c-7b8a973bc511.png) ### Why are the changes needed? To let users know how to setup IDEs ### Does this PR introduce _any_ user-facing change? Yes, it adds a new page in the documentation about setting IDEs. ### How was this patch tested? Manually built the doc. Closes #29781 from itholic/SPARK-32189. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-21 12:29:17 +09:00
zero323	7fb9f6884f	[SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName ### What changes were proposed in this pull request? Add optional `allowMissingColumns` argument to SparkR `unionByName`. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? `unionByName` supports `allowMissingColumns`. ### How was this patch tested? Existing unit tests. New unit tests targeting this feature. Closes #29813 from zero323/SPARK-32799. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-21 09:39:34 +09:00
HyukjinKwon	f893a19c4c	[SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide ### What changes were proposed in this pull request? This PR: - rephrases some wordings in installation guide to avoid using the terms that can be potentially ambiguous such as "different favors" - documents extra dependency installation `pip install pyspark[sql]` - uses the link that corresponds to the released version. e.g.) https://spark.apache.org/docs/latest/building-spark.html vs https://spark.apache.org/docs/3.0.0/building-spark.html - adds some more details I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html ### Why are the changes needed? To improve installation guide. ### Does this PR introduce _any_ user-facing change? Yes, it updates the user-facing installation guide. ### How was this patch tested? Manually built the doc and tested. Closes #29779 from HyukjinKwon/SPARK-32180. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-20 10:58:17 +09:00
HyukjinKwon	657e39a334	[SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate ### What changes were proposed in this pull request? In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below: ```python import warnings from pyspark.sql import SparkSession, SQLContext warnings.simplefilter('always', DeprecationWarning) spark.stop() SparkSession.builder.getOrCreate() ``` it shows the deprecation warning as below: ``` /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) ``` via `d3304268d3/python/pyspark/sql/session.py (L222)` We shouldn't print the deprecation warning from it. This is the only place ^. ### Why are the changes needed? To prevent to inform users that `SparkSession.builder.getOrCreate` is deprecated mistakenly. ### Does this PR introduce _any_ user-facing change? Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`. ### How was this patch tested? Manually tested as above. Closes #29768 from HyukjinKwon/SPARK-32897. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2020-09-16 10:13:47 -07:00
zero323	c918909c1a	[SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword ### What changes were proposed in this pull request? Replace `__metaclass__` fields with `metaclass` keyword in the class statements. ### Why are the changes needed? `__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons. ``` >>> from pyspark.sql.types import BooleanType >>> BooleanType() is BooleanType() False ``` and classes, which suppose to be abstract, are not ``` >>> import inspect >>> from pyspark.ml import Estimator >>> inspect.isabstract(Estimator) False ``` ### Does this PR introduce _any_ user-facing change? Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix. ### How was this patch tested? Existing tests. Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:22:11 +09:00
Adam Binford	e884290587	[SPARK-32835][PYTHON] Add withField method to the pyspark Column class ### What changes were proposed in this pull request? This PR adds a `withField` method on the pyspark Column class to call the Scala API method added in https://github.com/apache/spark/pull/27066. ### Why are the changes needed? To update the Python API to match a new feature in the Scala API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test Closes #29699 from Kimahriman/feature/pyspark-with-field. Authored-by: Adam Binford <adam.binford@radiantsolutions.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:18:36 +09:00
Liang-Chi Hsieh	550c1c9cfb	[SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV ### What changes were proposed in this pull request? This proposes to enhance user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API will remove all lines same with the header. ### Why are the changes needed? This behavior can confuse users. We should explicitly document it. ### Does this PR introduce _any_ user-facing change? No. Only doc change. ### How was this patch tested? Only doc change. Closes #29765 from viirya/SPARK-32888. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:16:15 +09:00
Abhishek Dixit	6f36db1fa5	[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py ### What changes were proposed in this pull request? Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Using existing tests Closes #29242 from abhishekd0907/SPARK-31448. Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-15 08:41:22 -05:00
Sean Owen	ce566bed17	[SPARK-32180][FOLLOWUP] Fix .rst error in new Pyspark installation guide This simply fixes an .rst generation error in https://github.com/apache/spark/pull/29640 Closes #29735 from srowen/SPARK-32180.2. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2020-09-11 20:08:22 -07:00
Rohit.Mishra	f6322d1cb1	[SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation ### What changes were proposed in this pull request? This PR proposes to add getting started- installation to new PySpark docs. ### Why are the changes needed? Better documentation. ### Does this PR introduce _any_ user-facing change? No. Documentation only. ### How was this patch tested? Generating documents locally. Closes #29640 from rohitmishr1484/SPARK-32180-Getting-Started-Installation. Authored-by: Rohit.Mishra <rohit.mishra@utopusinsights.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-11 10:38:01 -05:00
Bryan Cutler	e0538bd38c	[SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1 ### What changes were proposed in this pull request? Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0. This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits. Version 1.0.0 - 1.0.0 include the following selected fixes/improvements relevant to Spark users: ARROW-9300 - [Java] Separate Netty Memory to its own module ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators ARROW-8664 - [Java] Add skip null check to all Vector types ARROW-8485 - [Integration][Java] Implement extension types integration ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table ARROW-8230 - [Java] Move Netty memory manager into a separate module ARROW-8229 - [Java] Move ArrowBuf into the Arrow package ARROW-7955 - [Java] Support large buffer for file/stream IPC ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++ ARROW-6110 - [Java] Support LargeList Type and add integration test with C++ ARROW-5760 - [C++] Optimize Take implementation ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column ARROW-9066 - [Python] Raise correct error in isnull() ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class ARROW-7610 - [Java] Finish support for 64 bit int allocations ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison ARROW-8537 - [C++] Performance regression from ARROW-8523 ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults View release notes here: https://arrow.apache.org/release/1.0.1.html https://arrow.apache.org/release/1.0.0.html ### Why are the changes needed? Upgrade brings fixes, improvements and stability guarantees. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests with pyarrow 1.0.0 and 1.0.1 Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-10 14:16:19 +09:00
Wenchen Fan	f7995c576a	Revert "[SPARK-32677][SQL] Load function resource before create" This reverts commit `05fcf26b79`.	2020-09-09 18:15:22 +00:00
itholic	c8c082ce38	[SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py ### What changes were proposed in this pull request? In certain environments, seems it fails to run `run-tests.py` script as below: ``` Traceback (most recent call last): File "<string>", line 1, in <module> ... raise RuntimeError(''' RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. Traceback (most recent call last): ... raise EOFError EOFError ``` The reason is that `Manager.dict()` launches another process when the main process is initiated. It works in most environments for an unknown reason but it should be good to avoid such pattern as guided from Python itself. ### Why are the changes needed? To prevent the test failure for Python. ### Does this PR introduce _any_ user-facing change? No, it fixes a test script. ### How was this patch tested? Manually ran the script after fixing. ``` Running PySpark tests. Output is in /.../python/unit-tests.log Will test against the following Python executables: ['/.../python3', 'python3.8'] Will test the following Python tests: ['pyspark.sql.dataframe'] /.../python3 python_implementation is CPython /.../python3 version is: Python 3.8.5 python3.8 python_implementation is CPython python3.8 version is: Python 3.8.5 Starting test(/.../python3): pyspark.sql.dataframe Starting test(python3.8): pyspark.sql.dataframe Finished test(/.../python3): pyspark.sql.dataframe (33s) Finished test(python3.8): pyspark.sql.dataframe (34s) Tests passed in 34 seconds ``` Closes #29666 from itholic/SPARK-32812. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-08 12:22:13 +09:00
HyukjinKwon	c336ae39cd	[SPARK-32186][DOCS][PYTHON] Development - Debugging ### What changes were proposed in this pull request? This PR proposes to document the way of debugging PySpark. It's pretty much self-descriptive. I made a demo site to review it more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/debugging.html ### Why are the changes needed? To let users know how to debug PySpark applications. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new page in the documentation about debugging PySpark. ### How was this patch tested? Manually built the doc. Closes #29639 from HyukjinKwon/SPARK-32186. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-08 10:32:22 +09:00
itholic	8bd3770552	[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark ### What changes were proposed in this pull request? This PR proposes to add new argument `allowMissingColumns` to `unionByName` for allowing users to specify whether to allow missing columns or not. ### Why are the changes needed? To expose `allowMissingColumns` argument in Python API also. Currently this is only exposed in Scala/Java APIs. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new examples with new argument in the docstring. ### How was this patch tested? Doctest added and manually tested ``` $ python/run-tests --testnames pyspark.sql.dataframe Running PySpark tests. Output is in /.../spark/python/unit-tests.log Will test against the following Python executables: ['/.../python3', 'python3.8'] Will test the following Python tests: ['pyspark.sql.dataframe'] /.../python3 python_implementation is CPython /.../python3 version is: Python 3.8.5 python3.8 python_implementation is CPython python3.8 version is: Python 3.8.5 Starting test(/.../python3): pyspark.sql.dataframe Starting test(python3.8): pyspark.sql.dataframe Finished test(python3.8): pyspark.sql.dataframe (35s) Finished test(/.../python3): pyspark.sql.dataframe (35s) Tests passed in 35 seconds ``` Closes #29657 from itholic/SPARK-32798. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-08 09:41:02 +09:00
ulysses	05fcf26b79	[SPARK-32677][SQL] Load function resource before create ### What changes were proposed in this pull request? Change `CreateFunctionCommand` code that add class check before create function. ### Why are the changes needed? We have different behavior between create permanent function and temporary function when function class is invaild. e.g., ``` create function f as 'test.non.exists.udf'; -- Time taken: 0.104 seconds create temporary function f as 'test.non.exists.udf' -- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath; ``` And Hive also fails both of them. ### Does this PR introduce _any_ user-facing change? Yes, user will get exception when create a invalid udf. ### How was this patch tested? New test. Closes #29502 from ulysses-you/function. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-07 06:00:23 +00:00
HyukjinKwon	32d87c2b59	[SPARK-32783][DOCS][PYTHON] Development - Testing PySpark ### What changes were proposed in this pull request? This PR proposes to add a page to describe how to test PySpark. Note that it avoids duplication of https://spark.apache.org/developer-tools.html and it more aims to add put the relevant links together. I made a demo site to review more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/testing.html ### Why are the changes needed? To guide PySpark developers easily test. ### Does this PR introduce _any_ user-facing change? Yes, it will adds a new documentation page. ### How was this patch tested? Manually tested. Closes #29634 from HyukjinKwon/SPARK-32783. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-04 18:06:25 +09:00
HyukjinKwon	d80c85c2e3	[SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the main page in migration guide ### What changes were proposed in this pull request? This PR is a minor followup to fix: 1. Slightly reword the wording in the main page. 2. The indentation in the table at the migration guide; from ![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png) to ![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png) ### Why are the changes needed? In order to show the migration guide pretty. ### Does this PR introduce _any_ user-facing change? Yes, this is a change to user-facing documentation. ### How was this patch tested? Manually built the documentation. Closes #29606 from HyukjinKwon/SPARK-32191. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 15:08:03 +09:00
HyukjinKwon	86ca90ccd7	[SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark ### What changes were proposed in this pull request? This PR proposes to document PySpark specific contribution guides at "Development" section. Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes, it is a new documentation. See the demo linked above. ### How was this patch tested? ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` and ```bash cd python/docs make clean html ``` Closes #29596 from HyukjinKwon/SPARK-32190. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 14:20:07 +09:00
HyukjinKwon	eaaf783148	[MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly ### What changes were proposed in this pull request? This PR fixes the link of Binder in Quickstart notebook and documentation. From: https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb To: https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb This link is the same as the one in RST files: `b54103016a/python/docs/source/conf.py (L57)` ### Why are the changes needed? The link was wrong, and points out non-existent file and repo. ### Does this PR introduce _any_ user-facing change? Yes, it will fixes the link so users can correctly try Binder. ### How was this patch tested? Manually tested by building the documentation. Closes #29597 from HyukjinKwon/minor-link-quickstart. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 22:00:56 +09:00
zero323	5574734093	[SPARK-32138][FOLLOW-UP] Drop obsolete StringIO import branching ### What changes were proposed in this pull request? Removal of branched `StringIO` import. ### Why are the changes needed? Top level `StringIO` is no longer present in Python 3.x. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29590 from zero323/SPARK-32138-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 16:56:50 +09:00
Fokko Driesprong	a1e459ed9f	[SPARK-32719][PYTHON] Add Flake8 check missing imports https://issues.apache.org/jira/browse/SPARK-32719 ### What changes were proposed in this pull request? Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard). ### Why are the changes needed? To make sure that the quality of the Python code is up to standard. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit-tests and Flake8 static analysis Closes #29563 from Fokko/fd-add-check-missing-imports. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 11:23:31 +09:00
Louiszr	a0bd273bb0	[SPARK-32092][ML][PYSPARK][FOLLOWUP] Fixed CrossValidatorModel.copy() to copy models instead of list ### What changes were proposed in this pull request? Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models. ### Why are the changes needed? `copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated `test_copy` to make sure `copy()` is called on models instead of lists of models. Closes #29553 from Louiszr/fix-cv-copy. Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-08-28 10:15:16 -07:00
HyukjinKwon	5775073a01	[SPARK-32722][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 1.0.1, pandas 1.1.1, Python 3.7) ### What changes were proposed in this pull request? This PR updates the chart generated at SPARK-25666. We bumped up the minimal PyArrow version. It's better to use PyArrow 0.15.1+ ### Why are the changes needed? To track the changes in type coercion of PySpark <> PyArrow <> pandas. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Use this code to generate the chart: ```python from pyspark.sql.types import * from pyspark.sql.functions import pandas_udf columns = [ ('none', 'object(NoneType)'), ('bool', 'bool'), ('int8', 'int8'), ('int16', 'int16'), ('int32', 'int32'), ('int64', 'int64'), ('uint8', 'uint8'), ('uint16', 'uint16'), ('uint32', 'uint32'), ('uint64', 'uint64'), ('float64', 'float16'), ('float64', 'float32'), ('float64', 'float64'), ('date', 'datetime64[ns]'), ('tz_aware_dates', 'datetime64[ns, US/Eastern]'), ('string', 'object(string)'), ('decimal', 'object(Decimal)'), ('array', 'object(array[int32])'), ('float128', 'float128'), ('complex64', 'complex64'), ('complex128', 'complex128'), ('category', 'category'), ('tdeltas', 'timedelta64[ns]'), ] def create_dataframe(): import pandas as pd import numpy as np import decimal pdf = pd.DataFrame({ 'none': [None, None], 'bool': [True, False], 'int8': np.arange(1, 3).astype('int8'), 'int16': np.arange(1, 3).astype('int16'), 'int32': np.arange(1, 3).astype('int32'), 'int64': np.arange(1, 3).astype('int64'), 'uint8': np.arange(1, 3).astype('uint8'), 'uint16': np.arange(1, 3).astype('uint16'), 'uint32': np.arange(1, 3).astype('uint32'), 'uint64': np.arange(1, 3).astype('uint64'), 'float16': np.arange(1, 3).astype('float16'), 'float32': np.arange(1, 3).astype('float32'), 'float64': np.arange(1, 3).astype('float64'), 'float128': np.arange(1, 3).astype('float128'), 'complex64': np.arange(1, 3).astype('complex64'), 'complex128': np.arange(1, 3).astype('complex128'), 'string': list('ab'), 'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]), 'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]), 'date': pd.date_range('19700101', periods=2).values, 'category': pd.Series(list("AB")).astype('category')}) pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]] pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern') return pdf types = [ BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(), FloatType(), DoubleType(), DateType(), TimestampType(), StringType(), DecimalType(10, 0), ArrayType(IntegerType()), MapType(StringType(), IntegerType()), StructType([StructField("_1", IntegerType())]), BinaryType(), ] df = spark.range(2).repartition(1) results = [] count = 0 total = len(types) * len(columns) values = [] spark.sparkContext.setLogLevel("FATAL") for t in types: result = [] for column, pandas_t in columns: v = create_dataframe()[column][0] values.append(v) try: row = df.select(pandas_udf(lambda _: create_dataframe()[column], t)(df.id)).first() ret_str = repr(row[0]) except Exception: ret_str = "X" result.append(ret_str) progress = "SQL Type: [%s]\n Pandas Value(Type): %s(%s)]\n Result Python Value: [%s]" % ( t.simpleString(), v, pandas_t, ret_str) count += 1 print("%s/%s:\n %s" % (count, total, progress)) results.append([t.simpleString()] + list(map(str, result))) schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns))) strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False) print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n")))) ``` Closes #29569 from HyukjinKwon/SPARK-32722. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:38:39 +09:00
HyukjinKwon	c154629171	[SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow ### What changes were proposed in this pull request? This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide"). Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation. ### How was this patch tested? ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` and ```bash cd python/docs make clean html ``` Closes #29548 from HyukjinKwon/SPARK-32183. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:09:06 +09:00
Terry Kim	baaa756dee	[SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start() ### What changes were proposed in this pull request? This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. ### Why are the changes needed? The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. For example, ``` Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") ``` will write the result to `/tmp/path2`. ### Does this PR introduce _any_ user-facing change? Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`: ``` scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.; ``` The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`. ### How was this patch tested? Added new tests. Closes #29543 from imback82/path_option. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:21:04 +00:00
unirt	d3304268d3	[MINOR][PYTHON] Fix typo in a docsting of RDD.toDF ### What changes were proposed in this pull request? Fixes typo in docsting of `toDF` ### Why are the changes needed? The third argument of `toDF` is actually `sampleRatio`. related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch doesn't affect any logic, so existing tests should cover it. Closes #29551 from unirt/minor_fix_docs. Authored-by: unirt <lunirtc@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-26 10:34:49 -07:00
HyukjinKwon	b54103016a	[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation ### What changes were proposed in this pull request? This PR proposes to: - add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb). - reuse this notebook as a quickstart guide in PySpark documentation. Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on Docker image. Once somebody builds, other people can reuse the image against a specific commit. Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks. <br/> I made a simple demo to make it easier to review. Please see: - [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet. - [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html) <br/> When reviewing the notebook file itself, please give my direct feedback which I will appreciate and address. Another way might be: - open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb). - edit / change / update the notebook. Please feel free to change as whatever you want. I can apply as are or slightly update more when I apply to this PR. - download it as a `.ipynb` file: ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png) - upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file with crediting correctly, of course. - alternatively, push a commit into this PR right away if that's easier for you (if you're a committer). References: - https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html - https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html ### Why are the changes needed? To improve PySpark's usability. The current quickstart for Python users are very friendly. ### Does this PR introduce _any_ user-facing change? Yes, it will add a documentation page, and expose a live notebook to PySpark users. ### How was this patch tested? Manually tested, and GitHub Actions builds will test. Closes #29491 from HyukjinKwon/SPARK-32204. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-26 12:23:24 +09:00
Nicholas Chammas	f540031419	[SPARK-31000][PYTHON][SQL] Add ability to set table description via Catalog.createTable() ### What changes were proposed in this pull request? This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax: ```sql CREATE TABLE ... COMMENT 'this is a fancy table'; ``` ### Why are the changes needed? This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL. ### Does this PR introduce any user-facing change? Yes, it adds a new parameter to `Catalog.createTable()`. ### How was this patch tested? Existing unit tests: ```sh ./python/run-tests \ --python-executables python3.7 \ --testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context' ``` ``` $ ./build/sbt testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite ``` Closes #27908 from nchammas/SPARK-31000-table-description. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-25 13:42:31 +09:00
Nicholas Chammas	41cf1d093f	[SPARK-32686][PYTHON] Un-deprecate inferring DataFrame schema from list of dict ### What changes were proposed in this pull request? As discussed in https://github.com/apache/spark/pull/29491#discussion_r474451282 and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas. ### Why are the changes needed? This change clarifies to users that this behavior is supported and is not going away in the near future. ### Does this PR introduce _any_ user-facing change? Yes. There used to be a `UserWarning` for this, but now there isn't. ### How was this patch tested? I tested this manually. Before: ```python >>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}])) /Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead warnings.warn("Using RDD of dict to inferSchema is deprecated. " DataFrame[a: bigint] >>> spark.createDataFrame([{'a': 5}]) .../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead warnings.warn("inferring schema from dict is deprecated," DataFrame[a: bigint] ``` After: ```python >>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}])) DataFrame[a: bigint] >>> spark.createDataFrame([{'a': 5}]) DataFrame[a: bigint] ``` Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-08-24 14:55:11 -07:00
Louiszr	d9eb06ea37	[SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() ### What changes were proposed in this pull request? Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods. ### Why are the changes needed? Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`. It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls. Closes #29445 from Louiszr/master. Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-08-22 09:27:31 -05:00
Sean Owen	891c5e661a	[MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation ### What changes were proposed in this pull request? The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result. `InheritableThread` and `KMeansSummary` should be into corresponding RST files for documentation. ### Why are the changes needed? It seems like an oversight to not include this as all similar "summary" classes are. `InheritableThread` should also be documented. ### Does this PR introduce _any_ user-facing change? I don't believe there are functional changes. It should make this public class appear in docs. ### How was this patch tested? Existing tests / N/A. Closes #29470 from srowen/KMeansSummary. Lead-authored-by: Sean Owen <srowen@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-19 14:30:07 +09:00
alexander-daskalov	10edeafc69	[MINOR][SQL] Fixed approx_count_distinct rsd param description ### What changes were proposed in this pull request? In the docs concerning the approx_count_distinct I have changed the description of the rsd parameter from _maximum estimation error allowed_ to _maximum relative standard deviation allowed_ ### Why are the changes needed? Maximum estimation error allowed can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on given runs the estimation error can still be above the rsd parameter. ### Does this PR introduce _any_ user-facing change? This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead, ### How was this patch tested? No tests, as no code changes were made. Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description. Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-14 22:10:41 +09:00
Dongjoon Hyun	b421bf0196	[SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3 ### What changes were proposed in this pull request? This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`. ### Why are the changes needed? In a YARN cluster, HDFS uaually provides storages with replication factor 3. So, we can save the result to HDFS to get `StorageLevel.DISK_ONLY_3` technically. However, disaggregate clusters or clusters without storage services are rising. Previously, in that situation, the users were able to use similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX. ### Does this PR introduce _any_ user-facing change? Yes. This provides a new built-in option. ### How was this patch tested? Pass the GitHub Action or Jenkins with the revised test cases. Closes #29331 from dongjoon-hyun/SPARK-32517. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-10 07:33:06 -07:00
Weichen Xu	fc62d72076	[MINOR] add test_createDataFrame_empty_partition in pyspark arrow tests ### What changes were proposed in this pull request? add test_createDataFrame_empty_partition in pyspark arrow tests ### Why are the changes needed? test edge cases. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? N/A Closes #29398 from WeichenXu123/add_one_pyspark_arrow_test. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-10 18:43:41 +09:00
Liang-Chi Hsieh	f9f992e9a4	[SPARK-32191][PYTHON][DOCS] Port migration guide for PySpark docs ### What changes were proposed in this pull request? This proposes to port old PySpark migration guide to new PySpark docs. ### Why are the changes needed? Better documentation. ### Does this PR introduce _any_ user-facing change? No. Documentation only. ### How was this patch tested? Generated document locally. <img width="1521" alt="Screen Shot 2020-08-07 at 1 53 20 PM" src="https://user-images.githubusercontent.com/68855/89687618-672e7700-d8b5-11ea-8f29-67a9ab271fa8.png"> Closes #29385 from viirya/SPARK-32191. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-10 15:41:32 +09:00
Fokko Driesprong	9fcf0ea718	[SPARK-32319][PYSPARK] Disallow the use of unused imports Disallow the use of unused imports: - Unnecessary increases the memory footprint of the application - Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :) ``` fokkodriesprongFan spark % flake8 python \| grep -i "imported but unused" python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused python/pyspark/heapq3.py:868:5: F401 '_heapq.' imported but unused python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused python/pyspark/rdd.py:21:1: F401 're' imported but unused python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.' imported but unused python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.' imported but unused python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.' imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.' imported but unused python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.' imported but unused python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.' imported but unused python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.' imported but unused python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.' imported but unused python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.' imported but unused python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.' imported but unused python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.' imported but unused python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.' imported but unused python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.' imported but unused python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.' imported but unused python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.' imported but unused python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.' imported but unused python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.' imported but unused python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.' imported but unused python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.' imported but unused python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.' imported but unused python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.' imported but unused python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.' imported but unused python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.' imported but unused python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.' imported but unused python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.' imported but unused python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.' imported but unused python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.' imported but unused python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.' imported but unused python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.' imported but unused python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.' imported but unused python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.' imported but unused python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.' imported but unused python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.' imported but unused python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.' imported but unused python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.' imported but unused python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.' imported but unused python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.' imported but unused python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.' imported but unused python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.' imported but unused python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.' imported but unused python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.' imported but unused python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.' imported but unused python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.' imported but unused python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.' imported but unused python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.' imported but unused python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.' imported but unused python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.' imported but unused python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused ``` After: ``` fokkodriesprongFan spark % flake8 python \| grep -i "imported but unused" fokkodriesprongFan spark % ``` ### What changes were proposed in this pull request? Removing unused imports from the Python files to keep everything nice and tidy. ### Why are the changes needed? Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Adding the rule to the existing Flake8 checks. Closes #29121 from Fokko/SPARK-32319. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-08 08:51:57 -07:00
Liang Zhang	2cb48eabdc	[SPARK-32549][PYSPARK] Add column name in _infer_schema error message ### What changes were proposed in this pull request? The current error message from `_infer_type` in `_infer_schema` only includes the unsupported column type but not the column name. This PR adds the column name in the error message to make it easier for users to identify which column should they drop or convert. ### Why are the changes needed? Improve user experience. ### Does this PR introduce _any_ user-facing change? Yes. The error message from `_infer_schema` is changed. Before: "not supported type: foo" After: "Column bar contains not supported type: foo" ### How was this patch tested? Updated the existing unit test. Closes #29365 from liangz1/types-error-colname. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-07 11:50:46 +09:00
Huaxin Gao	75c2c53e93	[SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests ### What changes were proposed in this pull request? The test creates 10 batches of data to train the model and expects to see error on test data improves as model is trained. If the difference between the 2nd error and the 10th error is smaller than 2, the assertion fails: ``` FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) Test that error on test data improves as model is trained. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction eventually(condition, timeout=180.0) File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually lastValue = condition() File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition self.assertGreater(errors[1] - errors[-1], 2) AssertionError: 1.672640157855923 not greater than 2 ``` I saw this quite a few time on Jenkins but was not able to reproduce this on my local. These are the ten errors I got: ``` 4.517395047937127 4.894265404350079 3.0392090466559876 1.8786361640757654 0.8973106042078115 0.3715780507684368 0.20815690742907672 0.17333033743125845 0.15686783249863873 0.12584413600569616 ``` I am thinking of having 15 batches of data instead of 10, so the model can be trained for a longer time. Hopefully the 15th error - 2nd error will always be larger than 2 on Jenkins. These are the 15 errors I got on my local: ``` 4.517395047937127 4.894265404350079 3.0392090466559876 1.8786361640757658 0.8973106042078115 0.3715780507684368 0.20815690742907672 0.17333033743125845 0.15686783249863873 0.12584413600569616 0.11883853835108477 0.09400261862100823 0.08887491447353497 0.05984929624986607 0.07583948141520978 ``` ### Why are the changes needed? Fix flaky test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #29380 from huaxingao/flaky_test. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-08-06 13:54:15 -07:00
HyukjinKwon	15b73339d9	[SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation ### What changes were proposed in this pull request? This PR proposes to write the main page of PySpark documentation. The base work is finished at https://github.com/apache/spark/pull/29188. ### Why are the changes needed? For better usability and readability in PySpark documentation. ### Does this PR introduce _any_ user-facing change? Yes, it creates a new main page as below: ![Screen Shot 2020-07-31 at 10 02 44 PM](https://user-images.githubusercontent.com/6477701/89037618-d2d68880-d379-11ea-9a44-562f2aa0e3fd.png) ### How was this patch tested? Manually built the PySpark documentation. ```bash cd python make clean html ``` Closes #29320 from HyukjinKwon/SPARK-32507. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-05 11:14:14 +09:00
Max Gekk	7eb6f45688	[SPARK-32499][SQL] Use `{}` in conversions maps and structs to strings ### What changes were proposed in this pull request? Change casting of map and struct values to strings by using the `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs by `[]` in casting to strings. Otherwise, if this is `false`, which is the default, maps and structs are wrapped by `{}`. ### Why are the changes needed? - To distinguish structs/maps from arrays. - To make `show`'s output consistent with Hive and conversions to Hive strings. - To display dataframe content in the same form by `spark-sql` and `show` - To be consistent with the `*.sql` tests ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By existing test suite `CastSuite`. Closes #29308 from MaxGekk/show-struct-map. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-04 14:57:09 +00:00
Takuya UESHIN	7deb67c28f	[SPARK-32160][CORE][PYSPARK][FOLLOWUP] Change the config name to switch allow/disallow SparkContext in executors ### What changes were proposed in this pull request? This is a follow-up of #29278. This PR changes the config name to switch allow/disallow `SparkContext` in executors as per the comment https://github.com/apache/spark/pull/29278#pullrequestreview-460256338. ### Why are the changes needed? The config name `spark.executor.allowSparkContext` is more reasonable. ### Does this PR introduce _any_ user-facing change? Yes, the config name is changed. ### How was this patch tested? Updated tests. Closes #29340 from ueshin/issues/SPARK-32160/change_config_name. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-04 12:45:06 +09:00
Huaxin Gao	bc7885901d	[SPARK-32310][ML][PYSPARK] ML params default value parity in feature and tuning ### What changes were proposed in this pull request? set params default values in trait Params for feature and tuning in both Scala and Python. ### Why are the changes needed? Make ML has the same default param values between estimator and its corresponding transformer, and also between Scala and Python. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing and modified tests Closes #29153 from huaxingao/default2. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-08-03 08:50:34 -07:00
Takuya UESHIN	8014b0b5d6	[SPARK-32160][CORE][PYSPARK] Add a config to switch allow/disallow to create SparkContext in executors ### What changes were proposed in this pull request? This is a follow-up of #28986. This PR adds a config to switch allow/disallow to create `SparkContext` in executors. - `spark.driver.allowSparkContextInExecutors` ### Why are the changes needed? Some users or libraries actually create `SparkContext` in executors. We shouldn't break their workloads. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to create `SparkContext` in executors with the config enabled. ### How was this patch tested? More tests are added. Closes #29278 from ueshin/issues/SPARK-32160/add_configs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-31 17:28:35 +09:00
HyukjinKwon	89d9b7cc64	[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode ### What changes were proposed in this pull request? This PR proposes: 1. To introduce `InheritableThread` class, that works identically with `threading.Thread` but it can inherit the inheritable attributes of a JVM thread such as `InheritableThreadLocal`. This was a problem from the pinned thread mode, see also https://github.com/apache/spark/pull/24898. Now it works as below: ```python import pyspark spark.sparkContext.setLocalProperty("a", "hi") def print_prop(): print(spark.sparkContext.getLocalProperty("a")) pyspark.InheritableThread(target=print_prop).start() ``` ``` hi ``` 2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to JVM. In `InheritableThread`, it manually closes the connections when PVM garbage collection happens. So, JVM threads finish safely. I manually verified by profiling but there's also another easy way to verify: ```bash PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python >>> from threading import Thread >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) ``` This issue is fixed now. 3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue. ### Why are the changes needed? To support pinned thread mode properly without a resource leak, and a proper inheritable local properties. ### Does this PR introduce _any_ user-facing change? Yes, it adds an API `InheritableThread` class for pinned thread mode. ### How was this patch tested? Manually tested as described above, and unit test was added as well. Closes #28968 from HyukjinKwon/SPARK-32010. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-30 10:15:25 +09:00
Huaxin Gao	40e6a5bbb0	[SPARK-32449][ML][PYSPARK] Add summary to MultilayerPerceptronClassificationModel ### What changes were proposed in this pull request? Add training summary to MultilayerPerceptronClassificationModel... ### Why are the changes needed? so that user can get the training process status, such as loss value of each iteration and total iteration number. ### Does this PR introduce _any_ user-facing change? Yes MultilayerPerceptronClassificationModel.summary MultilayerPerceptronClassificationModel.evaluate ### How was this patch tested? new tests Closes #29250 from huaxingao/mlp_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-07-29 09:58:25 -05:00
Max Gekk	b2180c0950	[SPARK-32471][SQL][DOCS][TESTS][PYTHON][SS] Describe JSON option `allowNonNumericNumbers` ### What changes were proposed in this pull request? 1. Describe the JSON option `allowNonNumericNumbers` which is used in read 2. Add new test cases for allowed JSON field values: NaN, +INF, +Infinity, Infinity, -INF and -Infinity ### Why are the changes needed? To improve UX with Spark SQL and to provide users full info about the supported option. ### Does this PR introduce _any_ user-facing change? Yes, in PySpark. ### How was this patch tested? Added new test to `JsonParsingOptionsSuite` Closes #29275 from MaxGekk/allowNonNumericNumbers-doc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-29 12:14:13 +09:00
HyukjinKwon	5491c08bf1	Revert "[SPARK-31525][SQL] Return an empty list for df.head() when df is empty" This reverts commit `44a5258ac2`.	2020-07-29 12:07:35 +09:00
Tianshi Zhu	44a5258ac2	[SPARK-31525][SQL] Return an empty list for df.head() when df is empty ### What changes were proposed in this pull request? return an empty list instead of None when calling `df.head()` ### Why are the changes needed? `df.head()` and `df.head(1)` are inconsistent when df is empty. ### Does this PR introduce _any_ user-facing change? Yes. If a user relies on `df.head()` to return None, things like `if df.head() is None:` will be broken. ### How was this patch tested? Closes #29214 from tianshizz/SPARK-31525. Authored-by: Tianshi Zhu <zhutianshirea@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-28 12:32:19 +09:00

1 2 3 4 5 ...

2552 commits