ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
HyukjinKwon	ff284fb6ac	[SPARK-30681][PYTHON][FOLLOW-UP] Keep the name similar with Scala side in higher order functions ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side. Note that there are a bit of inconsistency already e.g.) `col`, `e`, `expr` and `column`. This part I did not change but other names like `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` looks unnecessary. ### Why are the changes needed? To make the usage similar with Scala side, and for consistency. ### Does this PR introduce _any_ user-facing change? No, this is not released yet. ### How was this patch tested? GitHub Actions and Jenkins build will test it out. Closes #31062 from HyukjinKwon/SPARK-30681. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-06 18:46:20 +09:00
Dongjoon Hyun	271c4f6e00	[SPARK-33978][SQL] Support ZSTD compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support ZSTD compression in ORC data source. ### Why are the changes needed? Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost. - https://issues.apache.org/jira/browse/ORC-363 BEFORE ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") ``` ```bash $ orc-tools meta /tmp/zstd Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230] Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc File Version: 0.12 with ORC_14 Rows: 1 Compression: ZSTD Compression size: 262144 Calendar: Julian/Gregorian Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 Stripes: Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 6 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 230 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #31002 from dongjoon-hyun/SPARK-33978. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 00:54:47 -08:00
Yuanjian Li	86c1cfc579	[SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API ### What changes were proposed in this pull request? Follow up work for #30521, document the following behaviors in the API doc: - Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table. - Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created. ### Why are the changes needed? We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Document only. Closes #30885 from xuanyuanking/SPARK-33659. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-24 12:44:37 +09:00
HyukjinKwon	4106731fdd	[SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable ### What changes were proposed in this pull request? This PR proposes to: - Make doctests simpler to show the usage (since we're not running them now). - Use the test utils to drop the tables if exists. ### Why are the changes needed? Better docs and code readability. ### Does this PR introduce _any_ user-facing change? No, dev-only. It includes some doc changes in unreleased branches. ### How was this patch tested? Manually tested. ```bash cd python ./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests" ``` Closes #30873 from HyukjinKwon/SPARK-33836. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2020-12-22 06:27:27 +09:00
Jungtaek Lim	8d4d433191	[SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable ### What changes were proposed in this pull request? This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only way to read and write with table in Structured Streaming. ### Why are the changes needed? Please refer SPARK-32885 and SPARK-32896 for rationalizations of these public APIs. This PR only exposes them to PySpark. ### Does this PR introduce _any_ user-facing change? Yes, PySpark users will be able to read and write with table in Structured Streaming query. ### How was this patch tested? Manually tested. > v1 table >> create table A and ingest to the table A ``` spark.sql(""" create table table_pyspark_parquet ( value long, `timestamp` timestamp ) USING parquet """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5') query.lastProgress query.stop() ``` >> read table A and ingest to the table B which doesn't exist ``` df2 = spark.readStream.table('table_pyspark_parquet') query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2') query2.lastProgress query2.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE table_pyspark_parquet").show() spark.sql("SELECT * FROM table_pyspark_parquet").show() spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show() spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show() ``` > v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well) >> create table A and ingest to the table A ``` spark.sql(""" create table iceberg_catalog.default.table_pyspark_v2table ( value long, `timestamp` timestamp ) USING iceberg """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1') query.lastProgress query.stop() ``` >> ingest to the non-exist table B ``` df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2') query2.lastProgress query2.stop() ``` >> ingest to the non-exist table C partitioned by `value % 10` ``` df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition') query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3') query3.lastProgress query3.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() ``` Closes #30835 from HeartSaVioR/SPARK-33836. Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-21 19:42:59 +09:00
Fokko Driesprong	e4d1c10760	[SPARK-32320][PYSPARK] Remove mutable default arguments This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ ``` fokkodriesprongFan spark % grep -R "={}" python \| grep def python/pyspark/resource/profile.py: def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}): python/pyspark/sql/functions.py:def from_json(col, schema, options={}): python/pyspark/sql/functions.py:def to_json(col, options={}): python/pyspark/sql/functions.py:def schema_of_json(json, options={}): python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}): python/pyspark/sql/functions.py:def to_csv(col, options={}): python/pyspark/sql/functions.py:def from_csv(col, schema, options={}): python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}): ``` ``` fokkodriesprongFan spark % grep -R "=\[\]" python \| grep def python/pyspark/ml/tuning.py: def __init__(self, bestModel, avgMetrics=[], subModels=None): python/pyspark/ml/tuning.py: def __init__(self, bestModel, validationMetrics=[], subModels=None): ``` ### What changes were proposed in this pull request? Removing the mutable default arguments. ### Why are the changes needed? Removing the mutable default arguments, and changing the signature to `Optional[...]`. ### Does this PR introduce _any_ user-facing change? No 👍 ### How was this patch tested? Using the Flake8 bugbear code analysis plugin. Closes #29122 from Fokko/SPARK-32320. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 09:35:36 +08:00
Bryan Cutler	aeb3649fb9	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests ### What changes were proposed in this pull request? This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests. - https://docs.python.org/3/library/unittest.html#deprecated-aliases ### Why are the changes needed? For consistency and eventual removal of deprecated APIs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-01 10:34:40 +09:00
Josh Soref	13fd272cd3	Spelling r common dev mlib external project streaming resource managers python ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `R` * `common` * `dev` * `mlib` * `external` * `project` * `streaming` * `resource-managers` * `python` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-27 10:22:45 -06:00
yangjie01	433ae9064f	[SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV ### What changes were proposed in this pull request? There are some differences between Spark CSV, opencsv and commons-csv, the typical case are described in SPARK-33566, When there are both unescaped quotes and unescaped qualifier in value, the results of parsing are different. The reason for the difference is Spark use `STOP_AT_DELIMITER` as default `UnescapedQuoteHandling` to build `CsvParser` and it not configurable. On the other hand, opencsv and commons-csv use the parsing mechanism similar to `STOP_AT_CLOSING_QUOTE ` by default. So this pr make `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv. ### Why are the changes needed? Make unescapedQuoteHandling option configurable when read CSV to make parsing more flexible。 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Add a new case similar to that described in SPARK-33566 Closes #30518 from LuciferYang/SPARK-33566. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 15:47:39 +09:00
zero323	d082ad0abf	[SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR ### What changes were proposed in this pull request? This PR adds the following functions (introduced in Scala API with SPARK-33061): - `acosh` - `asinh` - `atanh` to Python and R. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? New functions. ### How was this patch tested? New unit tests. Closes #30501 from zero323/SPARK-33563. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 11:00:09 +09:00
zero323	665817bd4f	[SPARK-33457][PYTHON] Adjust mypy configuration ### What changes were proposed in this pull request? This pull request: - Adds following flags to the main mypy configuration: - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional) - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional) - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls) These flags are enabled only for public API and disabled for tests and internal modules. Additionally, these PR fixes missing annotations. ### Why are the changes needed? Primary reason to propose this changes is to use standard configuration as used by typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example https://github.com/apache/spark/pull/29122#pullrequestreview-513112882 Additionally, it will allow us to detect cases where annotations have unintentionally omitted. ### Does this PR introduce _any_ user-facing change? Annotations only. ### How was this patch tested? `dev/lint-python`. Closes #30382 from zero323/SPARK-33457. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-25 09:27:04 +09:00
CC Highman	d338af3101	[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source ### What changes were proposed in this pull request? Two new options, _modifiiedBefore_ and _modifiedAfter_, is provided expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. _PartioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter.` In order to filter file results, a new PathFilter class was derived for this purpose. General house-keeping around classes extending PathFilter was performed for neatness. It became apparent support was needed to handle multiple potential path filters. Logic was introduced for this purpose and the associated tests written. ### Why are the changes needed? When loading files from a data source, there can often times be thousands of file within a respective file path. In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point. This would mean out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a ton of time automatically and reduces significant complexity managing this in code. ### Does this PR introduce _any_ user-facing change? This PR introduces an option that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and usage of the new data source option. Example Usages _Load all CSV files modified after date:_ `spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()` _Load all CSV files modified before date:_ `spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()` _Load all CSV files modified between two dates:_ `spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load() ` ### How was this patch tested? A handful of unit tests were added to support the positive, negative, and edge case code paths. It's also live in a handful of our Databricks dev environments. (quoted from cchighman) Closes #30411 from HeartSaVioR/SPARK-31962. Lead-authored-by: CC Highman <christopher.highman@microsoft.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-23 08:30:41 +09:00
Bryan Cutler	8e2a0bdce7	[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow ### What changes were proposed in this pull request? This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0. ### Why are the changes needed? MapType was previous unsupported with Arrow. ### Does this PR introduce _any_ user-facing change? User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs. ### How was this patch tested? Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs. Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh	7f3d99a8a5	[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc ### What changes were proposed in this pull request? This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They allow foldable string column instead of a string literal now. ### Why are the changes needed? The function doc of `schema_of_csv` and `schema_of_json` are not updated accordingly with previous PRs. ### Does this PR introduce _any_ user-facing change? Yes, update user-facing doc. ### How was this patch tested? Unit test. Closes #30396 from viirya/minor-json-csv. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 11:32:27 +09:00
xuewei.linxuewei	234711a328	Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession" ### What changes were proposed in this pull request? In [SPARK-33139] we defined `setActionSession` and `clearActiveSession` as deprecated API, it turns out it is widely used, and after discussion, even if without this PR, it should work with unify view feature, it might only be a risk if user really abuse using these two API. So revert the PR is needed. [SPARK-33139] has two commit, include a follow up. Revert them both. ### Why are the changes needed? Revert. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30367 from leanken/leanken-revert-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 13:35:45 +00:00
zero323	4b76a74f1c	[SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__ ### What changes were proposed in this pull request? Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`. ### Why are the changes needed? API consistency and improved readability of the expressions. ### Does this PR introduce _any_ user-facing change? Before this change col("abc") col("wąż") result in Column<b'abc'> Column<b'w\xc4\x85\xc5\xbc'> After this change we'll get Column<'abc'> Column<'wąż'> ### How was this patch tested? Existing tests and manual inspection. Closes #30322 from zero323/SPARK-33415. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-12 00:13:17 +09:00
HyukjinKwon	d530ed0ea8	Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends" This reverts commit `b8a440f098`.	2020-11-05 16:15:17 +09:00
zero323	4c8ee8856c	[SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc, desc) ### What changes were proposed in this pull request? This PR adds support for passing `Column`s as input to PySpark sorting functions. ### Why are the changes needed? According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible. ### Does this PR introduce _any_ user-facing change? PySpark users can now provide both `Column` and `str` as an argument for `asc` and `desc` functions. ### How was this patch tested? New unit tests. Closes #30227 from zero323/SPARK-33257. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-03 22:50:59 +09:00
HyukjinKwon	3959f0d987	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) ### What changes were proposed in this pull request? This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243. While I am migrating, I also fixed some Python type hints accordingly. ### Why are the changes needed? For better documentation as text itself, and generated HTMLs ### Does this PR introduce _any_ user-facing change? Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243. ### How was this patch tested? Manually tested via running `./dev/lint-python`. Closes #30181 from HyukjinKwon/SPARK-33250. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-03 10:00:49 +09:00
Max Gekk	bdabf60fb4	[SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for `from_json` ### What changes were proposed in this pull request? Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs. Closes #30201 ### Why are the changes needed? Schemas in JSON format is internal (not documented). It shouldn't be recommenced for usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By linters. Closes #30226 from MaxGekk/from_json-common-schema-parsing-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-02 10:10:24 -08:00
Takuya UESHIN	b8a440f098	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-01 20:28:12 +09:00
Daniel Himmelstein	56587f076d	[SPARK-33310][PYTHON] Relax pyspark typing for sql str functions ### What changes were proposed in this pull request? Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable. ### Why are the changes needed? Convenience & ensuring the typing reflects the functionality ### Does this PR introduce _any_ user-facing change? Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions. ### How was this patch tested? Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this. Closes #30209 from dhimmel/patch-1. Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-01 19:09:12 +09:00
Max Gekk	b409025641	[SPARK-33281][SQL] Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression. ### Why are the changes needed? To unify output of the `schema_of_json()` and `schema_of_csv()`. ### Does this PR introduce _any_ user-facing change? Yes, they can but `schema_of_csv()` is usually used in combination with `from_csv()`, so, the format of schema shouldn't be much matter. Before: ``` > SELECT schema_of_csv('1,abc'); struct<_c0:int,_c1:string> ``` After: ``` > SELECT schema_of_csv('1,abc'); STRUCT<`_c0`: INT, `_c1`: STRING> ``` ### How was this patch tested? By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`. Closes #30180 from MaxGekk/schema_of_csv-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 21:02:10 +09:00
Max Gekk	9d5e48ea95	[SPARK-33270][SQL] Return SQL schema instead of Catalog string from the `SchemaOfJson` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression. ### Why are the changes needed? In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`. Here is the example: ```scala val in = Seq("""{"a b": 1}""").toDS() in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed") ``` raises the exception: ``` == SQL == struct<a b:bigint> ------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131) at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537) at org.apache.spark.sql.functions$.from_json(functions.scala:4141) ``` ### Does this PR introduce _any_ user-facing change? Yes. For example, `schema_of_json` for the input `{"col":0}`. Before: `struct<col:bigint>` After: `STRUCT<`col`: BIGINT>` ### How was this patch tested? By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`. Closes #30172 from MaxGekk/schema_of_json-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 10:30:41 +09:00
Takeshi Yamamuro	a6216e2446	[SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType ### What changes were proposed in this pull request? This PR intends to fix bus for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows; ``` >>> from pyspark.sql import Row >>> from pyspark.sql.functions import col >>> from pyspark.sql.types import * >>> from pyspark.testing.sqlutils import * >>> >>> row = Row(point=ExamplePoint(1.0, 2.0)) >>> df = spark.createDataFrame([row]) >>> df.select(col("point").cast(PythonOnlyUDT())) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select jdf = self._jdf.select(self._jcols(cols)) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco return f(a, **kw) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o44.select. : java.lang.NullPointerException at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84) at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96) at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290) ``` A root cause of this issue is that, since `PythonUserDefinedType#userClassis` always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a null exception. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`. ### Why are the changes needed? Bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30169 from maropu/FixPythonUDTCast. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-28 08:33:02 -07:00
zero323	4e6a310f80	[SPARK-32084][PYTHON][SQL] Expand dictionary functions ### What changes were proposed in this pull request? - [x] Expand dictionary definitions into standalone functions. - [x] Fix annotations for ordering functions. ### Why are the changes needed? To simplify further maintenance of docstrings. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30143 from zero323/SPARK-32084. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 11:05:53 +09:00
HyukjinKwon	66005a3236	[SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical ### What changes were proposed in this pull request? This PR is a small followup of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`. `is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (`87a1cc21ca`). ### Why are the changes needed? To avoid using deprecated APIs, and remove warnings. ### Does this PR introduce _any_ user-facing change? Yes, it will remove warnings that says `is_categorical` is deprecated. ### How was this patch tested? By running any pandas UDF with pandas 1.1.0+: ```python import pandas as pd from pyspark.sql.functions import pandas_udf def func(x: pd.Series) -> pd.Series: return x spark.range(10).select(pandas_udf(func, "long")("id")).show() ``` Before: ``` /.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead ... ``` After: ``` ... ``` Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-10-21 14:46:47 -07:00
xuewei.linxuewei	388e067a90	[SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py ### What changes were proposed in this pull request? In [SPARK-33139](https://github.com/apache/spark/pull/30042), I was using reflect "Class.forName" in python code to invoke method in SparkSession which is not recommended. using getattr to access "SparkSession$.Module$" instead. ### Why are the changes needed? Code refine. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #30092 from leanken/leanken-SPARK-33139-followup. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-19 16:40:48 +09:00
xuewei.linxuewei	306872eefa	[SPARK-33139][SQL] protect setActionSession and clearActiveSession ### What changes were proposed in this pull request? This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession. Change of the PR: * add legacy config spark.sql.legacy.allowModifyActiveSession to fallback to old behavior if user do need to call these two API. * by default, if user call these two API, it will throw exception * add extra two internal and private API setActiveSessionInternal and clearActiveSessionInternal for current internal usage * change all internal reference to new internal API except for SQLContext.setActive and SQLContext.clearActive ### Why are the changes needed? Make SQLConf.get reliable and stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? * Add UT in SparkSessionBuilderSuite to test the legacy config * Existing test Closes #30042 from leanken/leanken-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-16 06:05:17 +00:00
Chuliang Xiao	81d3a8eeca	[MINOR][PYTHON] Fix the typo in the docstring of method agg() ### What changes were proposed in this pull request? Change `df.groupBy.agg()` to `df.groupBy().agg()` in the docstring of `agg()` ### Why are the changes needed? Fix typo in a docstring ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Closes #30060 from ChuliangXiao/patch-1. Authored-by: Chuliang Xiao <ChuliangX@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-10-15 17:24:22 -07:00
zero323	3beab8d8a8	[SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR ### What changes were proposed in this pull request? - Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801). - Add `assert_true` and `raise_error` to SparkR NAMESPACE. - Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004). ### Why are the changes needed? As discussed in review for #29947. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Existing tests. - Validation of annotations using MyPy Closes #29978 from zero323/SPARK-32793-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-09 09:50:45 +09:00
Karen Feng	39510b0e9b	[SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true ## What changes were proposed in this pull request? Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field. `raise_error` is exposed in SQL, Python, Scala, and R. `assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R. ### Why are the changes needed? Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`. ### Does this PR introduce _any_ user-facing change? Yes: - Adds `raise_error` function to the SQL, Python, Scala, and R APIs. - Adds `assert_true` function to the SQL, Python and R APIs. ### How was this patch tested? Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`. Closes #29947 from karenfeng/spark-32793. Lead-authored-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:05:39 +09:00
zero323	473b3ba6aa	[SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark ### What changes were proposed in this pull request? This PR adds `dropFields` method to: - PySpark `Column` - SparkR `Column` ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? No, new API. ### How was this patch tested? - New unit tests. - Manual verification of examples / doctests. - Manual run of MyPy tests Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 10:37:42 +09:00
zero323	72da6f86cf	[SPARK-33002][PYTHON] Remove non-API annotations ### What changes were proposed in this pull request? This PR: - removes annotations for modules which are not part of the public API. - removes `__init__.pyi` files, if no annotations, beyond exports, are present. ### Why are the changes needed? Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests and additional MyPy checks: ``` mypy --no-incremental --config python/mypy.ini python/pyspark MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` Closes #29879 from zero323/SPARK-33002. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 19:53:59 +09:00
Bryan Cutler	0812d6c17c	[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures ### What changes were proposed in this pull request? This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release. ### Why are the changes needed? Current tests will fail with Arrow 2.0.0 because of a change in error message when the schema is invalid. For these cases, the current error message also includes information on disabling safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is show for past Arrow versions, and upcoming. If the user enters an invalid schema, the error produced by pyarrow is not consistent and either `TypeError` or `ArrowInvalid`, with the latter being caught, and raised as a `RuntimeError` with the extra info. The error handling is improved by: - narrowing the exception type to `TypeError`s, which `ArrowInvalid` is a subclass and what is raised on safe conversion failures. - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place. - The original exception is chained to better show it to the user. ### Does this PR introduce _any_ user-facing change? Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error. ### How was this patch tested? Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-06 18:11:24 +09:00
Max Gekk	1b60ff5afe	[MINOR][DOCS] Document when `current_date` and `current_timestamp` are evaluated ### What changes were proposed in this pull request? Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation. And all calls of `current_date`/`current_timestamp` within the same query return the same value ### Why are the changes needed? Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation: `0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running `./dev/scalastyle`. Closes #29892 from MaxGekk/doc-current_date. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-29 05:20:12 +00:00
HyukjinKwon	6868b40517	[SPARK-33020][PYTHON] Add nth_value as a PySpark function ### What changes were proposed in this pull request? `nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API. ### Why are the changes needed? To support the consistent APIs ### Does this PR introduce _any_ user-facing change? Yes, it introduces a new PySpark function API. ### How was this patch tested? Unittest was added. Closes #29899 from HyukjinKwon/SPARK-33020. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 22:14:28 -07:00
HyukjinKwon	376ede1301	[SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py ### What changes were proposed in this pull request? Move functions related test cases from `test_context.py` to `test_functions.py`. ### Why are the changes needed? To group the similar test cases. ### Does this PR introduce _any_ user-facing change? Nope, test-only. ### How was this patch tested? Jenkins and GitHub Actions should test. Closes #29898 from HyukjinKwon/SPARK-33021. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 21:54:00 -07:00
zero323	31a16fbb40	[SPARK-32714][PYTHON] Initial pyspark-stubs port ### What changes were proposed in this pull request? This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. This PR adds type annotations directly to Spark source. This can impact interaction with development tools for users, which haven't used `pyspark-stubs`. ### How was this patch tested? - [x] MyPy tests of the PySpark source ``` mypy --no-incremental --config python/mypy.ini python/pyspark ``` - [x] MyPy tests of Spark examples ``` MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` - [x] Existing Flake8 linter - [x] Existing unit tests Tested against: - `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894` - `mypy==0.782` Closes #29591 from zero323/SPARK-32681. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-24 14:15:36 +09:00
zero323	779f0a84ea	[SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods ### What changes were proposed in this pull request? This PR adjusts signatures of methods decorated with `keyword_only` to indicate using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/). __Note__: For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579 ### Why are the changes needed? Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that. In practice we could probably capture `locals` and drop `keyword_only` completel, i.e: ```python keyword_only def __init__(self, , featuresCol="features"): ... kwargs = self._input_kwargs self.setParams(kwargs) ``` could be replaced with ```python def __init__(self, , featuresCol="features"): kwargs = locals() del kwargs["self"] ... self.setParams(*kwargs) ``` ### Does this PR introduce _any_ user-facing change? Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments. For example with ` LinearSVC` will change from ``` >>> from pyspark.ml.classification import LinearSVC >>> ?LinearSVC.__init__ Signature: LinearSVC.__init__( self, featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, tol=1e-06, rawPredictionCol='rawPrediction', fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2, ) Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2): File: /path/to/python/pyspark/ml/classification.py Type: function ``` to ``` >>> from pyspark.ml.classification import LinearSVC >>> ?LinearSVC.__init__ Signature: LinearSVC.__init__ ( self, , featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, tol=1e-06, rawPredictionCol='rawPrediction', fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2, blockSize=1, ) Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2, blockSize=1): File: ~/Workspace/spark/python/pyspark/ml/classification.py Type: function ``` ### How was this patch tested? Existing tests. Closes #29799 from zero323/SPARK-32933. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-23 09:28:33 +09:00
Max Gekk	7c14f177eb	[SPARK-32306][SQL][DOCS] Clarify the result of `percentile_approx()` ### What changes were proposed in this pull request? More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns one of elements (or array of elements) from the input column. ### Why are the changes needed? To improve Spark docs and avoid misunderstanding of the function behavior. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./dev/scalastyle` Closes #29835 from MaxGekk/doc-percentile_approx. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-09-22 12:45:19 -07:00
zero323	7fb9f6884f	[SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName ### What changes were proposed in this pull request? Add optional `allowMissingColumns` argument to SparkR `unionByName`. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? `unionByName` supports `allowMissingColumns`. ### How was this patch tested? Existing unit tests. New unit tests targeting this feature. Closes #29813 from zero323/SPARK-32799. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-21 09:39:34 +09:00
HyukjinKwon	657e39a334	[SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate ### What changes were proposed in this pull request? In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below: ```python import warnings from pyspark.sql import SparkSession, SQLContext warnings.simplefilter('always', DeprecationWarning) spark.stop() SparkSession.builder.getOrCreate() ``` it shows the deprecation warning as below: ``` /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) ``` via `d3304268d3/python/pyspark/sql/session.py (L222)` We shouldn't print the deprecation warning from it. This is the only place ^. ### Why are the changes needed? To prevent to inform users that `SparkSession.builder.getOrCreate` is deprecated mistakenly. ### Does this PR introduce _any_ user-facing change? Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`. ### How was this patch tested? Manually tested as above. Closes #29768 from HyukjinKwon/SPARK-32897. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2020-09-16 10:13:47 -07:00
zero323	c918909c1a	[SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword ### What changes were proposed in this pull request? Replace `__metaclass__` fields with `metaclass` keyword in the class statements. ### Why are the changes needed? `__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons. ``` >>> from pyspark.sql.types import BooleanType >>> BooleanType() is BooleanType() False ``` and classes, which suppose to be abstract, are not ``` >>> import inspect >>> from pyspark.ml import Estimator >>> inspect.isabstract(Estimator) False ``` ### Does this PR introduce _any_ user-facing change? Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix. ### How was this patch tested? Existing tests. Closes #29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:22:11 +09:00
Adam Binford	e884290587	[SPARK-32835][PYTHON] Add withField method to the pyspark Column class ### What changes were proposed in this pull request? This PR adds a `withField` method on the pyspark Column class to call the Scala API method added in https://github.com/apache/spark/pull/27066. ### Why are the changes needed? To update the Python API to match a new feature in the Scala API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test Closes #29699 from Kimahriman/feature/pyspark-with-field. Authored-by: Adam Binford <adam.binford@radiantsolutions.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:18:36 +09:00
Liang-Chi Hsieh	550c1c9cfb	[SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV ### What changes were proposed in this pull request? This proposes to enhance user document of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API will remove all lines same with the header. ### Why are the changes needed? This behavior can confuse users. We should explicitly document it. ### Does this PR introduce _any_ user-facing change? No. Only doc change. ### How was this patch tested? Only doc change. Closes #29765 from viirya/SPARK-32888. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-16 20:16:15 +09:00
Abhishek Dixit	6f36db1fa5	[SPARK-31448][PYTHON] Fix storage level used in persist() in dataframe.py ### What changes were proposed in this pull request? Since the data is serialized on the Python side, we should make cache() in PySpark dataframes use StorageLevel.MEMORY_AND_DISK mode which has deserialized=false. This change was done to `pyspark/rdd.py` as part of SPARK-2014 but was missed from `pyspark/dataframe.py` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Using existing tests Closes #29242 from abhishekd0907/SPARK-31448. Authored-by: Abhishek Dixit <abhishekdixit0907@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-15 08:41:22 -05:00
Bryan Cutler	e0538bd38c	[SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1 ### What changes were proposed in this pull request? Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0. This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits. Version 1.0.0 - 1.0.0 include the following selected fixes/improvements relevant to Spark users: ARROW-9300 - [Java] Separate Netty Memory to its own module ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators ARROW-8664 - [Java] Add skip null check to all Vector types ARROW-8485 - [Integration][Java] Implement extension types integration ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table ARROW-8230 - [Java] Move Netty memory manager into a separate module ARROW-8229 - [Java] Move ArrowBuf into the Arrow package ARROW-7955 - [Java] Support large buffer for file/stream IPC ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++ ARROW-6110 - [Java] Support LargeList Type and add integration test with C++ ARROW-5760 - [C++] Optimize Take implementation ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column ARROW-9066 - [Python] Raise correct error in isnull() ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class ARROW-7610 - [Java] Finish support for 64 bit int allocations ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison ARROW-8537 - [C++] Performance regression from ARROW-8523 ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults View release notes here: https://arrow.apache.org/release/1.0.1.html https://arrow.apache.org/release/1.0.0.html ### Why are the changes needed? Upgrade brings fixes, improvements and stability guarantees. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests with pyarrow 1.0.0 and 1.0.1 Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-10 14:16:19 +09:00
Wenchen Fan	f7995c576a	Revert "[SPARK-32677][SQL] Load function resource before create" This reverts commit `05fcf26b79`.	2020-09-09 18:15:22 +00:00
itholic	8bd3770552	[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark ### What changes were proposed in this pull request? This PR proposes to add new argument `allowMissingColumns` to `unionByName` for allowing users to specify whether to allow missing columns or not. ### Why are the changes needed? To expose `allowMissingColumns` argument in Python API also. Currently this is only exposed in Scala/Java APIs. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new examples with new argument in the docstring. ### How was this patch tested? Doctest added and manually tested ``` $ python/run-tests --testnames pyspark.sql.dataframe Running PySpark tests. Output is in /.../spark/python/unit-tests.log Will test against the following Python executables: ['/.../python3', 'python3.8'] Will test the following Python tests: ['pyspark.sql.dataframe'] /.../python3 python_implementation is CPython /.../python3 version is: Python 3.8.5 python3.8 python_implementation is CPython python3.8 version is: Python 3.8.5 Starting test(/.../python3): pyspark.sql.dataframe Starting test(python3.8): pyspark.sql.dataframe Finished test(python3.8): pyspark.sql.dataframe (35s) Finished test(/.../python3): pyspark.sql.dataframe (35s) Tests passed in 35 seconds ``` Closes #29657 from itholic/SPARK-32798. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-08 09:41:02 +09:00

1 2 3 4 5 ...

1051 commits