ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dominik Gehl	382fe44b55	[SPARK-36258][PYTHON] Exposing functionExists in pyspark sql catalog ### What changes were proposed in this pull request? Exposing functionExists in pyspark sql catalog ### Why are the changes needed? method was available in scala but not pyspark ### Does this PR introduce _any_ user-facing change? Additional method ### How was this patch tested? Unit tests Closes #33481 from dominikgehl/SPARK-36258. Authored-by: Dominik Gehl <dog@open.ch> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 19:15:41 +09:00
Liang-Chi Hsieh	fd36ed4550	[SPARK-36270][BUILD] Change memory settings for enabling GA ### What changes were proposed in this pull request? Try adjusting the build memory settings and serial execution to re-enable GA. ### Why are the changes needed? GA tests have been failing recently with return code 137. We need to adjust build settings to make GA work. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? GA Closes #33447 from viirya/test-ga. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 19:10:45 +09:00
Takuya UESHIN	2fe12a7520	[SPARK-36261][PYTHON] Add remove_unused_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `remove_unused_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `remove_unused_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `remove_unused_categories`. ### How was this patch tested? Added some tests. Closes #33485 from ueshin/issues/SPARK-36261/remove_unused_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 14:04:59 +09:00
Hyukjin Kwon	d6bc8cd681	[SPARK-36268][PYTHON] Set the lowerbound of mypy version to 0.910 ### What changes were proposed in this pull request? This PR proposes to set a lower bound on the mypy version used in the testing script. ### Why are the changes needed? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141519/console ``` python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types Found 13 errors in 4 files (checked 314 source files) 1 ``` Jenkins installed mypy in SPARK-32797, but the installed version does not seem to be the same as the one on GitHub Actions. It seems difficult to make the codebase compatible with multiple mypy versions. Therefore, this PR sets the lower bound. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Jenkins job in this PR should test it out. Also manually tested: Without mypy: ``` ... flake8 checks passed. The mypy command was not found. Skipping for now. ``` With mypy 0.812: ``` ... flake8 checks passed. The minimum mypy version needs to be 0.910. Your current version is mypy 0.812. Skipping for now. ``` With mypy 0.910: ``` ... flake8 checks passed. starting mypy test... mypy checks passed. all lint-python tests passed! ``` Closes #33487 from HyukjinKwon/SPARK-36268. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 12:28:16 +09:00
Xinrong Meng	8b3d84bb7e	[SPARK-36248][PYTHON] Add rename_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add rename_categories to CategoricalAccessor and CategoricalIndex. ### Why are the changes needed? rename_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas. ### Does this PR introduce _any_ user-facing change? Yes. `rename_categories` is supported in pandas API on Spark now. ```py # CategoricalIndex >>> psser = ps.CategoricalIndex(["a", "a", "b"]) >>> psser.rename_categories([0, 1]) CategoricalIndex([0, 0, 1], categories=[0, 1], ordered=False, dtype='category') >>> psser.rename_categories({'a': 'A', 'c': 'C'}) CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category') >>> psser.rename_categories(lambda x: x.upper()) CategoricalIndex(['A', 'A', 'B'], categories=['A', 'B'], ordered=False, dtype='category') # CategoricalAccessor >>> s = ps.Series(["a", "a", "b"], dtype="category") >>> s.cat.rename_categories([0, 1]) 0 0 1 0 2 1 dtype: category Categories (2, int64): [0, 1] >>> s.cat.rename_categories({'a': 'A', 'c': 'C'}) 0 A 1 A 2 b dtype: category Categories (2, object): ['A', 'b'] >>> s.cat.rename_categories(lambda x: x.upper()) 0 A 1 A 2 B dtype: category Categories (2, object): ['A', 'B'] ``` ### How was this patch tested? Unit tests. Closes #33471 from xinrong-databricks/category_rename_categories. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 12:26:24 +09:00
Xinrong Meng	75fd1f5b82	[SPARK-36189][PYTHON] Improve bool, string, numeric DataTypeOps tests by avoiding joins ### What changes were proposed in this pull request? Improve bool, string, numeric DataTypeOps tests by avoiding joins. Previously, bool, string, and numeric DataTypeOps tests were conducted between two different Series. After this PR, they are performed on a single DataFrame. ### Why are the changes needed? A considerable number of DataTypeOps tests have operations on different Series, so joining is needed, which takes a long time. We should avoid joins for a shorter test duration. The majority of joins happen in bool, string, numeric DataTypeOps tests, so we improve them first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #33402 from xinrong-databricks/datatypeops_diffframe. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 12:20:35 +09:00
Takuya UESHIN	a76a087f7f	[SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings ### What changes were proposed in this pull request? Use `Column.__getitem__` instead of `Column.getItem` to suppress warnings. ### Why are the changes needed? In pandas API on Spark code base, there are some places using `Column.getItem` with `Column` object, but it shows a deprecation warning. ### Does this PR introduce _any_ user-facing change? Yes, users won't see the warnings anymore. - before ```py >>> s = ps.Series(list("abbccc"), dtype="category") >>> s.astype(str) /path/to/spark/python/pyspark/sql/column.py:322: FutureWarning: A column as 'key' in getItem is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]` or `column.key` syntax instead. warnings.warn( 0 a 1 b 2 b 3 c 4 c 5 c dtype: object ``` - after ```py >>> s = ps.Series(list("abbccc"), dtype="category") >>> s.astype(str) 0 a 1 b 2 b 3 c 4 c 5 c dtype: object ``` ### How was this patch tested? Existing tests. Closes #33486 from ueshin/issues/SPARK-36265/getitem. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 11:27:31 +09:00
Dongjoon Hyun	a1a197403b	[SPARK-36262][BUILD] Upgrade ZSTD-JNI to 1.5.0-4 ### What changes were proposed in this pull request? This PR aims to upgrade ZSTD-JNI to 1.5.0-4. ### Why are the changes needed? ZSTD-JNI 1.5.0-3 has a packaging issue. 1.5.0-4 is recommended to be used instead. - https://github.com/luben/zstd-jni/issues/181#issuecomment-885138495 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33483 from dongjoon-hyun/SPARK-36262. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-22 14:03:59 -07:00
Sean Owen	b69c26833c	[SPARK-35848][MLLIB] Optimize some treeAggregates in MLlib by delaying allocations ### What changes were proposed in this pull request? Optimize some treeAggregates in MLlib by delaying allocating (thus not sending around) large arrays of zeroes This uses the same idea as in https://github.com/apache/spark/pull/23600/files ### Why are the changes needed? Allocating huge arrays of zeroes takes additional memory and network I/O which is unnecessary in some cases. It can cause operations to run out of memory that might otherwise succeed. Specifically, this should prevent the 'zero' value from having to be (pointlessly) checked for serializability, which can fail when passing through the default JavaSerializer; it would also prevent allocating and sending large 'zero' values for an empty partition in the aggregate. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33443 from srowen/SPARK-35848. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-22 13:59:09 -05:00
Sean Owen	518f00fd78	[SPARK-35310][MLLIB] Update to breeze 1.2 ### What changes were proposed in this pull request? Update to the latest breeze 1.2 ### Why are the changes needed? Minor bug fixes ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests Closes #33449 from srowen/SPARK-35310. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-22 13:58:01 -05:00
Kousuke Saruta	07fa38e2c1	[SPARK-35815][SQL] Allow delayThreshold for watermark to be represented as ANSI interval literals ### What changes were proposed in this pull request? This PR extends the way to represent `delayThreshold` with ANSI interval literals for watermark. ### Why are the changes needed? A `delayThreshold` is semantically an interval value, so it should be representable as ANSI interval literals as well as in the conventional `1 second` form. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #33456 from sarutak/delayThreshold-interval. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-07-22 17:36:22 +03:00
Dominik Gehl	3a1db2ddd4	[SPARK-36209][PYTHON][DOCS] Fix link to pyspark Dataframe documentation ### What changes were proposed in this pull request? Bugfix: link to the correct location of the PySpark DataFrame documentation ### Why are the changes needed? Current website returns "Not found" ### Does this PR introduce _any_ user-facing change? Website fix ### How was this patch tested? Documentation change Closes #33420 from dominikgehl/feature/SPARK-36209. Authored-by: Dominik Gehl <dog@open.ch> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-22 08:07:00 -05:00
Gengliang Wang	ae9f6126fb	[SPARK-36257][SQL] Updated the version of TimestampNTZ related changes as 3.3.0 ### What changes were proposed in this pull request? As we decided to release TimestampNTZ type in Spark 3.3, we should update the versions of TimestampNTZ related changes as 3.3.0. ### Why are the changes needed? Correct the versions in documentation/code comment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #33478 from gengliangwang/updateVersion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-22 21:01:29 +08:00
Kousuke Saruta	13aefd6a66	[SPARK-36256][BUILD] Upgrade lz4-java to 1.8.0 ### What changes were proposed in this pull request? This PR upgrades `lz4-java` to `1.8.0`, which includes not only performance improvement but also Darwin aarch64 support. https://github.com/lz4/lz4-java/releases/tag/1.8.0 https://github.com/lz4/lz4-java/blob/1.8.0/CHANGES.md ### Why are the changes needed? For providing better performance and platform support. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. Closes #33476 from sarutak/upgrade-lz4-java-1.8.0. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-22 20:39:59 +08:00
itholic	86471ad668	[SPARK-36239][PYTHON][DOCS] Remove some APIs from documentation ### What changes were proposed in this pull request? This PR proposes removing some APIs from the pandas-on-Spark documentation, because they can easily be worked around via Spark DataFrame or Column functions and might be removed in the future. ### Why are the changes needed? Because we don't want to expose some functions as a public API. ### Does this PR introduce _any_ user-facing change? APIs such as `(Series\|Index).spark.data_type`, `(Series\|Index).spark.nullable`, `DataFrame.spark.schema`, `DataFrame.spark.print_schema`, `DataFrame.pandas_on_spark.attach_id_column`, `DataFrame.spark.checkpoint`, `DataFrame.spark.localcheckpoint` and `DataFrame.spark.explain` are removed from the documentation. ### How was this patch tested? Manually build the documents. Closes #33458 from itholic/SPARK-36239. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 19:46:40 +09:00
gengjiaan	900b72a9cd	[SPARK-35088][SQL][FOLLOWUP] Add test case for TimestampNTZ sequence with default step ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/33360 and adds a test case for `TimestampNTZ` sequence with default step. ### Why are the changes needed? Improve test coverage. ### Does this PR introduce _any_ user-facing change? 'No'. Just add test cases. ### How was this patch tested? New tests. Closes #33462 from beliefer/SPARK-36090-followup. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-22 17:53:22 +08:00
Angerszhuuuu	bb09bd2e2d	[SPARK-36156][SQL] SCRIPT TRANSFORM ROW FORMAT DELIMITED should respect `NULL DEFINED AS` and default value should be `\N` ### What changes were proposed in this pull request? SCRIPT TRANSFORM ROW FORMAT DELIMITED should respect `NULL DEFINED AS` and default value should be `\N` ![image](https://user-images.githubusercontent.com/46485123/125775377-611d4f06-f9e5-453a-990d-5a0018774f43.png) ![image](https://user-images.githubusercontent.com/46485123/125775387-6618bd0c-78d8-4457-bcc2-12dd70522946.png) ### Why are the changes needed? Keep consistency with Hive ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33363 from AngersZhuuuu/SPARK-36156. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-22 17:28:37 +08:00
Dominik Gehl	2c35604044	[SPARK-36243][SQL][PYTHON][DOCS] Fixing pyspark tableExists issue with temporary views ### What changes were proposed in this pull request? Additional tests for pyspark tableExists with regard to views and temporary views ### Why are the changes needed? The Scala documentation indicates that tableExists works for tables/views and also temporary views. These unit tests try to verify that claim. While views seem ok, temporary views don't seem to work. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? tests Closes #33461 from dominikgehl/bug/SPARK-36243. Lead-authored-by: Dominik Gehl <dog@open.ch> Co-authored-by: Dominik Gehl <gehl@fastmail.fm> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 18:12:01 +09:00
Enrico Minack	4e9c1b8ba0	[SPARK-34806][SQL] Add Observation helper for Dataset.observe ### What changes were proposed in this pull request? This pull request introduces a helper class that simplifies usage of `Dataset.observe()` for batch datasets: val observation = Observation("name") val observed = ds.observe(observation, max($"id").as("max_id")) observed.count() val metrics = observation.get ### Why are the changes needed? Currently, users are required to implement the `QueryExecutionListener` interface to retrieve the metrics, as well as apply some knowledge on threading and locking to pull the metrics over to the main thread. With the helper class, metrics can be retrieved from batch dataset processing with three lines of code (the action on the observed dataset does not count as a line of code here). ### Does this PR introduce _any_ user-facing change? Yes, one new class and one `Dataset` method. ### How was this patch tested? Adds a unit test to `DataFrameSuite`, similar to `"get observable metrics by callback"` in `DataFrameCallbackSuite`. Closes #33422 from EnricoMi/branch-observation. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-22 08:57:04 +00:00
itholic	d1a037a27c	[SPARK-35810][PYTHON][FOLLWUP] Deprecate ps.broadcast API ### What changes were proposed in this pull request? This PR follows up #33379 to fix a build error in Sphinx ### Why are the changes needed? The Sphinx build fails because of a missing newline in a docstring ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test the Sphinx build Closes #33479 from itholic/SPARK-35810-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 17:10:03 +09:00
itholic	6578f0b135	[SPARK-35809][PYTHON] Add `index_col` argument for ps.sql ### What changes were proposed in this pull request? This PR proposes adding an `index_col` argument to the `ps.sql` function, to preserve the index when users want it. Note that `reset_index()` has to be performed before using `ps.sql` with `index_col`. ```python >>> psdf A B a 1 4 b 2 5 c 3 6 >>> psdf_reset_index = psdf.reset_index() >>> ps.sql("SELECT * from {psdf_reset_index} WHERE A > 1", index_col="index") A B index b 2 5 c 3 6 ``` Otherwise, the index is always lost. ```python >>> ps.sql("SELECT * from {psdf} WHERE A > 1") A B 0 2 5 1 3 6 ``` ### Why are the changes needed? The index is one of the key objects for existing pandas users, so we should provide a way to keep the index after computing `ps.sql`. ### Does this PR introduce _any_ user-facing change? Yes, the new argument is added. ### How was this patch tested? Add a unit test and manually check the build pass. Closes #33450 from itholic/SPARK-35809. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 17:08:34 +09:00
Takuya UESHIN	a3c7ae18e2	[SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `remove_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `remove_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `remove_categories`. ### How was this patch tested? Added some tests. Closes #33474 from ueshin/issues/SPARK-36249/remove_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 17:06:12 +09:00
Holden Karau	89a83196ac	[SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake ### What changes were proposed in this pull request? GHA probably doesn't have the same resources as jenkins so move down from 5 to 3 execs and give a bit more time for them to come up. ### Why are the changes needed? Test is timing out in GHA ### Does this PR introduce _any_ user-facing change? No, test only change. ### How was this patch tested? Run through GHA verify no OOM during WorkerDecommissionExtended Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA. Lead-authored-by: Holden Karau <holden@pigscanfly.ca> Co-authored-by: Holden Karau <hkarau@netflix.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 15:17:48 +09:00
Takuya UESHIN	dcc0aaa3ef	[SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `add_categories`. ### How was this patch tested? Added some tests. Closes #33470 from ueshin/issues/SPARK-36214/add_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-21 22:34:04 -07:00
Hyukjin Kwon	f3e29574d9	[SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package ### What changes were proposed in this pull request? This PR adds the version that added pandas API on Spark in PySpark documentation. ### Why are the changes needed? To document the version added. ### Does this PR introduce _any_ user-facing change? No to end user. Spark 3.2 is not released yet. ### How was this patch tested? Linter and documentation build. Closes #33473 from HyukjinKwon/SPARK-36253. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 14:21:43 +09:00
allisonwang-db	de8e4be92c	[SPARK-36063][SQL] Optimize OneRowRelation subqueries ### What changes were proposed in this pull request? This PR adds optimization for scalar and lateral subqueries with OneRowRelation as leaf nodes. It inlines such subqueries before decorrelation to avoid rewriting them as left outer joins. It also introduces a flag to turn on/off this optimization: `spark.sql.optimizer.optimizeOneRowRelationSubquery` (default: True). For example: ```sql select (select c1) from t ``` Analyzed plan: ``` Project [scalar-subquery#17 [c1#18] AS scalarsubquery(c1)#22] : +- Project [outer(c1#18)] : +- OneRowRelation +- LocalRelation [c1#18, c2#19] ``` Optimized plan before this PR: ``` Project [c1#18#25 AS scalarsubquery(c1)#22] +- Join LeftOuter, (c1#24 <=> c1#18) :- LocalRelation [c1#18] +- Aggregate [c1#18], [c1#18 AS c1#18#25, c1#18 AS c1#24] +- LocalRelation [c1#18] ``` Optimized plan after this PR: ``` LocalRelation [scalarsubquery(c1)#22] ``` ### Why are the changes needed? To optimize query plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new unit tests. Closes #33284 from allisonwang-db/spark-36063-optimize-subquery-one-row-relation. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-22 10:48:32 +08:00
Kousuke Saruta	dcb7db5370	[SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation ### What changes were proposed in this pull request? This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`. `1.5.0-3` was released a few days ago. This release resolves an issue with buffer size calculation, which can affect usage in Spark. https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3 ### Why are the changes needed? A skipping length greater than `2^31 - 1` might be a corner case, but it can affect Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-21 19:37:05 -07:00
Fu Chen	09bebc8bde	[SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv` ### What changes were proposed in this pull request? Rework [PR](https://github.com/apache/spark/pull/33212) with suggestions. This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`. Spark should turn a non-nullable schema into a nullable one when using the `spark.read.json()` API by default. Here is an example: ```scala val schema = StructType(Seq(StructField("value", StructType(Seq( StructField("x", IntegerType, nullable = false), StructField("y", IntegerType, nullable = false) )), nullable = true ))) val testDS = Seq("""{"value":{"x":1}}""").toDS spark.read .schema(schema) .json(testDS) .printSchema() spark.read .schema(schema) .format("json") .load("/tmp/json/t1") .printSchema() // root // \|-- value: struct (nullable = true) // \| \|-- x: integer (nullable = true) // \| \|-- y: integer (nullable = true) ``` Before this PR: ``` // output of spark.read.json() root \|-- value: struct (nullable = true) \| \|-- x: integer (nullable = false) \| \|-- y: integer (nullable = false) ``` After this PR: ``` // output of spark.read.json() root \|-- value: struct (nullable = true) \| \|-- x: integer (nullable = true) \| \|-- y: integer (nullable = true) ``` - `spark.read.csv()` also has the same problem. - The Datasource API `spark.read.format("json").load("path")` applies this logic when resolving the relation. `c77acf0bbc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L415-L421)` ### Does this PR introduce _any_ user-facing change? Yes, `spark.read.json()` and `spark.read.csv()` do not respect the nullability of the user-given schema and always turn it into a nullable schema by default. ### How was this patch tested? New test. Closes #33436 from cfmcgrady/SPARK-35912-v3. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 11:12:36 +09:00
shane knapp	ad528a007a	[SPARK-32797][SPARK-32391][SPARK-33242][SPARK-32666][ANSIBLE] updating a bunch of python packages ### What changes were proposed in this pull request? updating the anaconda py36 environment file ### Why are the changes needed? see: https://issues.apache.org/jira/browse/SPARK-32666 https://issues.apache.org/jira/browse/SPARK-33242 https://issues.apache.org/jira/browse/SPARK-32391 https://issues.apache.org/jira/browse/SPARK-32797 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? jenkins will test this Closes #33469 from shaneknapp/updating-python-paks. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2021-07-21 15:22:06 -07:00
Takuya UESHIN	d506815a92	[SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add categories setter to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement categories setter in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use categories setter. ### How was this patch tested? Added some tests. Closes #33448 from ueshin/issues/SPARK-36188/categories_setter. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-21 11:31:30 -07:00
Kent Yao	4cd6cfc773	[SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec ### What changes were proposed in this pull request? This fixes a case sensitivity issue for desc table commands with partition specified. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? yes, but it's a bugfix ### How was this patch tested? new tests #### before ``` +-- !query +DESC EXTENDED t PARTITION (C='Us', D=1) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`' + ``` #### after https://github.com/apache/spark/pull/33424/files#diff-554189c49950974a948f99fa9b7436f615052511660c6a0ae3062fa8ca0a327cR328 Closes #33424 from yaooqinn/SPARK-36213. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>	2021-07-22 00:52:31 +08:00
Shardul Mahadik	685c3fd05b	[SPARK-28266][SQL] convertToLogicalRelation should not interpret `path` property when reading Hive tables ### What changes were proposed in this pull request? For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](`fbf53dee37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (L128)`)) rather than using the Hive serde. If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](`fbf53dee37/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L575)`)) This can lead to wrong data if the `path` property points to a directory location, or to an error if `path` is not a location. A concrete example is provided in [SPARK-28266 (comment)](https://issues.apache.org/jira/browse/SPARK-28266?focusedCommentId=17380170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17380170). Since these tables were not written through Spark, Spark should not interpret this `path` property as it can be set by an external system with a different meaning. ### Why are the changes needed? For better compatibility with Hive tables generated by other platforms (non-Spark) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #33328 from shardulm94/spark-28266. Authored-by: Shardul Mahadik <smahadik@linkedin.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-21 22:40:39 +08:00
Wenchen Fan	9c8a3d3975	[SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed ### What changes were proposed in this pull request? Sometimes, AQE skew join optimization can fail with an NPE. This is because AQE tries to get the shuffle block sizes, but some map outputs are missing, for example because an executor was lost. This PR fixes this bug by skipping skew join handling if some map outputs are missing in the `MapOutputTracker`. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new UT Closes #33445 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-21 22:17:56 +08:00
Kousuke Saruta	f56c7b71ff	[SPARK-36208][SQL] SparkScriptTransformation should support ANSI interval types ### What changes were proposed in this pull request? This PR changes `BaseScriptTransformationExec` for `SparkScriptTransformationExec` to support ANSI interval types. ### Why are the changes needed? `SparkScriptTransformationExec` supports `CalendarIntervalType`, so it's better to support ANSI interval types as well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33419 from sarutak/script-transformation-interval. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-07-21 15:13:01 +03:00
Wenchen Fan	94aece4325	[SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/33222 . https://github.com/apache/spark/pull/33222 made a mistake that, `RemoveRedundantProjects` may lose the `LOGICAL_PLAN_TAG` tag, even though the logical plan link is retained. This was actually caught by the test `LogicalPlanTagInSparkPlanSuite`, but was not being taken care of. There is no problem so far, but losing information can always lead to potential bugs. ### Why are the changes needed? fix a mistake ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test Closes #33442 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-21 14:03:06 +08:00
Rahul Mahadev	efcce23b91	[SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState ### What changes were proposed in this pull request? Adding support for accepting an initial state with flatMapGroupsWithState in batch mode. ### Why are the changes needed? SPARK-35897 added support for accepting an initial state for streaming queries using flatMapGroupsWithState. The code flow is separate for batch and streaming, so batch mode required a different PR. ### Does this PR introduce _any_ user-facing change? Yes, as discussed above, flatMapGroupsWithState in batch mode can accept an initialState; previously this would throw an UnsupportedOperationException ### How was this patch tested? Added relevant unit tests in FlatMapGroupsWithStateSuite and modified the tests in `JavaDatasetSuite` Closes #33336 from rahulsmahadev/flatMapGroupsWithStateBatch. Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2021-07-21 01:48:58 -04:00
Liang-Chi Hsieh	df798ed301	[SPARK-36030][SQL][FOLLOW-UP] Remove duplicated test suite ### What changes were proposed in this pull request? Removes `FileFormatDataWriterMetricSuite`, which was duplicated. ### Why are the changes needed? `FileFormatDataWriterMetricSuite` should have been renamed to `InMemoryTableMetricSuite`, but it was wrongly copied instead. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #33453 from viirya/SPARK-36030-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-07-20 22:12:21 -07:00
Hyukjin Kwon	99006e515b	[SPARK-36030][SQL][FOLLOW-UP] Avoid procedure syntax deprecated in Scala 2.13 ### What changes were proposed in this pull request? This PR avoids using the procedure syntax deprecated in Scala 2.13. https://github.com/apache/spark/runs/3120481756?check_suite_focus=true ``` [error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type [error] private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) { [error] ^ [error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/InMemoryTableMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type [error] private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) { [error] ^ [warn] 100 warnings found [error] two errors found [error] (sql / Test / compileIncremental) Compilation failed [error] Total time: 579 s (09:39), completed Jul 21, 2021 4:14:26 AM ``` ### Why are the changes needed? To make the build compatible with Scala 2.13 in Spark. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested: ```bash ./dev/change-scala-version.sh 2.13 ./build/mvn -DskipTests -Phive-2.3 -Phive clean package -Pscala-2.13 ``` Closes #33452 from HyukjinKwon/SPARK-36030. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-21 14:09:27 +09:00
Liang-Chi Hsieh	2653201b0a	[SPARK-36030][SQL] Support DS v2 metrics at writing path ### What changes were proposed in this pull request? We add the interface for DS v2 metrics in SPARK-34366. It is only added for reading path, though. This patch extends the metrics interface to writing path. ### Why are the changes needed? Complete DS v2 metrics interface support in writing path. ### Does this PR introduce _any_ user-facing change? No. For developer, yes, as this adds metrics support at DS v2 writing path. ### How was this patch tested? Added test. Closes #33239 from viirya/v2-write-metrics. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-07-20 20:20:35 -07:00
Angerszhuuuu	305d563329	[SPARK-36153][SQL][DOCS] Update transform doc to match the current code ### What changes were proposed in this pull request? Update the transform doc to match the latest code. ![image](https://user-images.githubusercontent.com/46485123/126175747-672cccbc-4e42-440f-8f1e-f00b6dc1be5f.png) ### Why are the changes needed? To keep the doc consistent with the code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Closes #33362 from AngersZhuuuu/SPARK-36153. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-20 21:38:37 -05:00
Gidon Gershinsky	7ceefcace9	[SPARK-35658][DOCS] Document Parquet encryption feature in Spark SQL ### What changes were proposed in this pull request? Spark 3.2.0 will use parquet-mr.1.12.0 version (or higher), that contains the column encryption feature which can be called from Spark SQL. The aim of this PR is to document the use of Parquet encryption in Spark. ### Why are the changes needed? - To provide information on how to use Parquet column encryption ### Does this PR introduce _any_ user-facing change? Yes, documents a new feature. ### How was this patch tested? bundle exec jekyll build Closes #32895 from ggershinsky/parquet-encryption-doc. Authored-by: Gidon Gershinsky <ggershinsky@apple.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-20 21:35:47 -05:00
Jie	1a8c6755a1	[SPARK-35027][CORE] Close the inputStream in FileAppender when writin… ### What changes were proposed in this pull request? 1. add "closeStreams" to FileAppender and RollingFileAppender 2. set "closeStreams" to "true" in ExecutorRunner ### Why are the changes needed? The executor will hang when the disk is full or another exception occurs while writing to the outputStream: the root cause is that the "inputStream" is not closed after the error happens: 1. ExecutorRunner creates two file appenders for the pipe: one for stdout, one for stderr 2. FileAppender.appendStreamToFile exits the loop when writing to the outputStream 3. FileAppender closes the outputStream, but leaves open the inputStream, which refers to the pipe's stdout and stderr 4. The executor will hang when printing the log message if the pipe is full (no one consumes the output) 5. From the driver side, you can see that the task never completes With this fix, step 4 will throw an exception, and the driver can catch the exception and reschedule the failed task on other executors. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new tests for the "closeStreams" in FileAppenderSuite Closes #33263 from jhu-chang/SPARK-35027. Authored-by: Jie <gt.hu.chang@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-20 21:23:51 -05:00
Jungtaek Lim	0eb31a06d6	[SPARK-36172][SS] Document session window into Structured Streaming guide doc ### What changes were proposed in this pull request? This PR documents a new feature "native support of session window" into Structured Streaming guide doc. Screenshots are following: ![스크린샷 2021-07-20 오후 5 04 20](https://user-images.githubusercontent.com/1317309/126284848-526ec056-1028-4a70-a1f4-ae275d4b5437.png) ![스크린샷 2021-07-20 오후 3 34 38](https://user-images.githubusercontent.com/1317309/126276763-763cf841-aef7-412a-aa03-d93273f0c850.png) ### Why are the changes needed? This change is needed to explain a new feature to the end users. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation changes. Closes #33433 from HeartSaVioR/SPARK-36172. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-07-21 10:45:31 +09:00
Takuya UESHIN	376fadc89c	[SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `as_ordered`/`as_unordered`. ### How was this patch tested? Added some tests. Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-20 18:23:54 -07:00
gengjiaan	c0d84e6cf1	[SPARK-36222][SQL] Step by days in the Sequence expression for dates ### What changes were proposed in this pull request? The current implementation of the `Sequence` expression does not support stepping by days for dates. ``` spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' day); Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch: sequence uses the wrong parameter type. The parameter type must conform to: 1. The start and stop expressions must resolve to the same type. 2. If start and stop expressions resolve to the 'date' or 'timestamp' type then the step expression must resolve to the 'interval' or 'interval year to month' or 'interval day to second' type, otherwise to the same type as the start and stop expressions. ; line 1 pos 7; 'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' DAY), Some(Europe/Moscow)), None)] +- OneRowRelation ``` ### Why are the changes needed? A `DayTimeInterval` with day granularity should be usable as the step for dates. ### Does this PR introduce _any_ user-facing change? 'Yes'. The Sequence expression will support stepping by a `DayTimeInterval` with day granularity for dates. ### How was this patch tested? New tests. Closes #33439 from beliefer/SPARK-36222. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-07-20 19:16:56 +03:00
Koert Kuipers	bf680bf25a	[SPARK-36210][SQL] Preserve column insertion order in Dataset.withColumns ### What changes were proposed in this pull request? Preserve the insertion order of columns in Dataset.withColumns ### Why are the changes needed? It is the expected behavior. We preserve insertion order in all other places. ### Does this PR introduce _any_ user-facing change? No. Currently Dataset.withColumns is not actually used anywhere to insert more than one column. This change is to make sure it behaves as expected when it is used for that purpose in future. ### How was this patch tested? Added test in DatasetSuite Closes #33423 from koertkuipers/feat-withcolumns-preserve-order. Authored-by: Koert Kuipers <koert@tresata.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-07-20 09:09:22 -07:00
Karen Feng	ddc61e62b9	[SPARK-36079][SQL] Null-based filter estimate should always be in the range [0, 1] ### What changes were proposed in this pull request? Forces the selectivity estimate for null-based filters to be in the range `[0,1]`. ### Why are the changes needed? I noticed in a few TPC-DS query tests that the column statistic null count can be higher than the table statistic row count. In the current implementation, the selectivity estimate for `IsNotNull` is negative. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #33286 from karenfeng/bound-selectivity-est. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-20 21:32:13 +08:00
gengjiaan	033a5731b4	[SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ ### What changes were proposed in this pull request? This PR follows https://github.com/apache/spark/pull/33299 and implements `prettyName` for `MakeTimestampNTZ` and `MakeTimestampLTZ` based on the discussion shown below https://github.com/apache/spark/pull/33299/files#r668423810 ### Why are the changes needed? This PR fixes the incorrect alias use case. ### Does this PR introduce _any_ user-facing change? 'No'. Modifications are transparent to users. ### How was this patch tested? Jenkins test. Closes #33430 from beliefer/SPARK-36046-followup. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-20 21:31:00 +08:00
Hyukjin Kwon	801b369bd0	[SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build ### What changes were proposed in this pull request? Scala 2.13 daily job was added but ideally we should deduplicate it. This PR targets to deduplicate it by creating one more job (`configure-jobs`) that the main job depends on. `configure-jobs` will properly set the branch, envs, etc. to run the main build properly. ### Why are the changes needed? To make the maintenance easier ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? See - https://github.com/HyukjinKwon/spark/actions/runs/1044636792 for a PR - https://github.com/HyukjinKwon/spark/actions/runs/1048542984 for a cron job Closes #33410 from HyukjinKwon/SPARK-36204. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-20 22:21:27 +09:00
Dominik Gehl	463fcb3723	[SPARK-36207][PYTHON] Expose databaseExists in pyspark.sql.catalog ### What changes were proposed in this pull request? Expose databaseExists in pyspark.sql.catalog ### Why are the changes needed? Was available in scala, but not in pyspark ### Does this PR introduce _any_ user-facing change? New method databaseExists ### How was this patch tested? Unit tests in codebase Closes #33416 from dominikgehl/feature/SPARK-36207. Lead-authored-by: Dominik Gehl <dog@open.ch> Co-authored-by: Dominik Gehl <gehl@fastmail.fm> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-20 22:10:06 +09:00


30841 commits