ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Min Shen	c6b683e5a2	[SPARK-36378][SHUFFLE] Switch to using RPCResponse to communicate common block push failures to the client We have run performance evaluations on the version of push-based shuffle committed to upstream so far, and have identified a few places for further improvements: 1. On the server side, we have noticed that the usage of `String.format`, especially when receiving a block push request, has a much higher overhead compared with string concatenation. 2. On the server side, the usage of `Throwables.getStackTraceAsString` in the `ErrorHandler.shouldRetryError` and `ErrorHandler.shouldLogError` has generated quite some overhead. These 2 issues are related to how we are currently handling certain common block push failures. We are communicating such failures via `RPCFailure` by transmitting the exception stack trace. This generates the overhead on both server and client side for creating these exceptions and makes checking the type of failures fragile and inefficient with string matching of exception stack trace. To address these, this PR also proposes to encode the common block push failure as an error code and send that back to the client with a proper RPC message. Improve shuffle service efficiency for push-based shuffle. Improve code robustness for handling block push failures. No Existing unit tests. Closes #33613 from Victsm/SPARK-36378. Lead-authored-by: Min Shen <mshen@linkedin.com> Co-authored-by: Min Shen <victor.nju@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit `3f09093a21`) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>	2021-08-10 16:54:21 -05:00
Kazuyuki Tanimura	c230ca95e6	[SPARK-36464][CORE] Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data ### What changes were proposed in this pull request? The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` This PR proposes to change the underlying `_size` variable from `Int` to `Long` at the initialization ### Why are the changes needed? Be cause the `size` method of `ChunkedByteBufferOutputStream` incorrectly returns a negative value when over 2GB data is written. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passed existing tests ``` build/sbt "core/testOnly ChunkedByteBufferOutputStreamSuite" ``` Also added a new unit test ``` build/sbt "core/testOnly ChunkedByteBufferOutputStreamSuite – -z SPARK-36464" ``` Closes #33690 from kazuyukitanimura/SPARK-36464. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit `c888bad6a1`) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-08-10 10:30:07 -07:00
gengjiaan	6018d44280	[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported ### What changes were proposed in this pull request? Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`. ``` -- !query select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>> -- !query output {"t":null} ``` ``` -- !query select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<> -- !query output java.lang.Exception Unsupported type: timestamp_ntz ``` We should make `from_json` throws exception too. This PR fix the discussion below https://github.com/apache/spark/pull/33640#discussion_r682862523 ### Why are the changes needed? Make the behavior of `from_json` more reasonable. ### Does this PR introduce _any_ user-facing change? 'Yes'. from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ. ### How was this patch tested? Tests updated. Closes #33684 from beliefer/SPARK-36429-new. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `186815be1c`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-10 22:52:39 +08:00
Angerszhuuuu	fb6f3792af	[SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields ### What changes were proposed in this pull request? Support TypeCoercion of ANSI intervals with different fields ### Why are the changes needed? Support TypeCoercion of ANSI intervals with different fields ### Does this PR introduce _any_ user-facing change? After this pr user can - use comparison function with different fields of DayTimeIntervalType/YearMonthIntervalType such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH` - support different field of ansi interval type in collection function such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)` - support different field of ansi interval type in `coalesce` etc.. ### How was this patch tested? Added UT Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `89d8a4eacf`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-10 14:22:47 +03:00
Cheng Pan	45acd00dd6	[SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly ### What changes were proposed in this pull request? Retain `spark.sql.catalog.*` confs when resolving view. ### Why are the changes needed? Currently, if a view in default catalog ref a table in another catalog (e.g. jdbc), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add UT. Closes #33692 from pan3793/SPARK-36466. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `7f56b73cad`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-10 17:31:36 +08:00
Terry Kim	882ef6dd73	[SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. ### Why are the changes needed? To check the duplicate columns during analysis. ### Does this PR introduce _any_ user-facing change? Yes, now the above will command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33676 from imback82/replace_cols_duplicates. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `e1a5d94117`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-10 13:20:50 +08:00
Venkata krishnan Sowrirajan	1a432fe6bb	[SPARK-36460][SHUFFLE] Pull out NoOpMergedShuffleFileManager inner class outside ### What changes were proposed in this pull request? Pull out NoOpMergedShuffleFileManager inner class outside. This is required since passing dollar sign ($) for the config (`spark.shuffle.server.mergedShuffleFileManagerImpl`) value can be an issue. Currently `spark.shuffle.server.mergedShuffleFileManagerImpl` is by default set to `org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager`. After this change the default value be set to `org.apache.spark.network.shuffle.NoOpMergedShuffleFileManager` ### Why are the changes needed? Passing `$` for the config value can be an issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified existing unit tests. Closes #33688 from venkata91/SPARK-36460. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit `df0de83c46`) Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-10 10:19:41 +08:00
Venkata krishnan Sowrirajan	38dd42a50c	[SPARK-36332][SHUFFLE] Cleanup RemoteBlockPushResolver log messages ### What changes were proposed in this pull request? Cleanup `RemoteBlockPushResolver` log messages by using `AppShufflePartitionInfo#toString()` to avoid duplications. Currently this is based off of https://github.com/apache/spark/pull/33034 will remove those changes once it is merged and remove the WIP at that time. ### Why are the changes needed? Minor cleanup to make code more readable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests, just changing log messages Closes #33561 from venkata91/SPARK-36332. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit `ab897109a3`) Signed-off-by: yi.wu <yi.wu@databricks.com>	2021-08-10 09:54:54 +08:00
Huaxin Gao	10f7f6e62b	[SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 ### What changes were proposed in this pull request? not push down partition filter to `ORCScan` for DSv2 ### Why are the changes needed? Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1 ``` == Physical Plan == (1) Filter (isnotnull(value#19) AND NOT (value#19 = a)) +- (1) ColumnarToRow +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string> ``` Also, we don't push down partition filter for parquet in DSv2. https://github.com/apache/spark/pull/30652 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suites Closes #33680 from huaxingao/orc_filter. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `b04330cd38`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-09 10:47:23 -07:00
Jungtaek Lim	9dc8d0c3ae	[SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState ### What changes were proposed in this pull request? This PR proposes to add a new example of complex sessionization, which leverages flatMapGroupsWithState. ### Why are the changes needed? We have replaced an example of sessionization from flatMapGroupsWithState to native support of session window. Given there are still use cases on sessionization which native support of session window cannot cover, it would be nice if we can demonstrate such case. It will also be used as an example of flatMapGroupsWithState. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Example data is given in class doc. Closes #33681 from HeartSaVioR/SPARK-36455. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit `004b87d9a1`) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-08-09 19:40:37 +09:00
Wenchen Fan	5bddafe3e0	[SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE ### What changes were proposed in this pull request? This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172 Before #33172 , the target size is adaptively calculated based on the default parallelism of the spark cluster. Sometimes it's very small and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs. After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable. This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster. ### Why are the changes needed? fix perf regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33655 from cloud-fan/minor. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `9a539d5846`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-09 17:26:11 +08:00
Angerszhuuuu	2a46bf6a3b	[SPARK-36271][SQL] Unify V1 insert check field name before prepare writter ### What changes were proposed in this pull request? Unify DataSource V1 insert schema check field name before prepare writer. And in this PR we add check for avro V1 insert too. ### Why are the changes needed? Unify code and add check for avro V1 insert too. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33566 from AngersZhuuuu/SPARK-36271. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `f3e079b09b`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-09 17:18:20 +08:00
Angerszhuuuu	a5ecf2a490	[SPARK-36352][SQL] Spark should check result plan's output schema name ### What changes were proposed in this pull request? Spark should check result plan's output schema name ### Why are the changes needed? In current code, some optimizer rule may change plan's output schema, since in the code we always use semantic equal to check output, but it may change the plan's output schema. For example, for SchemaPruning, if we have a plan ``` Project[a, B] \|--Scan[A, b, c] ``` the origin output schema is `a, B`, after SchemaPruning. it become ``` Project[A, b] \|--Scan[A, b] ``` It change the plan's schema. when we use CTAS, the schema is same as query plan's output. Then since we change the schema, it not consistent with origin SQL. So we need to check final result plan's schema with origin plan's schema ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existed UT Closes #33583 from AngersZhuuuu/SPARK-36352. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `e051a540a1`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-09 16:48:10 +08:00
Wenchen Fan	94dc3c77c2	[SPARK-35881][SQL][FOLLOWUP] Add a boolean flag in AdaptiveSparkPlanExec to ask for columnar output ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/33140 to propose a simpler idea for integrating columnar execution into AQE. Instead of making the `ColumnarToRowExec` and `RowToColumnarExec` dynamic to handle `AdaptiveSparkPlanExec`, it's simpler to let the consumer decide if it needs columnar output or not, and pass a boolean flag to `AdaptiveSparkPlanExec`. For Spark vendors, they can set the flag differently in their custom columnar parquet writing command when the input plan is `AdaptiveSparkPlanExec`. One argument is if we need to look at the final plan of AQE and consume the data differently (either row or columnar format). I can't think of a use case and I think we can always statically know if the AQE plan should output row or columnar data. ### Why are the changes needed? code simplification. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manual test Closes #33624 from cloud-fan/aqe. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `8714eefe6f`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-09 16:34:10 +08:00
Sajith Ariyarathna	f0844f70b1	[SPARK-36432][BUILD] Upgrade Jetty version to 9.4.43 ### What changes were proposed in this pull request? This PR upgrades Jetty version to `9.4.43.v20210629`. ### Why are the changes needed? To address vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429 which affects Jetty `9.4.42.v20210604`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI Closes #33656 from this/upgrade-jetty-9.4.43. Lead-authored-by: Sajith Ariyarathna <sajith.janaprasad@gmail.com> Co-authored-by: Sajith Ariyarathna <this@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `5a22f9ceaf`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-09 10:14:16 +09:00
Min Shen	552a332dd4	[SPARK-36423][SHUFFLE] Randomize order of blocks in a push request to improve block merge ratio for push-based shuffle ### What changes were proposed in this pull request? On the client side, we are currently randomizing the order of push requests before processing each request. In addition we can further randomize the order of blocks within each push request before pushing them. In our benchmark, this has resulted in a 60%-70% reduction of blocks that fail to be merged due to bock collision (the existing block merge ratio is already pretty good in general, and this further improves it). ### Why are the changes needed? Improve block merge ratio for push-based shuffle ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Straightforward small change, no additional test needed. Closes #33649 from Victsm/SPARK-36423. Lead-authored-by: Min Shen <mshen@linkedin.com> Co-authored-by: Min Shen <victor.nju@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit `6e729515fd`) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>	2021-08-06 09:48:31 -05:00
Yuto Akutsu	a5d0eafa32	[SPARK-595][DOCS] Add local-cluster mode option in Documentation ### What changes were proposed in this pull request? Add local-cluster mode option to submitting-applications.md ### Why are the changes needed? Help users to find/use this option for unit tests. ### Does this PR introduce _any_ user-facing change? Yes, docs changed. ### How was this patch tested? `SKIP_API=1 bundle exec jekyll build` <img width="460" alt="docchange" src="https://user-images.githubusercontent.com/87687356/127125380-6beb4601-7cf4-4876-b2c6-459454ce2a02.png"> Closes #33537 from yutoacts/SPARK-595. Lead-authored-by: Yuto Akutsu <yuto.akutsu@jp.nttdata.com> Co-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com> Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit `41b011e416`) Signed-off-by: Thomas Graves <tgraves@apache.org>	2021-08-06 09:27:24 -05:00
Kousuke Saruta	586eb5d4c6	Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported" ### What changes were proposed in this pull request? This PR reverts the change in SPARK-36429 (#33654). See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037). ### Why are the changes needed? To recover CIs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #33670 from sarutak/revert-SPARK-36429. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit `e17612d0bf`) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-08-06 20:56:40 +09:00
gaoyajun02	33e4ce562a	[SPARK-36339][SQL] References to grouping that not part of aggregation should be replaced ### What changes were proposed in this pull request? Currently, references to grouping sets are reported as errors after aggregated expressions, e.g. ``` SELECT count(name) c, name FROM VALUES ('Alice'), ('Bob') people(name) GROUP BY name GROUPING SETS(name); ``` Error in query: expression 'people.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; ### Why are the changes needed? Fix the map anonymous function in the constructAggregateExprs function does not use underscores to avoid ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #33574 from gaoyajun02/SPARK-36339. Lead-authored-by: gaoyajun02 <gaoyajun02@gmail.com> Co-authored-by: gaoyajun02 <gaoyajun02@meituan.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `888f8f03c8`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-06 16:35:01 +08:00
Kousuke Saruta	f3761bdb76	[SPARK-36429][SQL][FOLLOWUP] Update a golden file to comply with the change in SPARK-36429 ### What changes were proposed in this pull request? This PR updates a golden to comply with the change in SPARK-36429 (#33654). ### Why are the changes needed? To recover GA failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #33663 from sarutak/followup-SPARK-36429. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `63c7d1847d`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-06 15:21:29 +08:00
gengjiaan	be19270880	[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported ### What changes were proposed in this pull request? Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`. ``` -- !query select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>> -- !query output {"t":null} ``` ``` -- !query select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<> -- !query output java.lang.Exception Unsupported type: timestamp_ntz ``` We should make `from_json` throws exception too. This PR fix the discussion below https://github.com/apache/spark/pull/33640#discussion_r682862523 ### Why are the changes needed? Make the behavior of `from_json` more reasonable. ### Does this PR introduce _any_ user-facing change? 'Yes'. from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ. ### How was this patch tested? Tests updated. Closes #33654 from beliefer/SPARK-36429. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-06 14:01:13 +08:00
Wenchen Fan	f719e9c200	[SPARK-36409][SQL][TESTS] Splitting test cases from datetime.sql ### What changes were proposed in this pull request? Currently `datetime.sql` contains a lot of tests and will be run 3 times: default mode, ansi mode, ntz mode. It wastes the test time and also large test files are hard to read. This PR proposes to split it into smaller ones: 1. `date.sql`, which contains date literals, functions and operations. It will be run twice with default and ansi mode. 2. `timestamp.sql`, which contains timestamp (no ltz or ntz suffix) literals, functions and operations. It will be run 4 times: default mode + ans off, defaul mode + ansi on, ntz mode + ansi off, ntz mode + ansi on. 3. `datetime_special.sql`, which create datetime values whose year is outside of [0, 9999]. This is a separated file as JDBC doesn't support them and need to ignore this test file. It will be run 4 times as well. 4. `timestamp_ltz.sql`, which contains timestamp_ltz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that, operations with ltz are tested by `timestamp.sql` 5. `timestamp_ntz.sql`, which contains timestamp_ntz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that, operations with ntz are tested by `timestamp.sql` ### Why are the changes needed? reduce test run time. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #33640 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-08-06 12:55:31 +08:00
Gengliang Wang	87291dced1	[SPARK-36415][SQL][DOCS] Add docs for try_cast/try_add/try_divide ### What changes were proposed in this pull request? Add documentation for new functions try_cast/try_add/try_divide ### Why are the changes needed? Better documentation. These new functions are useful when migrating to the ANSI dialect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build docs and preview: ![image](https://user-images.githubusercontent.com/1097932/128209312-34a6cc6a-a73d-4aed-8646-22b1cb7ce702.png) Closes #33638 from gengliangwang/addDocForTry. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8a35243fa7`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-06 12:33:19 +09:00
Liang-Chi Hsieh	bcf2169bed	[SPARK-36393][BUILD][FOLLOW-UP] Try to raise memory for GHA ### What changes were proposed in this pull request? As followup to raise memory for two places forgotten. ### Why are the changes needed? Raise memory for GHA. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #33658 from viirya/increasing-mem-ga-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `cd070f1b9c`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-08-05 20:09:40 -07:00
Kent Yao	0bb88c99f7	[SPARK-36421][SQL][DOCS] Use ConfigEntry.key to fix docs and set command results ### What changes were proposed in this pull request? This PR fixes the issue that `ConfigEntry` to be introduced to the doc field directly without calling `.key`, which causes malformed documents on the web site and in the result of `SET -v` 1. https://spark.apache.org/docs/3.1.2/configuration.html#static-sql-configuration - spark.sql.hive.metastore.jars 2. set -v ![image](https://user-images.githubusercontent.com/8326978/128292412-85100f95-24fd-4b40-a14f-d31a256dab7d.png) ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no, but contains doc fix ### How was this patch tested? new tests Closes #33647 from yaooqinn/SPARK-36421. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `c7fa3c9090`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-06 11:01:54 +09:00
Kousuke Saruta	efd33b72ca	[SPARK-36441][INFRA] Fix GA failure related to downloading lintr dependencies ### What changes were proposed in this pull request? This PR fixes a GA failure which is related to downloading lintr dependencies. ``` * installing source package ‘devtools’ ... package ‘devtools’ successfully unpacked and MD5 sums checked using staged installation R inst byte-compile and prepare package for lazy loading help * installing help indices * copying figures building package indices installing vignettes testing if installed package can be loaded from temporary location testing if installed package can be loaded from final location ** testing if installed package keeps a record of temporary installation path * DONE (devtools) The downloaded source packages are in ‘/tmp/Rtmpv53Ix4/downloaded_packages’ Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT` Error: Failed to install 'unknown package' from GitHub: HTTP error 401. Bad credentials ``` I re-triggered the GA job but it still fail with the same error. https://github.com/apache/spark/runs/3257853825 The issue seems to happen when downloading lintr dependencies from GitHub. So, the solution is to change the way to download them. ### Why are the changes needed? To recover GA. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #33660 from sarutak/fix-r-package-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `e13bd586f1`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-06 10:49:37 +09:00
Kent Yao	1785ead733	[SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE ### What changes were proposed in this pull request? This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout does not record accurately for the BroadcastQueryStageExec only, but also including the time waiting for being scheduled. If all the resources are currently being occupied for materializing other stages, it timeouts without a chance to run actually. ![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png) The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example, above, the timeout we used for it was 1800s, and obviously, it needed 3x more or something ### Why are the changes needed? AQE is default now, we can make it more stable with this PR ### Does this PR introduce _any_ user-facing change? yes, broadcast timeout now is not used for AQE ### How was this patch tested? modified test Closes #33636 from yaooqinn/SPARK-36414. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `0c94e47aec`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-05 21:15:48 +08:00
Angerszhuuuu	bf4edb5f5a	[SPARK-36353][SQL] RemoveNoopOperators should keep output schema ### What changes were proposed in this pull request? RemoveNoopOperators should keep output schema ### Why are the changes needed? Expand function ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #33587 from AngersZhuuuu/SPARK-36355. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `02810eecbf`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-05 20:43:48 +08:00
Liang-Chi Hsieh	712c311736	[SPARK-36393][BUILD] Try to raise memory for GHA ### What changes were proposed in this pull request? According to the feedback from GitHub, the change causing memory issue has been rolled back. We can try to raise memory again for GA. ### Why are the changes needed? Trying higher memory settings for GA. It could speed up the testing time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #33623 from viirya/increasing-mem-ga. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `7d13ac177b`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-08-05 01:31:45 -07:00
Angerszhuuuu	b4c065b7db	[SPARK-36391][SHUFFLE] When state is remove will throw NPE, and we should improve the error message ### What changes were proposed in this pull request? When channel terminated will call `connectionTerminated` and remove corresponding StreamState, then all coming request on this StreamState will throw NPE like ``` 2021-07-31 22:00:24,810 ERROR server.ChunkFetchRequestHandler (ChunkFetchRequestHandler.java:lambda$respond$1(146)) - Error sending result ChunkFetchFailure[streamChunkId=StreamChunkId[streamId=1119950114515,chunkIndex=0],errorString=java.lang.NullPointerException at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:80) at org.apache.spark.network.server.ChunkFetchRequestHandler.processFetchRequest(ChunkFetchRequestHandler.java:101) at org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:82) at org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51) at org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370) at org.sparkproject.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at org.sparkproject.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.sparkproject.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.sparkproject.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) ] to /ip:53818; closing connection java.nio.channels.ClosedChannelException at org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865) at org.sparkproject.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702) at org.sparkproject.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:112) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702) at org.sparkproject.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104) at org.sparkproject.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at org.sparkproject.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.sparkproject.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.sparkproject.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) ``` Since JVM will not show stack of NPE exception if it happen many times. ``` 021-07-28 08:25:44,720 ERROR server.ChunkFetchRequestHandler (ChunkFetchRequestHandler.java:lambda$respond$1(146)) - Error sending result ChunkFetchFailure[streamChunkId=StreamChunkId[streamId=1187623335353,chunkIndex=11],errorString=java.lang.NullPoint erException ] to /10.130.10.5:42148; closing connection java.nio.channels.ClosedChannelException ``` Makes user confused. We should improved this error message? ### Why are the changes needed? Improve error message ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #33622 from AngersZhuuuu/SPARK-36391. Lead-authored-by: Angerszhuuuu <angers.zhu@gmaihu@gmail.com> Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit `b377ea26e2`) Signed-off-by: yi.wu <yi.wu@databricks.com>	2021-08-05 15:32:05 +08:00
yi.wu	cc2a5abf7d	[SPARK-36384][CORE][DOC] Add doc for shuffle checksum ### What changes were proposed in this pull request? Add doc for the shuffle checksum configs in `configuration.md`. ### Why are the changes needed? doc ### Does this PR introduce _any_ user-facing change? No, since Spark 3.2 hasn't been released. ### How was this patch tested? Pass existed tests. Closes #33637 from Ngone51/SPARK-36384. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `3b92c721b5`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-05 10:16:53 +09:00
Kousuke Saruta	070169ea41	[SPARK-36068][BUILD][TEST] No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly ### What changes were proposed in this pull request? This PR fixes an issue that no tests in `hadoop-cloud` are compiled and run unless `hadoop-3.2` profile is activated explicitly. The root cause seems similar to SPARK-36067 (#33276) so the solution is to activate `hadoop-3.2` profile in `hadoop-cloud/pom.xml` by default. This PR introduced an empty profile for `hadoop-2.7`. Without this, building with `hadoop-2.7` fails. ### Why are the changes needed? `hadoop-3.2` profile should be activated by default so tests in `hadoop-cloud` also should be compiled and run without activating `hadoop-3.2` profile explicitly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed tests in `hadoop-cloud` ran with both SBT and Maven. ``` build/sbt -Phadoop-cloud "hadoop-cloud/test" ... [info] CommitterBindingSuite: [info] - BindingParquetOutputCommitter binds to the inner committer (258 milliseconds) [info] - committer protocol can be serialized and deserialized (11 milliseconds) [info] - local filesystem instantiation (3 milliseconds) [info] - reject dynamic partitioning (1 millisecond) [info] Run completed in 1 second, 234 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. build/mvn -Phadoop-cloud -pl hadoop-cloud test ... CommitterBindingSuite: - BindingParquetOutputCommitter binds to the inner committer - committer protocol can be serialized and deserialized - local filesystem instantiation - reject dynamic partitioning Run completed in 560 milliseconds. Total number of tests run: 4 Suites: completed 2, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` I also confirmed building with `-Phadoop-2.7` successfully finishes with both SBT and Maven. ``` build/sbt -Phadoop-cloud -Phadoop-2.7 "hadoop-cloud/Test/compile" build/mvn -Phadoop-cloud -Phadoop-2.7 -pl hadoop-cloud testCompile ``` Closes #33277 from sarutak/fix-hadoop-3.2-cloud. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `0f5c3a4fd6`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-05 09:39:37 +09:00
Dongjoon Hyun	7d685dfd6f	[SPARK-36354][CORE] EventLogFileReader should skip rolling event log directories with no logs ### What changes were proposed in this pull request? This PR aims to skip rolling event log directories which has only `appstatus` file. ### Why are the changes needed? Currently, Spark History server shows `IllegalArgumentException` warning, but the event log might arrive later. The situation also can happen when the job is killed before uploading its first log to the remote storages like S3. ``` 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771 java.lang.IllegalArgumentException: requirement failed: Log directory must contain at least one event log file! ... at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216) ``` ### Does this PR introduce _any_ user-facing change? Yes. Users will not see `IllegalArgumentException` warnings. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #33586 from dongjoon-hyun/SPARK-36354. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit `28a2a2238f`) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-08-04 20:26:24 +09:00
Venkata krishnan Sowrirajan	0a98e51b84	[SPARK-32923][FOLLOW-UP] Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received ### What changes were proposed in this pull request? Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received when no blocks pushed for the corresponding shuffleMergeId. This is identified as part of https://github.com/apache/spark/pull/33034#discussion_r680610872. ### Why are the changes needed? Without this change, older shuffleMergeId files won't be cleaned up properly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added changes to existing unit test to address this case. Closes #33605 from venkata91/SPARK-32923-follow-on. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit `d8169493b6`) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>	2021-08-04 03:30:56 -05:00
itholic	f7ab2bfc8c	[SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io ### What changes were proposed in this pull request? This PR is followup for https://github.com/apache/spark/pull/32964, to improve the warning message. ### Why are the changes needed? To improve the warning message. ### Does this PR introduce _any_ user-facing change? The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead." ### How was this patch tested? Manually run `dev/lint-python` Closes #33631 from itholic/SPARK-35811-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `3d72c20e64`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-04 16:20:37 +09:00
Kousuke Saruta	6e8187b7b5	[MINOR][DOC] Remove obsolete `contributing-to-spark.md` ### What changes were proposed in this pull request? This PR removes obsolete `contributing-to-spark.md` which is not referenced from anywhere. ### Why are the changes needed? Just clean up. ### Does this PR introduce _any_ user-facing change? No. Users can't have access to contributing-to-spark.html unless they directly point to the URL. ### How was this patch tested? Built the document and confirmed that this change doesn't affect the result. Closes #33619 from sarutak/remove-obsolete-contribution-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `c31b653806`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-04 10:19:32 +09:00
PengLei	8c42232638	[SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table ### What changes were proposed in this pull request? Add the Resolver to `checkColumnNotExists` to check name exist in case sensitive. ### Why are the changes needed? At now the resolver is `_ == _` of `findNestedField` called by `checkColumnNotExists` Add `alter.conf.resolver` to it. [SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut tests Closes #33618 from Peng-Lei/sensitive-cloumn-name. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `87d49cbcb1`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-04 10:04:25 +09:00
Max Gekk	bd33408b4b	[SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources ### What changes were proposed in this pull request? In the PR, I propose to ban `YearMonthIntervalType` and `DayTimeIntervalType` at the analysis phase while creating a table using a built-in filed-based datasource or writing a dataset to such datasource. In particular, add the following case: ```scala case _: DayTimeIntervalType \| _: YearMonthIntervalType => false ``` to all methods that override either: - V2 `FileTable.supportsDataType()` - V1 `FileFormat.supportDataType()` ### Why are the changes needed? To improve user experience with Spark SQL, and output a proper error message at the analysis phase. ### Does this PR introduce _any_ user-facing change? Yes but ANSI interval types haven't released yet. So, for users this is new behavior. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite" ``` Closes #33580 from MaxGekk/interval-ban-in-ds. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `67cbc93263`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-03 20:30:33 +03:00
yi.wu	144040ab28	[SPARK-36383][CORE][3.2] Avoid NullPointerException during executor shutdown ### What changes were proposed in this pull request? Fix `NullPointerException` in `Executor.stop()`. ### Why are the changes needed? Some initialization steps could fail before the initialization of `metricsPoller`, `heartbeater`, `threadPool`, which results in the null of `metricsPoller`, `heartbeater`, `threadPool`. For example, I encountered a failure of: `c20af53580/core/src/main/scala/org/apache/spark/executor/Executor.scala (L137)` where the executor itself failed to register at the driver. This PR helps to eliminate the error messages when the issue happens to not confuse users: <details> <summary><mark><font color=darkred>[click to see the detailed error message]</font></mark></summary> <pre> 21/07/23 16:04:10 WARN Executor: Unable to stop executor metrics poller java.lang.NullPointerException at org.apache.spark.executor.Executor.stop(Executor.scala:318) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/07/23 16:04:10 WARN Executor: Unable to stop heartbeater java.lang.NullPointerException at org.apache.spark.executor.Executor.stop(Executor.scala:324) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/07/23 16:04:10 ERROR Utils: Uncaught exception in thread shutdown-hook-0 java.lang.NullPointerException at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:334) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231) at org.apache.spark.executor.Executor.stop(Executor.scala:334) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) </pre> </details> ### Does this PR introduce _any_ user-facing change? Yes, users won't see error messages of `NullPointerException` after this fix. ### How was this patch tested? Pass existing tests. Closes #33620 from Ngone51/spark-36383-3.2. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-03 08:50:11 -07:00
Kousuke Saruta	cc75618e87	[SPARK-35815][SQL][FOLLOWUP] Add test considering the case spark.sql.legacy.interval.enabled is true ### What changes were proposed in this pull request? This PR adds test considering the case `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815. ### Why are the changes needed? SPARK-35815 (#33456) changes `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold` but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`. We can't detect this issue because there is no test which considers the legacy interval type at that time. In SPARK-36323 (#33551), this issue was resolved but it's better to add test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33606 from sarutak/test-watermark-with-legacy-interval. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `92cdb17d1a`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-03 13:48:53 +03:00
Wenchen Fan	8d817dcf30	[SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it breaks the distribution requirement ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30494 This PR proposes a new way to optimize the final query stage in AQE. We first collect the effective user-specified repartition (semantic-wise, user-specified repartition is only effective if it's the root node or under a few simple nodes), and get the required distribution for the final plan. When we optimize the final query stage, we skip certain `AQEShuffleReadRule` if it breaks the required distribution. ### Why are the changes needed? The current solution for optimizing the final query stage is pretty hacky and overkill. As an example, the newly added rule `OptimizeSkewInRebalancePartitions` can hardly apply as it's very common that the query plan has shuffles with origin `ENSURE_REQUIREMENTS`, which is not supported by `OptimizeSkewInRebalancePartitions`. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes #33541 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `dd80457ffb`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-03 18:29:07 +08:00
Wenchen Fan	7c586842d7	[SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN ### What changes were proposed in this pull request? This a followup of the recent work such as https://github.com/apache/spark/pull/33200 For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule in `ALTER TABE ... COLUMN` commands. This PR also moves these AlterTable commands to a individual file and give them a base trait. ### Why are the changes needed? name simplification ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test Closes #33609 from cloud-fan/dsv2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit `7cb9c1c241`) Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-08-03 10:43:15 +03:00
Xinrong Meng	c22a25b76a	[SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists ### What changes were proposed in this pull request? Better error messages for DataTypeOps against lists. ### Why are the changes needed? Currently, DataTypeOps against lists throw a Py4JJavaError, we shall throw a TypeError with proper messages instead. ### Does this PR introduce _any_ user-facing change? Yes. A TypeError message will be showed rather than a Py4JJavaError. From: ```py >>> import pyspark.pandas as ps >>> ps.Series([1, 2, 3]) > [3, 2, 1] Traceback (most recent call last): ... py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt. : java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1] ... ``` To: ```py >>> import pyspark.pandas as ps >>> ps.Series([1, 2, 3]) > [3, 2, 1] Traceback (most recent call last): ... TypeError: The operation can not be applied to list. ``` ### How was this patch tested? Unit tests. Closes #33581 from xinrong-databricks/data_type_ops_list. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8ca11fe39f`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-03 16:25:59 +09:00
Chandni Singh	369781a331	[SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative mapId in ShuffleBlockId ### What changes were proposed in this pull request? With SPARK-32922, we added a change that ShuffleBlockId can have a negative mapId. This was to support push-based shuffle where -1 as mapId indicated a push-merged block. However with SPARK-32923, a different type of BlockId was introduced - ShuffleMergedId, but reverting the change to ShuffleBlockId was missed. ### Why are the changes needed? This reverts the changes to `ShuffleBlockId` which will never have a negative mapId. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified the unit test to verify the newly added ShuffleMergedBlockId. Closes #33616 from otterc/SPARK-36389. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `2712343a27`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-03 00:18:06 -07:00
Takuya UESHIN	71e4c56974	[SPARK-36367][3.2][PYTHON] Partially backport to avoid unexpected error with pandas 1.3 ### What changes were proposed in this pull request? Partially backport from #33598 to avoid unexpected error caused by pandas 1.3. ### Why are the changes needed? If uses tries to use pandas 1.3 as the underlying pandas, it will raise unexpected errors caused by removed APIs or behavior change. Note that pandas API on Spark 3.2 will still follow the pandas 1.2 behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33614 from ueshin/issues/SPARK-36367/3.2/partially_backport. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-03 14:03:35 +09:00
Karen Feng	de837a0ebb	[SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines ### What changes were proposed in this pull request? Adds ANSI/ISO SQLSTATE standards to the error guidelines. ### Why are the changes needed? Provides visibility and consistency to the SQLSTATEs assigned to error classes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed; docs only Closes #33560 from karenfeng/sqlstate-manual. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `63517eb430`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-03 13:58:07 +09:00
Hyukjin Kwon	9eec11b956	[SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode This PR proposes to fail properly so JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s as are, the root `InternalRow`s became `null`s, and it causes the query fails even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`. Note that this is consistent with non-array JSON input: Permissive mode: ```scala spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect() ``` ``` res0: Array[org.apache.spark.sql.Row] = Array([str], [null]) ``` Failfast mode: ```scala spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect() ``` ``` org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70) at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ``` To make the permissive mode to proceed and parse without throwing an exception. Permissive mode: ```scala spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect() ``` Before: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) ``` After: ``` res0: Array[org.apache.spark.sql.Row] = Array([null]) ``` NOTE that this behaviour is consistent when JSON object is malformed: ```scala spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect() ``` ``` res0: Array[org.apache.spark.sql.Row] = Array([null]) ``` Since we're parsing _one_ JSON array, related records all fail together. Failfast mode: ```scala spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect() ``` Before: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) ``` After: ``` org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70) at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ``` Manually tested, and unit test was added. Closes #33608 from HyukjinKwon/SPARK-36379. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `0bbcbc6508`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-02 10:02:09 -07:00
Angerszhuuuu	ea559adc2e	[SPARK-36086][SQL] CollapseProject project replace alias should use origin column name ### What changes were proposed in this pull request? For added UT, without this patch will failed as below ``` [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name * FAILED * (4 seconds, 935 milliseconds) [info] java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) [info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) [info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) [info] at scala.collection.immutable.List.foldLeft(List.scala:91) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) [info] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) ``` CollapseProject project replace alias should use origin column name ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33576 from AngersZhuuuu/SPARK-36086. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `f3173956cb`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-03 00:08:30 +08:00
Linhong Liu	e26cb968bd	[SPARK-36224][SQL] Use Void as the type name of NullType ### What changes were proposed in this pull request? Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType` ### Why are the changes needed? This PR is intended to address the type name discussion in PR #28833. Here are the reasons: 1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name 2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL 3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType. (i.e. make `from_json(col, schema.toDDL)`) work ### Does this PR introduce _any_ user-facing change? Yes, the type name of "NULL" is changed from "null" to "void". for example: ``` scala> sql("select null as a, 1 as b").schema.catalogString res5: String = struct<a:void,b:int> ``` ### How was this patch tested? existing test cases Closes #33437 from linhongliu-db/SPARK-36224-void-type-name. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `2f700773c2`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-08-02 23:20:11 +08:00
yi.wu	df43300227	[SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: The shuffler reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by the disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by the network issue. Otherwise, the checksum verifies pass. In any case of the error, the cause remains unknown. After the shuffle reader receives the diagnosis response, it'd take the action bases on the type of cause. Only in case of the network issue, we'd give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will throw the fetch failure directly this time without the diagnosis. Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually reports corruption issue. However, data corruption is difficult to reproduce in most cases and even harder to tell the root cause. We don't know if it's a Spark issue or not. With the diagnosis support for the shuffle corruption, Spark itself can at least distinguish the cause between disk and network, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may know the cause of the shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit `a98d919da4`) Signed-off-by: Mridul Muralidharan <mridulatgmail.com>	2021-08-02 09:59:30 -05:00

1 2 3 4 5 ...

30918 commits