### What changes were proposed in this pull request?
In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers:
- DataType.fromJson (except CSV)
- CatalystSqlParser.parseDataType
- CatalystSqlParser.parseTableSchema
Before the changes, `from_json` did not show error messages from the first parser in the chain, which could mislead users.
### Why are the changes needed?
Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example:
```scala
val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show()
```
The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out:
```
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '{' expecting {'ADD', 'AFTER', ... }(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```
### Does this PR introduce _any_ user-facing change?
Yes, after the changes for the example above:
```
Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field.
Failed fallback parsing: Cannot parse the data type:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
Failed fallback parsing:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```
### How was this patch tested?
- By existing test suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`.
- Added a new test to `JsonFunctionsSuite`.
- Regenerated results for `json-functions.sql`.
Closes#30183 from MaxGekk/fromDDL-error-msg.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In ANSI mode, when a division by zero occurs performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.
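A minimal sketch of the intended behavior, assuming ANSI mode is enabled via `spark.sql.ansi.enabled` and a running `spark` session (exact error messages are not shown):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT 3 / 0").show()       // Divide: expected to throw instead of returning a wrong value
spark.sql("SELECT 3 div 0").show()     // IntegralDivide
spark.sql("SELECT 3 % 0").show()       // Remainder
spark.sql("SELECT pmod(3, 0)").show()  // Pmod
```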
### Why are the changes needed?
Without the change, the results are corrupt.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a UT and improved existing UTs.
Closes#29882 from luluorta/SPARK-33008.
Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to make StateStore compression codec configurable.
### Why are the changes needed?
Currently the compression codec of StateStore is not configurable and is hard-coded to lz4. It is better if we can follow other Spark modules and make the compression codec of StateStore configurable. For example, we could choose the zstd codec, and zstd is configurable with different compression levels.
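A hedged sketch of how this could look once configurable; the exact config key below is an assumption, not confirmed by this description:
```scala
// Hypothetical config key - the actual name may differ in the implementation.
spark.conf.set("spark.sql.streaming.stateStore.compression.codec", "zstd")
```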
### Does this PR introduce _any_ user-facing change?
Yes, after this change users can configure a different codec for StateStore.
### How was this patch tested?
Unit test.
Closes#30162 from viirya/SPARK-33263.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.
### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.
### Does this PR introduce _any_ user-facing change?
Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the schema format shouldn't matter much.
Before:
```
> SELECT schema_of_csv('1,abc');
struct<_c0:int,_c1:string>
```
After:
```
> SELECT schema_of_csv('1,abc');
STRUCT<`_c0`: INT, `_c1`: STRING>
```
### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.
Closes#30180 from MaxGekk/schema_of_csv-sql-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.
### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.
Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```
### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.
Before: `struct<col:bigint>`
After: ``STRUCT<`col`: BIGINT>``
### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.
Closes#30172 from MaxGekk/schema_of_json-sql-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the NPE issue in the `In` filter when one of the values is null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which leverages the hash code of the key, hence it could trigger the NPE issue.
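A hedged reproduction sketch; `v2_table` and `col` are placeholders for a table backed by a V2 source that supports filter push-down:
```scala
// Before this PR, pushing down the IN filter containing NULL could hit the NPE
// while DataSourceStrategy hashes the translated filter.
spark.sql("SELECT * FROM v2_table WHERE col IN (1, 2, NULL)").show()
```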
### Why are the changes needed?
This is an obvious bug as `In` filter doesn't care about null value when calculating hash code.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, a query with `null` in an "in" condition against a data source V2 table supporting filter push-down failed with an NPE, whereas after the PR the query will not fail.
### How was this patch tested?
UT added. The new UT fails without the PR and passes with the PR.
Closes#30170 from HeartSaVioR/SPARK-33267.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update the documentation of SparkSession#sql, mentioning that this API eagerly runs DDL/DML commands, but not SELECT queries.
### Why are the changes needed?
To clarify the behavior of SparkSession#sql.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes#30168 from waitinfuture/master.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark SQL supports some window functions like `NTH_VALUE`.
If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations.
For example, if we execute the SQL shown below:
```
SELECT NTH_VALUE(col, 2) OVER (
    ORDER BY rank
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
```
For rows whose row number is greater than 1, the output is a fixed value; otherwise, it is null. So we only need to calculate the value once and check whether the row number is less than 2.
The `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` frame is even simpler.
### Why are the changes needed?
Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test.
Closes#29800 from beliefer/optimize-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.
### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
```
Sort(orders, global = True, ...)
Sort(orders, global = False, ...)
```
Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Unit tests
Closes#30093 from allisonwang-db/fix-sort.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("DROP TABLE t") // 't' is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`, consistent with v1 behavior.
### How was this patch tested?
Added a new test
Closes#30079 from imback82/drop_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to change the behavior on failing fast when Spark fails to instantiate configured v2 session catalog.
### Why are the changes needed?
The Spark behavior is against the intention of end users: if end users configure a session catalog which Spark fails to initialize, Spark swallows the error, only logs the error message, and silently uses the default catalog implementation.
This follows the voices on [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) in dev mailing list.
### Does this PR introduce _any_ user-facing change?
Yes. After the PR Spark will fail immediately if Spark fails to instantiate configured session catalog.
### How was this patch tested?
New UT added.
Closes#30147 from HeartSaVioR/SPARK-33240.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.
```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```
When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.
Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.
The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.
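A minimal plain-Scala sketch of the underlying laziness issue (not the actual Spark code):
```scala
import scala.collection.mutable.ArrayBuffer

val ctx = ArrayBuffer.empty[String]      // stands in for the shared CodegenContext
val ordering: Seq[Int] = Stream(1, 2, 3)

// Like ordering.map(_.child.genCode(ctx)): each step mutates ctx as a side effect.
val codes = ordering.map { i => ctx += s"helper_$i"; s"code_$i" }
println(ctx.size)   // fewer than 3: the tail of the Stream has not been forced yet
codes.toList        // forcing the Stream runs the remaining side effects, too late
println(ctx.size)   // now 3

// The fix described above amounts to forcing the sequence up front when it is a Stream.
```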
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.
Closes#30160 from ankurdave/SPARK-33260.
Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Override the default SQL strings in Postgres Dialect for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Add new docker integration test suite `jdbc/v2/PostgreSQLIntegrationSuite.scala`
### Why are the changes needed?
Supports Postgres-specific ALTER TABLE syntax.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test `PostgreSQLIntegrationSuite`
Closes#30089 from huaxingao/postgres_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Since [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, `SQLConf.get` and `SparkSession.active` are more reliable. We are trying to refine the existing code that passes `SQLConf` and `SparkSession` into subclasses of `Rule[QueryPlan]`.
In this PR (see the sketch after this list):
* remove `SQLConf` from the constructor parameters of all subclasses of `Rule[QueryPlan]`;
* use `SQLConf.get` to replace the original `SQLConf` instance;
* remove `SparkSession` from the constructor parameters of all subclasses of `Rule[QueryPlan]`;
* use `SparkSession.active` to replace the original `SparkSession` instance.
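A hedged sketch of the refactoring pattern (rule names and the conf check are illustrative, not actual Spark rules):
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Before: the SQLConf travels through the constructor.
case class ExampleRuleBefore(sqlConf: SQLConf) extends Rule[LogicalPlan] {
  // No-op transformation; only the conf access pattern matters here.
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (sqlConf.caseSensitiveAnalysis) plan else plan
}

// After: no constructor parameter; the rule reads the active conf via SQLConf.get.
object ExampleRuleAfter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (SQLConf.get.caseSensitiveAnalysis) plan else plan
}
```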
### Why are the changes needed?
Code refinement.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing test
Closes#30097 from leanken/leanken-SPARK-33140.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, in the schema-less case, the result differs from Hive.
This PR keeps the result the same as Hive's script transform with serde.
#### Hive script transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> DESCRIBE v;
key string
value string
hive> SELECT * FROM v;
1 1 1
2 2 2
hive> SELECT key FROM v;
1
2
hive> SELECT value FROM v;
1 1
2 2
```
#### Spark script transform with Hive serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> SELECT * FROM v;
1 1
2 2
```
**No-serde mode in Hive (ROW FORMAT DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)
### Why are the changes needed?
Keep the same behavior as Hive script transform.
### Does this PR introduce _any_ user-facing change?
Before this PR, with Hive serde script transform:
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2 3 4
```
### How was this patch tested?
UT
Closes#29421 from AngersZhuuuu/SPARK-32388.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter match the format JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.
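A hedged sketch of how a committer might pick up the reinstated property on both the job and task side (the helper name is illustrative; the key is the option named in this PR):
```scala
import org.apache.hadoop.conf.Configuration

// Both the job and each task should see the same UUID through their Hadoop configuration.
def jobUuid(hadoopConf: Configuration): String =
  hadoopConf.get("spark.sql.sources.write.jobUUID", "unknown")
```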
### Why are the changes needed?
If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash.
### Does this PR introduce _any_ user-facing change?
Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away"
### How was this patch tested?
Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID
Closes#30141 from steveloughran/SPARK-33230-jobId.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR enables auto bucketed table scan by default, with the exception that it is disabled for cached queries (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregates.
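A hedged sketch of toggling the feature; the exact config key below is an assumption inferred from the feature name, not stated in this description:
```scala
// Hypothetical key - verify against the actual SQLConf entry before relying on it.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```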
### Why are the changes needed?
Enabling auto bucketed table scan by default is useful as it can optimize queries automatically under the hood, without user interaction.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.
Closes#30138 from c21/enable-auto-bucket.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For a left side input row, check if there's a match in the right side state store.
  * If there's a match, output the left side row, but do not put the row in the left side state store (no need to keep it in the state store).
  * If there's no match, output nothing, but put the row in the left side state store (with the "matched" field set to false in the state store).
* For a right side input row, check if there's a match in the left side state store.
  * For all left rows in the state store whose "matched" field is still false, output the rows and set their "matched" field to true. Only the left side rows matched for the first time are output, to guarantee left semi join semantics.
* State store eviction: evict rows from the left/right side state store below the watermark, same as inner join.
Note a followup optimization can be to evict matched left side rows from state store earlier, even when the rows are still above watermark. However this needs more change in `SymmetricHashJoinStateManager`, so will leave this as a followup.
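A hedged usage sketch of the new join type in Structured Streaming; source and column names are illustrative, and the time-range condition is included only because stream-stream joins typically require one for state cleanup:
```scala
import org.apache.spark.sql.functions._

// Two rate sources stand in for impression and click streams.
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "10 minutes")

// Emits each impression at most once, and only if a matching click arrives.
val matchedImpressions = impressions.join(
  clicks,
  expr("""
    adId = clickAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 10 minutes
  """),
  "left_semi")
```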
### Why are the changes needed?
Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166). Internally we do see a lot of users using left semi stream-stream join (outside Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks per ad (left semi semantics).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.
Closes#30076 from c21/stream-join.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This patch adds new UTs to prevent SPARK-29438 for streaming aggregation as well as flatMapGroupsWithState, as we agree about the review comment quote here:
https://github.com/apache/spark/pull/26162#issuecomment-576929692
> LGTM for this PR. But on an additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations?
For now they're not broken, but it's even better to have UTs that prevent the case in the future.
### Why are the changes needed?
New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when they are placed on the right side of a UNION and the number of partitions changes on the left side of the UNION. Please refer to SPARK-29438 for more details.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes#27333 from HeartSaVioR/SPARK-29438-add-regression-test.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
#### Case
The case here covers the behavior of static and dynamic SQL configs in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir`.
The case can be found here: https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala
```scala
import java.lang.reflect.Field
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object WarehouseSCBeforeSS extends App {
val wh = "spark.sql.warehouse.dir"
val td = "spark.sql.globalTempDatabase"
val custom = "spark.sql.custom"
val conf = new SparkConf()
.setMaster("local")
.setAppName("SPARK-32991")
.set(wh, "./data1")
.set(td, "bob")
val sc = new SparkContext(conf)
val spark = SparkSession.builder()
.config(wh, "./data2")
.config(td, "alice")
.config(custom, "kyao")
.getOrCreate()
val confField: Field = spark.sharedState.getClass.getDeclaredField("conf")
confField.setAccessible(true)
private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf]
println()
println(s"=====> SharedState: $wh=${shared.get(wh)}")
println(s"=====> SharedState: $td=${shared.get(td)}")
println(s"=====> SharedState: $custom=${shared.get(custom, "")}")
println(s"=====> SessionState: $wh=${spark.conf.get(wh)}")
println(s"=====> SessionState: $td=${spark.conf.get(td)}")
println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}")
val spark2 = SparkSession.builder().config(td, "fred").getOrCreate()
println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}")
println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}")
println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}")
SparkSession.setActiveSession(spark)
spark.sql("RESET")
println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}")
println(s"=====> SessionState RESET: $td=${spark.conf.get(td)}")
println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}")
val spark3 = SparkSession.builder().getOrCreate()
println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}")
println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}")
println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}")
}
```
#### Outputs and analysis
```
// 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession
//=====> SharedState: spark.sql.warehouse.dir=./data1
// 2. ⏬
//=====> SharedState: spark.sql.globalTempDatabase=alice
//=====> SharedState: spark.sql.custom=kyao
//=====> SessionState: spark.sql.warehouse.dir=./data2
//=====> SessionState: spark.sql.globalTempDatabase=alice
//=====> SessionState: spark.sql.custom=kyao
//=====> SessionState 2: spark.sql.warehouse.dir=./data2
//=====> SessionState 2: spark.sql.globalTempDatabase=alice
//=====> SessionState 2: spark.sql.custom=kyao
// 2'.🔼 OK until here
// 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed
//=====> SessionState RESET: spark.sql.warehouse.dir=./data1
//=====> SessionState RESET: spark.sql.globalTempDatabase=bob
//=====> SessionState RESET: spark.sql.custom=
// 4. Then the SparkSessions created after RESET will be corrected.
//=====> SessionState 3: spark.sql.warehouse.dir=./data1
//=====> SessionState 3: spark.sql.globalTempDatabase=bob
//=====> SessionState 3: spark.sql.custom=
```
In this PR, we gather all valid configs into the cloned conf of `sharedState` while it is being constructed; actually, only `spark.sql.warehouse.dir` was missing. Then we use this conf as the defaults for the `RESET` command.
`SparkSession.clearActiveSession/clearDefaultSession` will make the shared state invisible and unsharable. They will be internal-only soon (confirmed with Wenchen), so cases where they are called will not be a problem.
### Why are the changes needed?
Bug fix for the programmatic API: calling RESET when users create a SparkContext first and configure the SparkSession later.
### Does this PR introduce _any_ user-facing change?
Yes. Before this change, when you use the programmatic API and call RESET, all configs are reset to `SparkContext.conf`; now they are reset to `SparkSession.sharedState.conf`.
### How was this patch tested?
new tests
Closes#30045 from yaooqinn/SPARK-32991.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change handling of the metadata key in read:
- If the key is absent from the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
- If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
- For files written by Spark >= 3.1.0, perform rebasing if `org.apache.spark.legacyINT96` is present in the metadata; otherwise don't.
### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise the impact on other subsystems that rely on file sizes, such as gathering statistics.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test in `ParquetIOSuite`
Closes#30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978.
The main reason for the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of new partitions added by each task, which makes it impossible to deduplicate the data on the driver side.
The main change of this PR is to report the partitionValues to the driver and deduplicate them there, to make sure the dynamic partitions metric is correct.
### Why are the changes needed?
The number of dynamic partitions we display on the UI should be correct.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test case for the scenario described in SPARK-32978.
Closes#30026 from LuciferYang/SPARK-32978.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Override the default SQL strings in the MySQL JDBC dialect, according to the official documentation, for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Write MySQL integration tests for JDBC.
### Why are the changes needed?
Improves code coverage and supports the MySQL dialect for JDBC.
### Does this PR introduce _any_ user-facing change?
Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### How was this patch tested?
Added tests.
Closes#30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`REGEXP_REPLACE` can replace all substrings of a string that match a regexp with a replacement string.
But `REGEXP_REPLACE` lacks some flexibility, such as converting camel-case strings into strings of lower-case words separated by underscores:
AddressLine1 -> address_line_1
If we support the position parameter, we can do it like this (e.g. in Oracle):
```
WITH strings as (
SELECT 'AddressLine1' s FROM dual union all
SELECT 'ZipCode' s FROM dual union all
SELECT 'Country' s FROM dual
)
SELECT s "STRING",
lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
FROM strings;
```
The output:
```
STRING MODIFIED_STRING
-------------------- --------------------
AddressLine1 address_line_1
ZipCode zip_code
Country country
```
Some mainstream databases support this syntax.
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html
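A hedged sketch of what the corresponding call could look like in Spark SQL after this change; the fourth argument is the new position parameter, and the `$1` group reference plus the exact semantics are assumptions to be checked against the PR:
```scala
// Illustrative only: start matching at position 2 so the leading capital stays untouched,
// then lower-case the result to get address_line_1-style output.
spark.sql("SELECT lower(regexp_replace('AddressLine1', '([A-Z0-9])', '_$1', 2))").show()
```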
### Why are the changes needed?
The position parameter for `REGEXP_REPLACE` is very useful.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Jenkins test.
Closes#29891 from beliefer/add-position-for-regex_replace.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.
### Why are the changes needed?
Current default value `LEGACY` may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites like `ParquetIOSuite`.
Closes#30121 from MaxGekk/int96-exception-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|
### Why are the changes needed?
To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
Closes#30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Postgres and MSSQL connection providers are not able to get the custom `appEntry` because under some circumstances the driver is wrapped with `DriverWrapper`. Such a case is not handled in the mentioned providers. In this PR I've added handling for this edge case by passing the unwrapped `Driver` from `JdbcUtils`.
### Why are the changes needed?
`DriverWrapper` is not considered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Closes#30024 from gaborgsomogyi/SPARK-32229.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamps rebasing in saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to a value other than `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about the parquet writer - whether INT96 was written by a Proleptic Gregorian system or some Julian one.
4. Modified Vectorized and Parquet-mr Readers to support loading/saving INT96 timestamps w/o rebasing depending on SQL config and the metadata tag:
- **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
- **No rebasing** if parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
- **With rebasing** if parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
- Rebasing depends on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.
New SQL configs are added instead of re-using existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` because of:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.
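A hedged usage sketch of the write-side config from item 1 above (`df` is any DataFrame with timestamp columns; the path is illustrative):
```scala
// Save INT96 timestamps without rebasing; the metadata tag then guides readers.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
df.write.parquet("/tmp/ts_parquet")
```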
### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.
### Does this PR introduce _any_ user-facing change?
It can when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a non-default value such as `CORRECTED`.
### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`
Closes#30056 from MaxGekk/parquet-rebase-int96.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so the filters are mainly used for explain output in this case. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters with PruneFileSourcePartitions and then PartitionFilters is not shown in explain output.
This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy again, with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c; it will then show the partially pushed-down partition filter in explain().
### Why are the changes needed?
Without the patch, the explained plan is inconsistent with what is actually executed.
<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and Hive tables are as follows, respectively (the pushed-down partition filters are missing):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
+- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```
<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Unit test.
Closes#29831 from CodingCat/SPARK-32351.
Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test to `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes#30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to add a config to control the fast-fail behavior in FileFormatWriter, and sets it to false by default.
### Why are the changes needed?
In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.
Per the latest discussion, it is important to be able to keep the original behavior, which is to retry tasks even when `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases.
We are going to add a config to control this behavior and set it to false by default (i.e., fast-fail disabled).
### Does this PR introduce _any_ user-facing change?
Yes. By default the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. User can control fast-fail behavior by enabling it.
### How was this patch tested?
Unit test.
Closes#30073 from viirya/SPARK-33170.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
SPARK-29358 added support for `unionByName` to work when the two datasets didn't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds the support to work with struct columns.
The behavior before this PR:
```scala
scala> val df1 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2, 'a', id + 3) c1")
scala> val df2 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2) c1")
scala> df1.unionByName(df2, true).printSchema
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c:bigint,b:bigint> <> struct<c:bigint,b:bigint,a:bigint> at the second column of the second table;;
'Union false, false
:- Project [id#0L AS c0#2L, named_struct(c, (id#0L + cast(1 as bigint)), b, (id#0L + cast(2 as bigint)), a, (id#0L + cast(3 as bigint))) AS c1#3]
: +- Range (0, 1, step=1, splits=Some(12))
+- Project [c0#8L, c1#9]
+- Project [id#6L AS c0#8L, named_struct(c, (id#6L + cast(1 as bigint)), b, (id#6L + cast(2 as bigint))) AS c1#9]
+- Range (0, 1, step=1, splits=Some(12))
```
The behavior after this PR:
```scala
scala> df1.unionByName(df2, true).printSchema
root
|-- c0: long (nullable = false)
|-- c1: struct (nullable = false)
| |-- a: long (nullable = true)
| |-- b: long (nullable = false)
| |-- c: long (nullable = false)
scala> df1.unionByName(df2, true).show()
+---+-------------+
| c0| c1|
+---+-------------+
| 0| {3, 2, 1}|
| 0|{ null, 2, 1}|
+---+-------------+
```
### Why are the changes needed?
The `allowMissingColumns` of `unionByName` is a feature allowing merging different schema from two datasets when unioning them together. Nested column support makes the feature more general and flexible for usage.
### Does this PR introduce _any_ user-facing change?
Yes, after this change users can union two datasets whose schemas contain different structs.
### How was this patch tested?
Unit tests.
Closes#29587 from viirya/SPARK-32376.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
The function `binaryToSQLTimestamp()` is used by Parquet Vectorized reader. Parquet MR reader has similar code for de-serialization of INT96 timestamps. In this PR, I propose to de-duplicate code and re-use `binaryToSQLTimestamp()`.
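For context, a hedged plain-Scala sketch of what this de-serialization involves; this follows the widely documented INT96 layout, not the Spark source itself:
```scala
import java.nio.{ByteBuffer, ByteOrder}

// An INT96 timestamp is 12 bytes: 8 bytes of nanos within the day, then 4 bytes of Julian day.
def int96ToMicros(bytes: Array[Byte]): Long = {
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val nanosOfDay = buf.getLong        // first 8 bytes
  val julianDay = buf.getInt          // last 4 bytes
  val julianDayOfEpoch = 2440588L     // Julian day number of 1970-01-01
  val microsPerDay = 86400L * 1000L * 1000L
  (julianDay - julianDayOfEpoch) * microsPerDay + nanosOfDay / 1000L
}
```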
### Why are the changes needed?
This should improve maintainability and help avoid errors when changing the vectorized and regular parquet readers.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites, for instance `ParquetIOSuite`.
Closes#30069 from MaxGekk/int96-common-serde.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to mark ParquetV1FilterSuite and ParquetV2FilterSuite as `ExtendedSQLTest`.
- ParquetV1FilterSuite/ParquetV2FilterSuite
- ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite
### Why are the changes needed?
Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job.
**BEFORE**
- https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes)
**AFTER**
- https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Github Action with the reduced time.
Closes#30068 from dongjoon-hyun/MOVE3.
Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the SQL Execution web page, when `Succeeded Jobs` (or `Failed Jobs`) has many child URL elements, they extend over the edge of the page.
### Why are the changes needed?
To make the page more friendly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test; the result is shown below:
![fixed](https://user-images.githubusercontent.com/52202080/95977319-50734600-0e4b-11eb-93c0-b8deb565bcd8.png)
Closes#30035 from akiyamaneko/sql_execution_job_overflow.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Correct the resolution of having clause.
### Why are the changes needed?
Grouping sets construct a new aggregate that loses the qualified name of the grouping expression. Here is an example:
```
-- Works resolved by `ResolveReferences`
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1
-- Works because of the extra expression c1
select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
-- Failed
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
```
It works for `Aggregate` without grouping sets through `ResolveReferences`, but it does not work for grouping sets since the exprId has been changed.
### Does this PR introduce _any_ user-facing change?
Yes, bug fix.
### How was this patch tested?
add test.
Closes#30029 from ulysses-you/SPARK-33131.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.
Changes of the PR:
* add the legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs;
* by default, if users call these two APIs, an exception is thrown;
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage;
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive.
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes#30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to remove `assert` from `Benchmark` for making it easier to run benchmark codes via `spark-submit`.
### Why are the changes needed?
Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we need to pass the proper jars of `scalatest` and `scalactic`;
- scalatest-core_2.12-3.2.0.jar
- scalatest-compatible-3.2.0.jar
- scalactic_2.12-3.0.jar
```
./bin/spark-submit --jars scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
```
This update can make developers submit benchmark codes without these dependencies;
```
./bin/spark-submit --jars ./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes#30064 from maropu/RemoveDepInBenchmark.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add case sensitivity tests for column resolution in ALTER TABLE
### Why are the changes needed?
To make sure `spark.sql.caseSensitive` works for `ResolveAlterTableChanges`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
new test
Closes#30063 from huaxingao/caseSensitivity.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
- spark.sql.legacy.avro.datetimeRebaseModeInWrite
- spark.sql.legacy.parquet.datetimeRebaseModeInWrite
This is a follow up https://github.com/apache/spark/pull/28137.
### Why are the changes needed?
1. To improve test coverage
2. To make sure that the metadata key is actually saved to Avro/Parquet files
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the added tests:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
```
Closes#30061 from MaxGekk/parquet-test-metakey.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The current solution in the build file for failing the build on compilation warnings (excluding deprecation warnings) is not portable after SBT 1.3.13 (the build import fails with a compilation error on SBT 1.4). It can be replaced with a more robust and maintainable approach, especially since Scala 2.13.2 ships similar built-in functionality.
Additionally, warnings were fixed to pass the build, with as few changes as possible:
- warnings in the 2.12 compilation were fixed in code;
- warnings in the 2.13 compilation are covered by configuration, to be addressed separately.
### Why are the changes needed?
Unblocks upgrade to SBT after 1.3.13.
Enhances build file maintainability.
Allows fine tune of warnings configuration in scope of Scala 2.13 compilation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`build/sbt`'s `compile` and `Test/compile` for both Scala 2.12 and 2.13 profiles.
Closes#29995 from gemelen/feature/warnings-reporter.
Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Currently we generate unquoted column names in ALTER TABLE, e.g. ```ALTER TABLE "test"."alt_table" DROP COLUMN c1```.
This should change to a quoted column name: ```ALTER TABLE "test"."alt_table" DROP COLUMN "c1"```.
### Why are the changes needed?
We should always use quoted identifiers in JDBC SQLs, e.g. ```CREATE TABLE "test"."abc" ("col" INTEGER ) ``` or ```INSERT INTO "test"."abc" ("col") VALUES (?)```. Using unquoted column name in alterTable causes problems, for example:
```
sql("CREATE TABLE h2.test.alt_table (c1 INTEGER, c2 INTEGER) USING _")
sql("ALTER TABLE h2.test.alt_table DROP COLUMN c1")
org.apache.spark.sql.AnalysisException: Failed table altering: test.alt_table;
......
Caused by: org.h2.jdbc.JdbcSQLException: Column "C1" not found; SQL statement:
ALTER TABLE "test"."alt_table" DROP COLUMN c1 [42122-195]
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#30041 from huaxingao/alter_table_followup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Do not use local shuffle reader at final stage if the root node is write command.
### Why are the changes needed?
Users usually repartition with partition column on dynamic partition overwrite. AQE could break it by removing physical shuffle with local shuffle reader. That could lead to a large number of output files, even exceeding the file system limit.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Add test.
Closes#29797 from manuzhang/spark-32932.
Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Follow the convention and rename the metrics `numRows` to `numOutputRows`
### Why are the changes needed?
`FilterExec`, `HashAggregateExec`, etc. all use `numOutputRows`
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#30039 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix a bug on calling `DataType.equalsIgnoreCompatibleNullability` with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The order of parameters for `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, which says that the right order of matching variables are `inAttr` and `outAttr`.
### Why are the changes needed?
Spark throws an AnalysisException due to an unresolved operator in v2 write; the operator is unresolved because of a bug where the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.
### Does this PR introduce _any_ user-facing change?
Yes, end users no longer hit the unresolved operator error in v2 write when they try to write a dataframe containing non-nullable complex types to a table whose matching complex types are nullable.
### How was this patch tested?
New UT added.
Closes#30033 from HeartSaVioR/SPARK-33136.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This patch is a small extension to change-request SPARK-28133, which added inverse hyperbolic functions to the SQL interpreter, but did not include those methods within the Scala `sql.functions._` API. This patch makes `acosh`, `asinh` and `atanh` functions available through the Scala API.
Unit-tests have been added to `sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala`. Manual testing has been done via `spark-shell`, using the following recipe:
```
val df = spark.range(0, 11)
.toDF("x")
.withColumn("x", ($"x" - 5) / 2.0)
val hyps = df.withColumn("tanh", tanh($"x"))
.withColumn("sinh", sinh($"x"))
.withColumn("cosh", cosh($"x"))
val invhyps = hyps.withColumn("atanh", atanh($"tanh"))
.withColumn("asinh", asinh($"sinh"))
.withColumn("acosh", acosh($"cosh"))
invhyps.show
```
which produces the following output:
```
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
| x| tanh| sinh| cosh| atanh| asinh| acosh|
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
|-2.5| -0.9866142981514303|-6.0502044810397875| 6.132289479663686| -2.500000000000001|-2.4999999999999956| 2.5|
|-2.0| -0.9640275800758169| -3.626860407847019|3.7621956910836314|-2.0000000000000004|-1.9999999999999991| 2.0|
|-1.5| -0.9051482536448664|-2.1292794550948173| 2.352409615243247|-1.4999999999999998|-1.4999999999999998| 1.5|
|-1.0| -0.7615941559557649|-1.1752011936438014| 1.543080634815244| -1.0| -1.0| 1.0|
|-0.5|-0.46211715726000974|-0.5210953054937474|1.1276259652063807| -0.5|-0.5000000000000002|0.4999999999999998|
| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|
| 0.5| 0.46211715726000974| 0.5210953054937474|1.1276259652063807| 0.5| 0.5|0.4999999999999998|
| 1.0| 0.7615941559557649| 1.1752011936438014| 1.543080634815244| 1.0| 1.0| 1.0|
| 1.5| 0.9051482536448664| 2.1292794550948173| 2.352409615243247| 1.4999999999999998| 1.5| 1.5|
| 2.0| 0.9640275800758169| 3.626860407847019|3.7621956910836314| 2.0000000000000004| 2.0| 2.0|
| 2.5| 0.9866142981514303| 6.0502044810397875| 6.132289479663686| 2.500000000000001| 2.5| 2.5|
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
```
Closes#29938 from rwpenney/fix/inverse-hyperbolics.
Authored-by: Richard Penney <rwp@rwpenney.uk>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.
### Why are the changes needed?
1. To not raise exceptions to users in the PERMISSIVE mode
2. To fix a regression and to have the same behavior as Spark 2.4.x
3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for bad nested complex JSON fields.
### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the code below:
```scala
val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
pokerhand_events.show
```
throws the exception even in the default **PERMISSIVE** mode:
```java
java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
```
After the changes:
```
+-----+
|event|
+-----+
| null|
+-----+
```
### How was this patch tested?
Added a test to `JsonFunctionsSuite`.
Closes#30031 from MaxGekk/json-skip-row-wrong-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>