ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Chongguang LIU	976e97a80d	[SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode ### What changes were proposed in this pull request? Instead of returning NULL, the next_day function throws a runtime IllegalArgumentException when ANSI mode is enabled and it receives an invalid value for the dayOfWeek parameter. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce _any_ user-facing change? Yes. When spark.sql.ansi.enabled = true, the next_day function throws IllegalArgumentException when it receives an invalid value for the dayOfWeek parameter. When spark.sql.ansi.enabled = false, the behaviour is the same as before. ### How was this patch tested? ANSI mode is tested with existing tests. End-to-end tests have been added. Closes #30807 from chongguang/SPARK-33794. Authored-by: Chongguang LIU <chongguang.liu@laposte.fr> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:20:16 +00:00
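The behaviour change in SPARK-33794 can be modelled outside Spark. The sketch below is plain Python, not Spark code: `next_day`, its `ansi_enabled` flag, and the `_DAYS` alias table are hypothetical stand-ins for the SQL function, `spark.sql.ansi.enabled`, and the accepted day names, and `ValueError` stands in for `IllegalArgumentException`.

```python
from datetime import date, timedelta

# Accepted day-of-week spellings (assumption for this sketch): two-letter,
# three-letter and full names, matched case-insensitively. Monday is 0.
_DAYS = {}
for _index, _name in enumerate(
    ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]
):
    for _alias in (_name[:2], _name[:3], _name):
        _DAYS[_alias] = _index


def next_day(start, day_of_week, ansi_enabled=False):
    """Return the first date strictly after `start` that falls on `day_of_week`."""
    target = _DAYS.get(day_of_week.upper())
    if target is None:
        if ansi_enabled:
            # Model of the ANSI-mode behaviour: fail loudly on invalid input.
            raise ValueError(f"Illegal input for day of week: {day_of_week}")
        # Model of the non-ANSI behaviour: invalid input yields NULL.
        return None
    # 1..7 days ahead; the same weekday as `start` maps to a full week later.
    delta = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=delta)
```

With the flag off, an invalid name quietly produces `None`; with it on, the same call raises, which is the user-facing difference the commit describes.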
tanel.kiis@gmail.com	bb6d6b5602	[SPARK-33964][SQL] Combine distinct unions in more cases ### What changes were proposed in this pull request? Added the `RemoveNoopOperators` rule to optimization batch `Union`. Also made sure that the `RemoveNoopOperators` would be idempotent. ### Why are the changes needed? In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them. The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UTs and the output of `PlanStabilitySuite` Closes #30996 from tanelk/SPARK-33964_combine_unions. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-05 11:01:31 +09:00
Max Gekk	fc3f22645e	[SPARK-33990][SQL][TESTS] Remove partition data by v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests. ### Why are the changes needed? This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belonging to the dropped partition. As a consequence, the `select` query returns the removed data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test suites for v1 and v2 catalogs: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31014 from MaxGekk/fix-drop-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:26:39 -08:00
Terry Kim	ddc0d5148a	[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables ### What changes were proposed in this pull request? This PR proposes to implement `DESCRIBE COLUMN` for v2 tables. Note that the `isExtended` option is not implemented in this PR. ### Why are the changes needed? Parity with v1 tables. ### Does this PR introduce _any_ user-facing change? Yes, now, `DESCRIBE COLUMN` works for v2 tables. ```scala sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo") sql("DESCRIBE testcat.tbl data").show ``` ``` +---------+----------+ \|info_name\|info_value\| +---------+----------+ \| col_name\| data\| \|data_type\| string\| \| comment\| hello\| +---------+----------+ ``` Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.` ### How was this patch tested? Added new test. Closes #30881 from imback82/describe_col_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 16:14:33 +00:00
Dongjoon Hyun	271c4f6e00	[SPARK-33978][SQL] Support ZSTD compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support ZSTD compression in ORC data source. ### Why are the changes needed? Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost. - https://issues.apache.org/jira/browse/ORC-363 BEFORE ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") ``` ```bash $ orc-tools meta /tmp/zstd Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230] Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc File Version: 0.12 with ORC_14 Rows: 1 Compression: ZSTD Compression size: 262144 Calendar: Julian/Gregorian Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 Stripes: Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 6 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 230 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #31002 from dongjoon-hyun/SPARK-33978. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 00:54:47 -08:00
Yuming Wang	2a68ed71e4	[SPARK-33954][SQL] Some operator missing rowCount when enable CBO ### What changes were proposed in this pull request? This PR fixes some operators missing rowCount when CBO is enabled, e.g.: ```scala spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1") spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS") spark.sql("set spark.sql.cbo.enabled=true") spark.sql("set spark.sql.cbo.planStats.enabled=true") spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost") ``` Before this PR: ``` == Optimized Logical Plan == RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB) +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB) +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB) +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) ``` After this PR: ``` == Optimized Logical Plan == RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) ``` ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) needs the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30987 from wangyum/SPARK-33954. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 05:53:14 +00:00
gengjiaan	b037930952	[SPARK-33951][SQL] Distinguish the error between filter and distinct ### What changes were proposed in this pull request? The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use. ### Why are the changes needed? increase readability and ease of use. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? Jenkins test Closes #30982 from beliefer/SPARK-33951. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 05:44:00 +00:00
Liang-Chi Hsieh	963c60fe49	[SPARK-33955][SS] Add latest offsets to source progress ### What changes were proposed in this pull request? This patch proposes to add the latest offset to the source progress for streaming queries. ### Why are the changes needed? Currently we record the start and end offsets per source in the streaming progress. The latest offset is important information for a streaming query, but the progress lacks it. We can use it to track the processing lag and adjust streaming queries, so we should add the latest offset to the source progress. ### Does this PR introduce _any_ user-facing change? Yes, a new metric for the latest source offset is added to the source progress. ### How was this patch tested? Unit test. Manually tested in a Spark cluster: ``` "description" : "KafkaV2[Subscribe[page_view_events]]", "startOffset" : { "page_view_events" : { "2" : 582370921, "4" : 391910836, "1" : 631009201, "3" : 406601346, "0" : 195799112 } }, "endOffset" : { "page_view_events" : { "2" : 583764414, "4" : 392338002, "1" : 632183480, "3" : 407101489, "0" : 197304028 } }, "latestOffset" : { "page_view_events" : { "2" : 589852545, "4" : 394204277, "1" : 637313869, "3" : 409286602, "0" : 203878962 } }, "numInputRows" : 4999997, "inputRowsPerSecond" : 29287.70501405811, ``` Closes #30988 from viirya/latest-offset. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-03 01:31:38 -08:00
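One way to use the new `latestOffset` field is to derive a per-partition lag from a progress report. The sketch below is plain Python over the JSON fragment quoted in SPARK-33955; the helper name `partition_lag` and the definition of lag as `latestOffset - endOffset` are assumptions made for illustration, not something the commit prescribes.

```python
import json

# Source-progress fragment copied from the commit message (wrapped in braces
# so that it parses as a standalone JSON object).
progress = json.loads("""
{
  "description": "KafkaV2[Subscribe[page_view_events]]",
  "endOffset":    {"page_view_events": {"2": 583764414, "4": 392338002,
                                        "1": 632183480, "3": 407101489,
                                        "0": 197304028}},
  "latestOffset": {"page_view_events": {"2": 589852545, "4": 394204277,
                                        "1": 637313869, "3": 409286602,
                                        "0": 203878962}}
}
""")


def partition_lag(source_progress, topic):
    """Offsets available at the source but not yet processed, per partition."""
    end = source_progress["endOffset"][topic]
    latest = source_progress["latestOffset"][topic]
    return {partition: latest[partition] - end[partition] for partition in latest}


lag = partition_lag(progress, "page_view_events")
total_lag = sum(lag.values())
print(lag, total_lag)
```

A steadily growing total would suggest the query is falling behind its source, which is the kind of tuning signal the commit message has in mind.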
Max Gekk	fc7d0165d2	[SPARK-33963][SQL] Canonicalize `HiveTableRelation` w/o table stats ### What changes were proposed in this pull request? Skip table stats in canonicalizing of `HiveTableRelation`. ### Why are the changes needed? The changes fix a regression compared to Spark 3.0, see SPARK-33963. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, Spark behaves as in version 3.0.1. ### How was this patch tested? By running the new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #30995 from MaxGekk/fix-caching-hive-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-03 11:23:46 +09:00
Yuming Wang	6c5ba8169a	[SPARK-33959][SQL] Improve the statistics estimation of the Tail ### What changes were proposed in this pull request? This PR improves the statistics estimation of the `Tail`: ```scala spark.sql("set spark.sql.cbo.enabled=true") spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1") println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats) ``` Before this PR: ``` == Optimized Logical Plan == Tail 5, Statistics(sizeInBytes=3.8 KiB) +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB) ``` After this PR: ``` == Optimized Logical Plan == Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5) +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB) ``` ### Why are the changes needed? Improve statistics estimation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30991 from wangyum/SPARK-33959. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-03 10:59:12 +09:00
Yuming Wang	4cd680581a	[SPARK-33956][SQL] Add rowCount for Range operator ### What changes were proposed in this pull request? This PR adds rowCount for the `Range` operator: ```scala spark.sql("set spark.sql.cbo.enabled=true") spark.sql("select id from range(100)").explain("cost") ``` Before this PR: ``` == Optimized Logical Plan == Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B) ``` After this PR: ``` == Optimized Logical Plan == Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100) ``` ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) needs the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30989 from wangyum/SPARK-33956. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-02 08:58:48 -08:00
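The numbers in the SPARK-33956 plan output can be reproduced by hand. The sketch below is a plain-Python model, not Spark's estimator: `range_stats` is a hypothetical helper, and the 8 bytes per row is an assumption inferred from the quoted plan (100 `bigint` rows reported as 800.0 B).

```python
def range_stats(start, end, step=1, bytes_per_row=8):
    """Model the statistics of a Range(start, end, step) plan node."""
    if step == 0:
        raise ValueError("step must not be zero")
    # len(range(...)) counts the ids without materialising them.
    row_count = len(range(start, end, step))
    return {"rowCount": row_count, "sizeInBytes": row_count * bytes_per_row}


print(range_stats(0, 100))
```

Because the operator's bounds are known when the plan is built, the row count is exact and does not depend on table statistics.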
Liang-Chi Hsieh	f38265ddda	[SPARK-33907][SQL] Only prune columns of from_json if parsing options is empty ### What changes were proposed in this pull request? As a follow-up task to SPARK-32958, this patch takes a safer approach and only prunes columns from JsonToStructs if the parsing options are empty. This avoids unexpected behavior changes regarding parsing. This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed. ### Why are the changes needed? To avoid unexpected behavior changes regarding parsing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30970 from viirya/SPARK-33907-3.2. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-12-30 09:57:15 -08:00
gengjiaan	ba974ea8e4	[SPARK-30789][SQL] Support (IGNORE \| RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE ### What changes were proposed in this pull request? All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS \| RESPECT NULLS. For example: ``` LEAD (value_expr [, offset ]) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ) ``` ``` LAG (value_expr [, offset ]) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ) ``` ``` NTH_VALUE (expr, offset) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] [ ORDER BY window_ordering frame_clause ] ) ``` Mainstream databases and engines that support this syntax include: Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0 Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html Presto https://prestodb.io/docs/current/functions/window.html DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm Teradata https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w Snowflake https://docs.snowflake.com/en/sql-reference/functions/lead.html https://docs.snowflake.com/en/sql-reference/functions/lag.html https://docs.snowflake.com/en/sql-reference/functions/nth_value.html https://docs.snowflake.com/en/sql-reference/functions/first_value.html https://docs.snowflake.com/en/sql-reference/functions/last_value.html Exasol https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm ### Why are the changes needed? Supporting `(IGNORE \| RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` is very useful. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Jenkins test Closes #30943 from beliefer/SPARK-30789. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 13:14:31 +00:00
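The difference between the two null treatments in SPARK-30789 is easiest to see on a small column. The sketch below is a plain-Python model of `LEAD` over a single, already ordered partition; the `lead` helper and its parameters are hypothetical and only illustrate the semantics, not Spark's window implementation.

```python
def lead(values, offset=1, ignore_nulls=False, default=None):
    """Model LEAD(value, offset) [IGNORE NULLS | RESPECT NULLS] over one partition.

    `values` is the column in window order; None plays the role of SQL NULL.
    """
    if offset < 1:
        raise ValueError("this sketch only models offsets >= 1")
    result = []
    for i in range(len(values)):
        if ignore_nulls:
            # Skip NULLs: take the offset-th non-null value after the current row.
            following = [v for v in values[i + 1:] if v is not None]
            result.append(following[offset - 1] if len(following) >= offset else default)
        else:
            # RESPECT NULLS (the default): take the row exactly `offset` rows ahead.
            j = i + offset
            result.append(values[j] if j < len(values) else default)
    return result


column = [1, None, 3, None, 5]
print(lead(column))                     # RESPECT NULLS
print(lead(column, ignore_nulls=True))  # IGNORE NULLS
```

`LAG` mirrors this by looking backwards instead of forwards.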
Max Gekk	2afd1fb492	[SPARK-33904][SQL] Recognize `spark_catalog` in `saveAsTable()` and `insertInto()` ### What changes were proposed in this pull request? In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names. ### Why are the changes needed? 1. To simplify writing of unified v1 and v2 tests 2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` sql command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`: ```scala scala> sql("CREATE NAMESPACE spark_catalog.ns") scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl") org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl. at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629) ... 47 elided scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl") org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl. at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498) ... 47 elided ``` but `INSERT INTO` succeeds: ```sql spark-sql> create table spark_catalog.ns.tbl (c int); spark-sql> insert into spark_catalog.ns.tbl select 0; spark-sql> select * from spark_catalog.ns.tbl; 0 ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```scala scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl") scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl") scala> spark.table("spark_catalog.ns.tbl").show(false) +-----+ \|value\| +-----+ \|0 \| \|1 \| +-----+ ``` ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.FileFormatWriterSuite" ``` Closes #30919 from MaxGekk/insert-into-spark_catalog. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 07:56:34 +00:00
gengjiaan	687f465244	[SPARK-33890][SQL] Improve the implement of trim/trimleft/trimright ### What changes were proposed in this pull request? The current implementation of trim/trimleft/trimright is somewhat redundant. ### Why are the changes needed? Improve the implementation of trim/trimleft/trimright. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test Closes #30905 from beliefer/SPARK-33890. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 06:06:17 +00:00
Max Gekk	2b6836cdc2	[SPARK-33936][SQL] Add the version when connector's methods and interfaces were updated ### What changes were proposed in this pull request? Add the `since` tag to methods and interfaces added recently. ### Why are the changes needed? 1. To follow the existing convention for Spark API. 2. To inform devs when Spark API was changed. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? `dev/scalastyle` Closes #30966 from MaxGekk/spark-23889-interfaces-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-29 12:26:25 -08:00
Yuming Wang	c42502493a	[SPARK-33847][SQL][FOLLOWUP] Remove the CaseWhen should consider deterministic ### What changes were proposed in this pull request? This PR fixes the rule that removes a `CaseWhen` when elseValue is empty and all other outputs are null: the rule must take determinism into account. ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30960 from wangyum/SPARK-33847-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 14:35:01 +00:00
Max Gekk	16c594de79	[SPARK-33859][SQL][FOLLOWUP] Add version to `SupportsPartitionManagement.renamePartition()` ### What changes were proposed in this pull request? Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface. ### Why are the changes needed? To inform Spark devs when the method appears in the interface. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./dev/scalastyle` Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 14:30:37 +00:00
Yuming Wang	872107f67f	[SPARK-33848][SQL][FOLLOWUP] Introduce allowList for push into (if / case) branches ### What changes were proposed in this pull request? Introduce an allowList for pushing into (if / case) branches to fix a potential bug. ### Why are the changes needed? Fix a potential bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30955 from wangyum/SPARK-33848-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:34:43 +00:00
ulysses-you	3b1b209e90	[SPARK-33909][SQL] Check rand functions seed is legal at analyzer side ### What changes were proposed in this pull request? Move the seed legality check to `CheckAnalysis`. ### Why are the changes needed? It is better to check that the seed expression is legal on the analyzer side instead of at execution, so that the user gets the exception as early as possible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30923 from ulysses-you/SPARK-33909. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:33:06 +00:00
Max Gekk	e0d2ffec31	[SPARK-33859][SQL] Support V2 ALTER TABLE .. RENAME PARTITION ### What changes were proposed in this pull request? 1. Add `renamePartition()` to the `SupportsPartitionManagement` 2. Implement `renamePartition()` in `InMemoryPartitionTable` 3. Add v2 execution node `AlterTableRenamePartitionExec` 4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement` 5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs. ### Why are the changes needed? To have feature parity with Datasource V1. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running the unified tests: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #30935 from MaxGekk/alter-table-rename-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:29:48 +00:00
Liang-Chi Hsieh	f9fe742442	[SPARK-32968][SQL] Prune unnecessary columns from CsvToStructs ### What changes were proposed in this pull request? This patch proposes to do column pruning for the CsvToStructs expression if we only require some fields from it. ### Why are the changes needed? `CsvToStructs` takes a schema parameter that tells the CSV parser which fields need to be parsed. If `CsvToStructs` is followed by GetStructField, we can prune the schema to parse only the required fields. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #30912 from viirya/SPARK-32968. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-29 21:37:17 +09:00
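The idea behind the SPARK-32968 pruning can be illustrated without Spark. The sketch below is plain Python: `from_csv`, its `(name, converter)` schema, and the `required` argument are hypothetical and only model the effect of handing the parser a pruned schema, namely that fields nobody reads are never converted.

```python
import csv


def from_csv(line, schema, required=None):
    """Parse one CSV line into a dict, converting only the required fields.

    `schema` is an ordered list of (name, converter) pairs; `required` is the
    set of field names actually read downstream (None means all of them).
    """
    position = {name: i for i, (name, _) in enumerate(schema)}
    wanted = [(n, c) for n, c in schema if required is None or n in required]
    tokens = next(csv.reader([line]))
    # Only the wanted columns are converted; the rest are skipped entirely.
    return {name: convert(tokens[position[name]]) for name, convert in wanted}


schema = [("a", int), ("b", float), ("c", str)]
print(from_csv("1,2.5,x", schema))                  # full schema
print(from_csv("1,2.5,x", schema, required={"b"}))  # pruned schema
```

In the pruned call the converters for `a` and `c` never run, which is where the saving comes from when only one struct field is read.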
Yuming Wang	f7bdea334a	[SPARK-33884][SQL] Simplify CaseWhen clauses with (true and false) and (false and true) ### What changes were proposed in this pull request? This PR simplifies `CaseWhen` clauses with (true and false) and (false and true): Expression \| cond.nullable \| After simplify -- \| -- \| -- case when cond then true else false end \| true \| cond <=> true case when cond then true else false end \| false \| cond case when cond then false else true end \| true \| !(cond <=> true) case when cond then false else true end \| false \| !cond ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30898 from wangyum/SPARK-33884. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 07:09:11 +00:00
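The four rewrites listed in SPARK-33884 can be checked exhaustively with a small model of SQL's three-valued logic. The sketch below is plain Python with `None` standing in for NULL; the helper names are hypothetical and this is a truth-table check, not Spark's optimizer rule.

```python
def case_when(cond, then_value, else_value):
    """CASE WHEN cond THEN then_value ELSE else_value END.

    The THEN branch is taken only when cond is TRUE; FALSE and NULL fall
    through to ELSE.
    """
    return then_value if cond is True else else_value


def null_safe_eq(a, b):
    """SQL `<=>`: like `=`, but never returns NULL (NULL <=> NULL is TRUE)."""
    return a == b


def sql_not(a):
    """SQL NOT under three-valued logic: NOT NULL is NULL."""
    return None if a is None else (not a)


nullable_inputs = [True, False, None]
non_nullable_inputs = [True, False]

# Nullable cond: the null-safe comparison is needed.
true_false_nullable = all(
    case_when(c, True, False) == null_safe_eq(c, True) for c in nullable_inputs)
false_true_nullable = all(
    case_when(c, False, True) == (not null_safe_eq(c, True)) for c in nullable_inputs)

# Non-nullable cond: the expression collapses to cond or !cond.
true_false_plain = all(case_when(c, True, False) == c for c in non_nullable_inputs)
false_true_plain = all(case_when(c, False, True) == sql_not(c) for c in non_nullable_inputs)

# Why nullability matters: with cond = NULL the CASE yields FALSE, not NULL.
null_case = case_when(None, True, False)
```

The last line is the reason a nullable `cond` cannot be substituted directly: the `CASE` expression turns NULL into FALSE, while bare `cond` would stay NULL.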
Max Gekk	379afcd2ce	[SPARK-33924][SQL][TESTS] Preserve partition metadata by INSERT INTO in v2 table catalog ### What changes were proposed in this pull request? For `InMemoryPartitionTable` used in tests, set empty partition metadata only when a partition doesn't exist. ### Why are the changes needed? This bug fix is needed to use `INSERT INTO .. PARTITION` in other tests. ### Does this PR introduce _any_ user-facing change? No. It affects only the v2 table catalog used in tests. ### How was this patch tested? Added new UT to `DataSourceV2SQLSuite`, and run the affected test suite by: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.connector.DataSourceV2SQLSuite" ``` Closes #30952 from MaxGekk/fix-insert-into-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 06:49:26 +00:00
Wenchen Fan	c2eac1de02	[SPARK-33845][SQL][FOLLOWUP] fix SimplifyConditionals ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30849, to fix a correctness issue caused by null value handling. ### Why are the changes needed? Fix a correctness issue. `If(null, true, false)` should return false, not true. ### Does this PR introduce _any_ user-facing change? Yes, but the bug only exists in the master branch. ### How was this patch tested? Updated tests. Closes #30953 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-28 16:44:57 -08:00
Kent Yao	3fdbc48373	[SPARK-33901][SQL] Fix Char and Varchar display error after DDLs ### What changes were proposed in this pull request? After CTAS / CREATE TABLE LIKE / CVAS/ alter table add columns, the target tables will display string instead of char/varchar ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30918 from yaooqinn/SPARK-33901. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 06:48:27 +00:00
yangjie01	1be9e7e40b	[SPARK-33801][CORE][SQL] Fix compilation warnings about 'Unicode escapes in triple quoted strings are deprecated' ### What changes were proposed in this pull request? There are currently 15 compilation warnings about `Unicode escapes in triple quoted strings are deprecated` in the Spark code: ``` [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2930: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2931: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2932: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2933: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2934: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2935: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2936: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2937: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala:82: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:32: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:79: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:97: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:101: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:76: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:83: Unicode escapes in triple quoted strings are deprecated, use the literal character instead ``` This PR fixes these warnings. ### Why are the changes needed? Clean up compilation warnings about `Unicode escapes in triple quoted strings are deprecated`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30926 from LuciferYang/SPARK-33801. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-28 15:29:09 +09:00
Terry Kim	fe33262c91	[SPARK-33918][SQL] UnresolvedView should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifier is incorrect: ``` scala> sql("DROP VIEW unknown") org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be `10`. This PR proposes to fix this issue for commands using `UnresolvedTable`: ``` DROP VIEW v ALTER VIEW v SET TBLPROPERTIES ('k'='v') ALTER VIEW v UNSET TBLPROPERTIES ('k') ALTER VIEW v AS SELECT 1 ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 10; ``` ### How was this patch tested? Add a new suite of tests. Closes #30936 from imback82/position_view_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 05:45:40 +00:00
kozakana	2553d53dc8	[SPARK-33897][SQL] Can't set option 'cross' in join method ### What changes were proposed in this pull request? [The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti." However, I get the following error when I set the cross option. ``` scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show() java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106) at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) ... 53 elided ``` ### Why are the changes needed? The documentation says the cross option can be set, but setting it raises a java.lang.IllegalArgumentException. ### Does this PR introduce _any_ user-facing change? With this fix, `join` behaves as the documentation describes. ### How was this patch tested? There is already a test for [JoinTypes](`1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala`), but I can't find a test for the join option itself. Closes #30803 from kozakana/allow_cross_option. Authored-by: kozakana <goki727@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-26 16:30:50 +09:00
Takeshi Yamamuro	65a9ac2ff4	[SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec ### What changes were proposed in this pull request? This pr intends to support code generation for `HashAggregateExec` with filters. Quick benchmark results: ``` $ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t") scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save() >> Before this PR Elapsed time: 16.170697619s >> After this PR Elapsed time: 6.7825313s ``` The query above is compiled into code below; ```
...
private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException {
  // evaluate aggregate function for avg
  boolean agg_isNull_10 = true;
  double agg_value_12 = -1.0;
  boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0);
  double agg_value_13 = agg_isNull_11 ?
  -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0));
  if (!agg_isNull_11) {
    agg_agg_isNull_12_0 = true;
    double agg_value_14 = -1.0;
    do {
      if (!agg_exprIsNull_2_0) {
        agg_agg_isNull_12_0 = false;
        agg_value_14 = agg_expr_2_0;
        continue;
      }
      if (!false) {
        agg_agg_isNull_12_0 = false;
        agg_value_14 = 0.0D;
        continue;
      }
    } while (false);
    agg_isNull_10 = false; // resultCode could change nullability.
    agg_value_12 = agg_value_13 + agg_value_14;
  }
  boolean agg_isNull_15 = false;
  long agg_value_17 = -1L;
  if (!false && agg_exprIsNull_2_0) {
    boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1);
    long agg_value_20 = agg_isNull_18 ?
    -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
    agg_isNull_15 = agg_isNull_18;
    agg_value_17 = agg_value_20;
  } else {
    boolean agg_isNull_19 = true;
    long agg_value_21 = -1L;
    boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1);
    long agg_value_22 = agg_isNull_20 ?
    -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
    if (!agg_isNull_20) {
      agg_isNull_19 = false; // resultCode could change nullability.
      agg_value_21 = agg_value_22 + 1L;
    }
    agg_isNull_15 = agg_isNull_19;
    agg_value_17 = agg_value_21;
  }
  // update unsafe row buffer
  if (!agg_isNull_10) {
    agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12);
  } else {
    agg_unsafeRowAggBuffer_0.setNullAt(0);
  }
  if (!agg_isNull_15) {
    agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17);
  } else {
    agg_unsafeRowAggBuffer_0.setNullAt(1);
  }
}
...
``` ### Why are the changes needed? For high performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27019 from maropu/AggregateFilterCodegen. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:44:16 -08:00
ulysses-you	9c30116fb4	[SPARK-33857][SQL] Unify the default seed of random functions ### What changes were proposed in this pull request? Unify the seed of random functions 1. Add a placeholder expression `UnresolvedSeed` as the default seed. 2. Change the default seed of `Rand`, `Randn`, `Uuid`, and `Shuffle` to `UnresolvedSeed`. 3. Replace `UnresolvedSeed` with a real seed in the `ResolveRandomSeed` rule. ### Why are the changes needed? `Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if the user doesn't give a seed value. `Rand` and `Randn` do this at construction time. It's better to unify the default seed on the Analyzer side, since `ExpressionWithRandomSeed` is already used for streaming queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes existing tests and adds a new test. Closes #30864 from ulysses-you/SPARK-33857. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:30:34 -08:00
gengjiaan	3e9821edfd	[SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ] ### What changes were proposed in this pull request? Mainstream databases support `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`, but the current implementation of `LEAD`/`LAG` doesn't support this syntax. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77 Presto https://prestodb.io/docs/current/functions/window.html Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm Teradata https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w Snowflake https://docs.snowflake.com/en/sql-reference/functions/lead.html https://docs.snowflake.com/en/sql-reference/functions/lag.html ### Why are the changes needed? Supporting `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG` is very useful. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Jenkins test. Closes #30387 from beliefer/SPARK-33443. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:13:48 +00:00
Yuming Wang	32d4a2b062	[SPARK-33861][SQL] Simplify conditional in predicate ### What changes were proposed in this pull request? This PR simplifies conditionals in predicates; after this change the filter can be pushed down to the data source:

| Expression | After simplification |
| -- | -- |
| IF(cond, trueVal, false) | AND(cond, trueVal) |
| IF(cond, trueVal, true) | OR(NOT(cond), trueVal) |
| IF(cond, false, falseVal) | AND(NOT(cond), falseVal) |
| IF(cond, true, falseVal) | OR(cond, falseVal) |
| CASE WHEN cond THEN trueVal ELSE false END | AND(cond, trueVal) |
| CASE WHEN cond THEN trueVal END | AND(cond, trueVal) |
| CASE WHEN cond THEN trueVal ELSE null END | AND(cond, trueVal) |
| CASE WHEN cond THEN trueVal ELSE true END | OR(NOT(cond), trueVal) |
| CASE WHEN cond THEN false ELSE elseVal END | AND(NOT(cond), elseVal) |
| CASE WHEN cond THEN false END | false |
| CASE WHEN cond THEN true ELSE elseVal END | OR(cond, elseVal) |
| CASE WHEN cond THEN true END | cond |

### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30865 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:10:28 +00:00
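The first rewrite listed in SPARK-33861 above, `IF(cond, trueVal, false)` to `AND(cond, trueVal)`, can be sanity-checked with a small standalone model. This is only a sketch under stated assumptions: it is not Spark code, the helper names (`sql_if`, `sql_and`, `passes_filter`, `first_row_holds`) are hypothetical, SQL NULL is modelled as Python `None`, and "equivalent" means "keeps the same rows when used as a filter predicate". It covers that one row only, not the rest of the list.

```python
from itertools import product

# Three-valued SQL booleans are modelled as True, False, or None (NULL).
VALUES = (True, False, None)


def sql_if(cond, true_val, false_val):
    """IF(cond, a, b): a NULL or false condition selects the else branch."""
    return true_val if cond is True else false_val


def sql_and(left, right):
    """Three-valued AND: false dominates, then NULL, otherwise true."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True


def passes_filter(value):
    """A filter keeps a row only when the predicate is exactly true."""
    return value is True


def first_row_holds():
    """Compare IF(cond, trueVal, false) with AND(cond, trueVal) on all inputs."""
    return all(
        passes_filter(sql_if(c, t, False)) == passes_filter(sql_and(c, t))
        for c, t in product(VALUES, repeat=2)
    )


print(first_row_holds())
```

The comparison goes through `passes_filter` on purpose: the two expressions can differ in whether they yield false or NULL, which does not matter for a filter because neither value keeps the row.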
Terry Kim	f1d3797291	[SPARK-33886][SQL] UnresolvedTable should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers is incorrect: ``` scala> sql("MSCK REPAIR TABLE unknown") org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be `18`. This PR proposes to fix this issue for commands using `UnresolvedTable`: ``` MSCK REPAIR TABLE t LOAD DATA LOCAL INPATH 'filepath' INTO TABLE t TRUNCATE TABLE t SHOW PARTITIONS t ALTER TABLE t RECOVER PARTITIONS ALTER TABLE t ADD PARTITION (p=1) ALTER TABLE t PARTITION (p=1) RENAME TO PARTITION (p=2) ALTER TABLE t DROP PARTITION (p=1) ALTER TABLE t SET SERDEPROPERTIES ('a'='b') COMMENT ON TABLE t IS 'hello' ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 18; ``` ### How was this patch tested? Add a new suite of tests. Closes #30900 from imback82/position_Fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 05:21:39 +00:00
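The expected `pos` values quoted in the position fixes (18 for `MSCK REPAIR TABLE unknown`, 10 for `DROP VIEW unknown`) are the 0-based character offsets of the identifier within the statement text. The sketch below only reproduces that arithmetic with a plain substring search. It says nothing about how Spark's parser tracks positions, and `identifier_position` is a hypothetical helper.

```python
def identifier_position(sql_text, identifier):
    """Return the 0-based offset of the first occurrence of identifier."""
    return sql_text.index(identifier)


# Offsets of the unresolved identifier in the two statements from the log.
print(identifier_position("MSCK REPAIR TABLE unknown", "unknown"))  # 18
print(identifier_position("DROP VIEW unknown", "unknown"))          # 10
```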
Yuming Wang	7ffcfcf7db	[SPARK-33847][SQL] Simplify CaseWhen if elseValue is None ### What changes were proposed in this pull request? 1. Enhance `ReplaceNullWithFalseInPredicate` to replace None of elseValue inside `CaseWhen` with `FalseLiteral` if all branches are `FalseLiteral` . The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN 'a' WHEN id = 3 THEN 'b' end) = 'c'; ``` Before this pr: ``` == Physical Plan == (1) Filter CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` 2. Enhance `SimplifyConditionals` if elseValue is None and all outputs are null. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30852 from wangyum/SPARK-33847. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 14:35:46 +00:00
Max Gekk	cc23581e26	[SPARK-33858][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. RENAME PARTITION` parsing tests to `AlterTableRenamePartitionParserSuite` 2. Place the v1 tests for `ALTER TABLE .. RENAME PARTITION` from `DDLSuite` to `v1.AlterTableRenamePartitionSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to `v2.AlterTableRenamePartitionSuite`, so, the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow to run common `ALTER TABLE .. RENAME PARTITION` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenamePartitionParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenamePartitionSuite" ``` Closes #30863 from MaxGekk/unify-rename-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 12:19:07 +00:00
ulysses-you	f421c172d9	[SPARK-33497][SQL] Override maxRows in some LogicalPlan ### What changes were proposed in this pull request? This PR aims to override the `maxRows` method in the following `LogicalPlan` nodes: * `ReturnAnswer` * `Join` * `Range` * `Sample` * `RepartitionOperation` * `Deduplicate` * `LocalRelation` * `Window` ### Why are the changes needed? 1. Logically, the maximum row count is known for these `LogicalPlan` nodes. 2. Before this PR, some `LogicalPlan` nodes already reported max rows, so covering more nodes lets us eliminate limits in more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30443 from ulysses-you/SPARK-33497. Lead-authored-by: ulysses-you <youxiduo@weidian.com> Co-authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:20:49 +00:00
Max Gekk	34bfb3a31d	[SPARK-33787][SQL] Allow partition purge for v2 tables ### What changes were proposed in this pull request? 1. Add new methods `purgePartition()`/`purgePartitions()` to the interfaces `SupportsPartitionManagement`/`SupportsAtomicPartitionManagement`. 2. Default implementation of new methods throw the exception `UnsupportedOperationException`. 3. Add tests for new methods to `SupportsPartitionManagementSuite`/`SupportsAtomicPartitionManagementSuite`. 4. Add `ALTER TABLE .. DROP PARTITION` tests for DS v1 and v2. Closes #30776 Closes #30821 ### Why are the changes needed? Currently, the `PURGE` option that user can set in `ALTER TABLE .. DROP PARTITION` is completely ignored. We should pass this flag to the catalog implementation, so, the catalog should decide how to handle the flag. ### Does this PR introduce _any_ user-facing change? The changes can impact on behavior of `ALTER TABLE .. DROP PARTITION` for v2 tables. ### How was this patch tested? By running the affected test suites, for instance: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30886 from MaxGekk/purge-partition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:09:48 +00:00
Kent Yao	2287f56a3e	[SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns ### What changes were proposed in this pull request? ```sql spark-sql> select * from t10 where c0='abcd'; 20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd'] scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815) at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:102) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156) ``` `c0` is a partition column, and the query fails in the partition pruning rule. In this PR, we replace char/varchar with string type before the CAST happens. ### Why are the changes needed? Bug fix; see the case above. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #30887 from yaooqinn/SPARK-33879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 16:14:27 +09:00
ulysses-you	e853f068f6	[SPARK-33526][SQL][FOLLOWUP] Fix flaky test due to timeout and fix docs ### What changes were proposed in this pull request? Make test stable and fix docs. ### Why are the changes needed? The query sometimes times out because we set another config after setting the query timeout. ``` sbt.ForkMain$ForkError: java.sql.SQLTimeoutException: Query timed out after 0 seconds at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13(ThriftServerWithSparkContextSuite.scala:107) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13$adapted(ThriftServerWithSparkContextSuite.scala:106) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12(ThriftServerWithSparkContextSuite.scala:106) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12$adapted(ThriftServerWithSparkContextSuite.scala:89) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4(SharedThriftServer.scala:95) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4$adapted(SharedThriftServer.scala:95) ``` The reason is: 1. We execute `set spark.sql.thriftServer.queryTimeout = 1`, after which every operation is limited to 1s. 2. We execute `set spark.sql.thriftServer.interruptOnCancel = false/true`. This statement gets a timeout exception if anything hangs for more than 1s, which is not what we expect. Resetting the timeout before step 2 avoids this problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fix test. Closes #30897 from ulysses-you/SPARK-33526-followup. 
Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 22:43:03 -08:00
Wenchen Fan	ec1560af25	[SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30267 Inspired by https://github.com/apache/spark/pull/30886, it's better to have 2 methods `def dropTable` and `def purgeTable`, than `def dropTable(ident)` and `def dropTable(ident, purge)`. ### Why are the changes needed? 1. make the APIs orthogonal. Previously, `def dropTable(ident, purge)` calls `def dropTable(ident)` and is a superset. 2. simplifies the catalog implementation a little bit. Now the `if (purge) ... else ...` check is done at the Spark side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? existing tests Closes #30890 from cloud-fan/purgeTable. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 11:47:13 +09:00
Erik Krogen	303b8c8773	[SPARK-23862][SQL] Support Java enums from Scala Dataset API ### What changes were proposed in this pull request? Add support for Java Enums (`java.lang.Enum`) from the Scala typed Dataset APIs. This involves adding an implicit for `Encoder` creation in `SQLImplicits`, and updating `ScalaReflection` to handle Java Enums on the serialization and deserialization pathways. Enums are mapped to a `StringType` which is just the name of the Enum value. ### Why are the changes needed? In [SPARK-21255](https://issues.apache.org/jira/browse/SPARK-21255), support for (de)serialization of Java Enums was added, but only when called from Java code. It is common for Scala code to rely on Java libraries that are out of control of the Scala developer. Today, if there is a dependency on some Java code which defines an Enum, it would be necessary to define a corresponding Scala class. This change brings closer feature parity between Scala and Java APIs. ### Does this PR introduce _any_ user-facing change? Yes, previously something like: ``` val ds = Seq(MyJavaEnum.VALUE1, MyJavaEnum.VALUE2).toDS // or val ds = Seq(CaseClass(MyJavaEnum.VALUE1), CaseClass(MyJavaEnum.VALUE2)).toDS ``` would fail. Now, it will succeed. ### How was this patch tested? Additional unit tests are added in `DatasetSuite`. Tests include validating top-level enums, enums inside of case classes, enums inside of arrays, and validating that the Enum is stored as the expected string. Closes #30877 from xkrogen/xkrogen-SPARK-23862-scalareflection-java-enums. Lead-authored-by: Erik Krogen <xkrogen@apache.org> Co-authored-by: Fangshi Li <fli@linkedin.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 09:55:33 -08:00
Kent Yao	6da5cdf1db	[SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location ### What changes were proposed in this pull request? This PR adds the length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over length values we should FAIL ON READ. ### Why are the changes needed? ```sql spark-sql> INSERT INTO t2 VALUES ('1', 'b12345'); Time taken: 0.141 seconds spark-sql> alter table t set location '/tmp/hive_one/t2'; Time taken: 0.095 seconds spark-sql> select * from t; 1 b1234 ``` the above case should fail rather than implicitly applying truncation ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30882 from yaooqinn/SPARK-33876. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 14:24:12 +00:00
Jacob Kim	43a562035c	[SPARK-33846][SQL] Include Comments for a nested schema in StructType.toDDL ### What changes were proposed in this pull request? ```scala val nestedStruct = new StructType() .add(StructField("b", StringType).withComment("Nested comment")) val struct = new StructType() .add(StructField("a", nestedStruct).withComment("comment")) struct.toDDL ``` Currently, returns: ``` `a` STRUCT<`b`: STRING> COMMENT 'comment' ``` With this PR, the code above returns: ``` `a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment' ``` ### Why are the changes needed? My team uses nested columns as first-class citizens, and I thought it would be nice to have comments for nested columns. ### Does this PR introduce _any_ user-facing change? Now, when users call something like this, ```scala spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ") ``` they will get comments for the nested columns. ### How was this patch tested? I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They test whether a nested StructType's comment is included in the DDL string. Closes #30851 from jacobhjkim/structtype-toddl. Authored-by: Jacob Kim <me@jacobkim.io> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 17:55:16 +09:00
Anton Okolnychyi	7bbcbb84c2	[SPARK-33784][SQL] Rename dataSourceRewriteRules batch ### What changes were proposed in this pull request? This PR tries to rename `dataSourceRewriteRules` into something more generic. ### Why are the changes needed? These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837). ### Does this PR introduce _any_ user-facing change? Yes but the changes haven't been released yet. ### How was this patch tested? Existing tests. Closes #30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:29:22 +00:00
Anton Okolnychyi	2562183987	[SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer ### What changes were proposed in this pull request? This PR adds logic to build logical writes introduced in SPARK-33779. Note: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30806 from aokolnychyi/spark-33808. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:23:56 +00:00
ulysses-you	1dd63dccd8	[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value ### What changes were proposed in this pull request? Add cases to match arrays whose element type is primitive. ### Why are the changes needed? We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`. ``` Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215) at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140) ``` The same problem occurs with other arrays whose elements are primitive. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #30868 from ulysses-you/SPARK-33860. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 15:10:46 +09:00
Kent Yao	f5fd10b1bc	[SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar ### What changes were proposed in this pull request? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. For v1 tables, changing the type is not allowed; we fix a regression that used the replaced string type instead of the original char/varchar type when altering char/varchar columns. For v2 tables, char/varchar to string, char(x) to char(x), and char(x)/varchar(x) to varchar(y) if x <= y are valid cases; other changes are invalid. ### Why are the changes needed? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #30833 from yaooqinn/SPARK-33834. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 03:07:26 +00:00
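The v2 rules stated in SPARK-33834 above can be written down as a small decision function. This is a sketch of the stated rules only, not Spark's implementation: `v2_change_allowed` and the `(kind, length)` tuple encoding are assumptions made for illustration, and only char/varchar source columns are modelled.

```python
def v2_change_allowed(old, new):
    """Decide whether a v2 char/varchar column type change is valid.

    old and new are (kind, length) pairs such as ("char", 5);
    the string type is written as ("string", None).
    """
    old_kind, old_len = old
    new_kind, new_len = new
    if old_kind not in ("char", "varchar"):
        raise ValueError("only char/varchar source columns are modelled")
    if new_kind == "string":
        # char/varchar to string is always valid.
        return True
    if old_kind == "char" and new_kind == "char":
        # char(x) to char(x): the length must stay the same.
        return old_len == new_len
    if new_kind == "varchar":
        # char(x)/varchar(x) to varchar(y) is valid only if x <= y.
        return old_len <= new_len
    # Every other change (for example varchar to char) is invalid.
    return False


print(v2_change_allowed(("varchar", 5), ("varchar", 10)))
print(v2_change_allowed(("varchar", 5), ("char", 5)))
```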
angerszhu	7466031632	[SPARK-32106][SQL] Implement script transform in sql/core ### What changes were proposed in this pull request? * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec` * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` for writing data * Add rule `SparkScripts` to convert a script LogicalPlan to a SparkPlan in Spark SQL (without Hive mode) * Add `SparkScriptTransformationSuite` to test Spark-specific cases * Add tests in `SQLQueryTestSuite` And we will close #29085. ### Why are the changes needed? To let users use script transformation without Hive. ### Does this PR introduce _any_ user-facing change? Users can use script transformation without Hive in no-serde mode. For example: default no serde ``` SELECT TRANSFORM(a, b, c) USING 'cat' AS (a int, b string, c long) FROM testData ``` no serde with spec ROW FORMAT DELIMITED ``` SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0002' MAP KEYS TERMINATED BY '\u0003' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0004' MAP KEYS TERMINATED BY '\u0005' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData ``` ### How was this patch tested? Added UT Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-22 11:37:59 +09:00
Yuming Wang	1c77605682	[SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches ### What changes were proposed in this pull request? This PR pushes the `UnaryExpression` into (if / case) branches. The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3; ``` Before this PR: ``` == Physical Plan == (1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3) +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this PR: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` This change can also improve this case: `a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30853 from wangyum/SPARK-33848. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 10:25:23 -08:00


4972 commits