ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Terry Kim	15a863fd54	[SPARK-34001][SQL][TESTS] Remove unused runShowTablesSql() in DataSourceV2SQLSuite.scala ### What changes were proposed in this pull request? After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method. ### Why are the changes needed? To remove unused method. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #31022 from imback82/33382-followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 21:32:49 -08:00
Terry Kim	6b00fdc756	[SPARK-33998][SQL] Provide an API to create an InternalRow in V2CommandExec ### What changes were proposed in this pull request? There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s are duplicated across many commands and it can be moved into `V2CommandExec` to remove duplicate code. ### Why are the changes needed? To clean up duplicate code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test since this is just refactoring. Closes #31020 from imback82/refactor_v2_command. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:32:36 +00:00
Chongguang LIU	976e97a80d	[SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode ### What changes were proposed in this pull request? Instead of returning NULL, the next_day function throws runtime IllegalArgumentException when ansiMode is enable and receiving invalid input of the dayOfWeek parameter. ### Why are the changes needed? For ansiMode. ### Does this PR introduce _any_ user-facing change? Yes. When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter. When spark.sql.ansi.enabled = false, same behaviour as before. ### How was this patch tested? Ansi mode is tested with existing tests. End-to-end tests have been added. Closes #30807 from chongguang/SPARK-33794. Authored-by: Chongguang LIU <chongguang.liu@laposte.fr> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:20:16 +00:00
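The behaviour switch described in the commit above can be modelled without Spark. The following is a minimal sketch, not Spark's implementation: `NextDaySketch`, `parseDayOfWeek`, `nextDay`, and the `ansiEnabled` parameter are hypothetical names standing in for the `next_day` expression and `spark.sql.ansi.enabled`, and the set of accepted abbreviations is an assumption.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters
import java.util.Locale

object NextDaySketch {
  // Hypothetical parser: accepts a few spellings per day; anything else is invalid.
  def parseDayOfWeek(s: String): Option[DayOfWeek] =
    s.trim.toUpperCase(Locale.ROOT) match {
      case "MO" | "MON" | "MONDAY"    => Some(DayOfWeek.MONDAY)
      case "TU" | "TUE" | "TUESDAY"   => Some(DayOfWeek.TUESDAY)
      case "WE" | "WED" | "WEDNESDAY" => Some(DayOfWeek.WEDNESDAY)
      case "TH" | "THU" | "THURSDAY"  => Some(DayOfWeek.THURSDAY)
      case "FR" | "FRI" | "FRIDAY"    => Some(DayOfWeek.FRIDAY)
      case "SA" | "SAT" | "SATURDAY"  => Some(DayOfWeek.SATURDAY)
      case "SU" | "SUN" | "SUNDAY"    => Some(DayOfWeek.SUNDAY)
      case _                          => None
    }

  // None models SQL NULL. With ansiEnabled = true, invalid input throws instead.
  def nextDay(start: LocalDate, dayOfWeek: String, ansiEnabled: Boolean): Option[LocalDate] =
    parseDayOfWeek(dayOfWeek) match {
      case Some(d) => Some(start.`with`(TemporalAdjusters.next(d)))
      case None if ansiEnabled =>
        throw new IllegalArgumentException(s"Illegal input for day of week: $dayOfWeek")
      case None => None
    }
}
```

With `ansiEnabled = false` the invalid input silently becomes `None`, which mirrors the pre-existing NULL result; with `ansiEnabled = true` the same call fails fast.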
tanel.kiis@gmail.com	bb6d6b5602	[SPARK-33964][SQL] Combine distinct unions in more cases ### What changes were proposed in this pull request? Added the `RemoveNoopOperators` rule to optimization batch `Union`. Also made sure that the `RemoveNoopOperators` would be idempotent. ### Why are the changes needed? In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them. The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UTs and the output of `PlanStabilitySuite` Closes #30996 from tanelk/SPARK-33964_combine_unions. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-05 11:01:31 +09:00
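Why a no-op `Project` blocks union combining can be seen on a toy plan tree. This is a simplified sketch under stated assumptions: the `Plan`, `Relation`, `Project`, and `Union` classes and both functions are hypothetical stand-ins, not Catalyst's `RemoveNoopOperators` or `CombineUnions` rules, and distinct handling is left out.

```scala
object CombineUnionsSketch {
  sealed trait Plan { def output: Seq[String] }
  case class Relation(name: String, output: Seq[String]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }
  case class Union(children: Seq[Plan]) extends Plan {
    def output: Seq[String] = children.head.output
  }

  // Drop every Project that outputs exactly the columns of its child.
  def removeNoopProjects(plan: Plan): Plan = plan match {
    case Project(cols, child) =>
      val newChild = removeNoopProjects(child)
      if (cols == newChild.output) newChild else Project(cols, newChild)
    case Union(children) => Union(children.map(removeNoopProjects))
    case r: Relation     => r
  }

  // Flatten unions that are direct children of another union.
  def combineUnions(plan: Plan): Plan = plan match {
    case Union(children) =>
      Union(children.map(combineUnions).flatMap {
        case Union(inner) => inner
        case other        => Seq(other)
      })
    case Project(cols, child) => Project(cols, combineUnions(child))
    case r: Relation          => r
  }
}
```

In this model, `combineUnions` alone leaves `Union(Project(Union(t, t)), t)` unchanged because the `Project` sits between the two unions; running `removeNoopProjects` first lets the same call flatten the plan into a single three-way union.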
Max Gekk	84c1f43669	[SPARK-33987][SQL] Refresh cache in v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? 1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command. 2. Port the test for v1 catalogs to the base suite to run it for v2 table catalog. ### Why are the changes needed? The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987. ### Does this PR introduce _any_ user-facing change? Yes, it could if users have v2 table catalogs. ### How was this patch tested? By running unified tests for `ALTER TABLE .. DROP PARTITION`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 15:00:48 -08:00
Kent Yao	ac4651a7d1	[SPARK-33980][SS] Invalidate char/varchar in spark.readStream.schema ### What changes were proposed in this pull request? invalidate char/varchar in `spark.readStream.schema` just like what we've done for `spark.read.schema` in `da72b87374` ### Why are the changes needed? bugfix, char/varchar is only for table schema while `spark.sql.legacy.charVarcharAsString=false` ### Does this PR introduce _any_ user-facing change? yes, char/varchar will fail to define ss readers when `spark.sql.legacy.charVarcharAsString=false` ### How was this patch tested? new tests Closes #31003 from yaooqinn/SPARK-33980. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 12:59:45 -08:00
Takeshi Yamamuro	414d323d6c	[SPARK-33988][SQL][TEST] Add an option to enable CBO in TPCDSQueryBenchmark ### What changes were proposed in this pull request? This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful so as to monitor performance changes with CBO enabled. ### Why are the changes needed? To monitor performance changes with CBO enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked. Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:31:20 -08:00
Max Gekk	fc3f22645e	[SPARK-33990][SQL][TESTS] Remove partition data by v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests. ### Why are the changes needed? This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belongs to the dropped partition. As a consequence of that, the `select` query returns removed data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests suites for v1 and v2 catalogs: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31014 from MaxGekk/fix-drop-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:26:39 -08:00
Terry Kim	ddc0d5148a	[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables ### What changes were proposed in this pull request? This PR proposes to implement `DESCRIBE COLUMN` for v2 tables. Note that the `isExtended` option is not implemented in this PR. ### Why are the changes needed? Parity with v1 tables. ### Does this PR introduce _any_ user-facing change? Yes, now, `DESCRIBE COLUMN` works for v2 tables. ```scala sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo") sql("DESCRIBE testcat.tbl data").show ``` ``` +---------+----------+ \|info_name\|info_value\| +---------+----------+ \| col_name\| data\| \|data_type\| string\| \| comment\| hello\| +---------+----------+ ``` Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.` ### How was this patch tested? Added new test. Closes #30881 from imback82/describe_col_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 16:14:33 +00:00
angerszhu	8583a4605f	[SPARK-33844][SQL] InsertIntoHiveDir command should check col name too ### What changes were proposed in this pull request? In hive-1.2.1, hive serde just split `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` use comma. When we use spark 2.4 with UT ``` test("insert overwrite directory with comma col name") { withTempDir { dir => val path = dir.toURI.getPath val v1 = s""" \| INSERT OVERWRITE DIRECTORY '${path}' \| STORED AS TEXTFILE \| SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false") """.stripMargin sql(v1).explain(true) sql(v1).show() } } ``` failed with as below since column name contains `,` then column names and column types size not equal. ``` 19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements! 
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` After hive-2.3 we will set COLUMN_NAME_DELIMITER to special char when col name cntains ','： `6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)` `6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)` And in script transform, we parse column 
name to avoid this problem `554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)` So I think in `InsertIntoHiveDirCommand`, we should do the same thing too. And I have verified this method can make spark-2.4 work well. ### Why are the changes needed? Safer use of the serde. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #30850 from AngersZhuuuu/SPARK-33844. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 09:43:15 +00:00
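The "columns has 5 elements while columns.types has 3 elements" mismatch in the commit above follows from joining column names with a comma and splitting on the same character. A minimal sketch, assuming the third column gets an auto-generated name that itself contains commas; the name `(IF((1 = 1), true, false))` and the helper `joinAndSplit` are illustrative assumptions, not taken from the Hive serde code.

```scala
object CommaColumnSketch {
  // Mimic a serde property that stores the column list as one comma-joined string
  // and later splits it on the same delimiter.
  def joinAndSplit(names: Seq[String]): Seq[String] =
    names.mkString(",").split(",").toSeq

  // Assumed column names for the query in the failing test; the third one is an
  // auto-generated expression name containing two commas.
  val columnNames: Seq[String] = Seq("a", "b", "(IF((1 = 1), true, false))")
  val columnTypes: Seq[String] = Seq("int", "string", "string")
}
```

Three names go in but five fragments come out, while the type list still has three entries, which is the size mismatch the serde reports.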
Dongjoon Hyun	271c4f6e00	[SPARK-33978][SQL] Support ZSTD compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support ZSTD compression in ORC data source. ### Why are the changes needed? Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost. - https://issues.apache.org/jira/browse/ORC-363 BEFORE ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") ``` ```bash $ orc-tools meta /tmp/zstd Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230] Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc File Version: 0.12 with ORC_14 Rows: 1 Compression: ZSTD Compression size: 262144 Calendar: Julian/Gregorian Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 Stripes: Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 6 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 230 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #31002 from dongjoon-hyun/SPARK-33978. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 00:54:47 -08:00
Max Gekk	8b3fb43f40	[SPARK-33965][SQL][TESTS] Recognize `spark_catalog` by `CACHE TABLE` in Hive table names ### What changes were proposed in this pull request? Remove special handling of `CacheTable` in `TestHiveQueryExecution. analyzed` because it does not allow to support of `spark_catalog` in Hive table names. `spark_catalog` could be handled by a few lines below: ```scala case UnresolvedRelation(ident, _, _) => if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) { ``` added by https://github.com/apache/spark/pull/30883. ### Why are the changes needed? 1. To have feature parity with v1 In-Memory catalog. 2. To be able to write unified tests for In-Memory and Hive external catalogs. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? By running the test suite with new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #30997 from MaxGekk/cache-table-spark_catalog. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 08:28:26 +00:00
Hoa	0b647fe69c	[SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis ### What changes were proposed in this pull request? The JDBC SQL TIME type is incorrectly represented as TimestampType; this change makes it a physical Int in milliseconds for now. ### Why are the changes needed? Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark TimestampType. It should be represented as a physical int in milliseconds: a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond. It stores the number of milliseconds after midnight, 00:00:00.000. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Close #30902 Closes #30902 from saikocat/SPARK-33888. Lead-authored-by: Hoa <hoameomu@gmail.com> Co-authored-by: Hoa <saikocatz@gmail.com> Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn> Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 06:53:12 +00:00
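The "milliseconds after midnight" encoding mentioned in the commit above can be illustrated with `java.time`. A minimal sketch only: `TimeMillisSketch` and its two functions are hypothetical helpers and do not reproduce the JDBC reader's actual conversion code.

```scala
import java.time.LocalTime

object TimeMillisSketch {
  private val NanosPerMilli = 1000000L

  // Encode a time of day as the number of milliseconds after 00:00:00.000.
  // The maximum value, 86,399,999, fits comfortably in an Int.
  def millisAfterMidnight(t: LocalTime): Int =
    (t.toNanoOfDay / NanosPerMilli).toInt

  // Decode the Int back into a time of day with millisecond precision.
  def fromMillis(ms: Int): LocalTime =
    LocalTime.ofNanoOfDay(ms.toLong * NanosPerMilli)
}
```

For example, 12:34:56.789 encodes to 45296789, and decoding that value returns the same time of day with no date or time zone attached.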
angerszhu	adac633f93	[SPARK-33934][SQL] Add SparkFile's root dir to env property PATH ### What changes were proposed in this pull request? In Hive we always use ``` add file /path/to/script.py; select transform(col1, col2, ..) using 'script.py' as (col1, col2, ...) from ... ``` Since Spark wraps the script command with `/bin/bash -c`, this case fails with `script.py command not found`. This PR adds SparkFile's root dir path to the execution env property `PATH`, so that the subprocess can find `script.py` as a program under `PATH`. ### Why are the changes needed? To support SQL migration from Hive to Spark. ### Does this PR introduce _any_ user-facing change? Users can directly use the script file name as the program in script transform SQL. ``` add file /path/to/script.py; select transform(col1, col2, ..) using 'script.py' as (col1, col2, ...) from ... ``` ### How was this patch tested? UT Closes #30973 from AngersZhuuuu/SPARK-33934. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-04 15:46:49 +09:00
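The idea of extending `PATH` so a child shell can resolve a bare script name can be sketched on a plain environment map. This is an illustration under assumptions: `ScriptPathSketch` and `withDirOnPath` are hypothetical, and appending the directory (rather than prepending it) is a choice made here, not something the commit message states.

```scala
import java.io.File

object ScriptPathSketch {
  // Return a copy of the environment with `dir` appended to PATH.
  // If PATH is missing or empty, PATH becomes just `dir`.
  def withDirOnPath(env: Map[String, String], dir: String): Map[String, String] = {
    val updated = env.get("PATH").filter(_.nonEmpty) match {
      case Some(existing) => existing + File.pathSeparator + dir
      case None           => dir
    }
    env + ("PATH" -> updated)
  }
}
```

The resulting map could then be copied into `ProcessBuilder.environment()` before launching the shell, so that a command such as `script.py` is looked up in the added directory as well.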
Yuming Wang	2a68ed71e4	[SPARK-33954][SQL] Some operator missing rowCount when enable CBO ### What changes were proposed in this pull request? This pr fix some operator missing rowCount when enable CBO, e.g.: ```scala spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1") spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS") spark.sql("set spark.sql.cbo.enabled=true") spark.sql("set spark.sql.cbo.planStats.enabled=true") spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost") ``` Before this pr: ``` == Optimized Logical Plan == RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB) +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB) +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB) +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) ``` After this pr: ``` == Optimized Logical Plan == RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100) +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3) ``` ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) need the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30987 from wangyum/SPARK-33954. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 05:53:14 +00:00
gengjiaan	b037930952	[SPARK-33951][SQL] Distinguish the error between filter and distinct ### What changes were proposed in this pull request? The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use. ### Why are the changes needed? increase readability and ease of use. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? Jenkins test Closes #30982 from beliefer/SPARK-33951. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 05:44:00 +00:00
Max Gekk	67195d0d97	[SPARK-33950][SQL] Refresh cache in v1 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0); spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; spark-sql> CACHE TABLE tbl1; spark-sql> SELECT * FROM tbl1; 0 0 1 1 spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 0 0 1 1 ``` The last query must not return `0 0` since it was deleted by previous command. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 1 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30983 from MaxGekk/drop-partition-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 04:11:39 +00:00
Liang-Chi Hsieh	963c60fe49	[SPARK-33955][SS] Add latest offsets to source progress ### What changes were proposed in this pull request? This patch proposes to add latest offset to source progress for streaming queries. ### Why are the changes needed? Currently we record start and end offsets per source in streaming process. Latest offset is an important information for streaming process but the progress lacks of this info. We can use it to track the process lag and adjust streaming queries. We should add latest offset to source progress. ### Does this PR introduce _any_ user-facing change? Yes, for new metric about latest source offset in source progress. ### How was this patch tested? Unit test. Manually test in Spark cluster: ``` "description" : "KafkaV2[Subscribe[page_view_events]]", "startOffset" : { "page_view_events" : { "2" : 582370921, "4" : 391910836, "1" : 631009201, "3" : 406601346, "0" : 195799112 } }, "endOffset" : { "page_view_events" : { "2" : 583764414, "4" : 392338002, "1" : 632183480, "3" : 407101489, "0" : 197304028 } }, "latestOffset" : { "page_view_events" : { "2" : 589852545, "4" : 394204277, "1" : 637313869, "3" : 409286602, "0" : 203878962 } }, "numInputRows" : 4999997, "inputRowsPerSecond" : 29287.70501405811, ``` Closes #30988 from viirya/latest-offset. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-03 01:31:38 -08:00
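One use of the new `latestOffset` field is estimating how far a query lags behind its source. A minimal sketch, using the per-partition numbers from the progress output in the commit above; `OffsetLagSketch` and `lagPerPartition` are hypothetical helpers, and treating lag as `latestOffset - endOffset` is an assumption about how one might consume the metric.

```scala
object OffsetLagSketch {
  // Lag per partition: how many offsets exist at the source beyond what the
  // batch has processed. Partitions missing from `end` are skipped.
  def lagPerPartition(end: Map[String, Long], latest: Map[String, Long]): Map[String, Long] =
    latest.flatMap { case (partition, latestOffset) =>
      end.get(partition).map(endOffset => partition -> (latestOffset - endOffset))
    }

  // Values copied from the "page_view_events" topic in the sample progress output.
  val endOffset: Map[String, Long] = Map(
    "0" -> 197304028L, "1" -> 632183480L, "2" -> 583764414L,
    "3" -> 407101489L, "4" -> 392338002L)
  val latestOffset: Map[String, Long] = Map(
    "0" -> 203878962L, "1" -> 637313869L, "2" -> 589852545L,
    "3" -> 409286602L, "4" -> 394204277L)
}
```

Summing the per-partition values gives a single lag figure for the topic, which can be compared against `numInputRows` to judge whether the query is keeping up.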
Max Gekk	fc7d0165d2	[SPARK-33963][SQL] Canonicalize `HiveTableRelation` w/o table stats ### What changes were proposed in this pull request? Skip table stats in canonicalizing of `HiveTableRelation`. ### Why are the changes needed? The changes fix a regression comparing to Spark 3.0, see SPARK-33963. ### Does this PR introduce _any_ user-facing change? Yes. After changes Spark behaves as in the version 3.0.1. ### How was this patch tested? By running new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #30995 from MaxGekk/fix-caching-hive-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-03 11:23:46 +09:00
Yuming Wang	6c5ba8169a	[SPARK-33959][SQL] Improve the statistics estimation of the Tail ### What changes were proposed in this pull request? This PR improves the statistics estimation of the `Tail`: ```scala spark.sql("set spark.sql.cbo.enabled=true") spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1") println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats) ``` Before this PR: ``` == Optimized Logical Plan == Tail 5, Statistics(sizeInBytes=3.8 KiB) +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB) ``` After this PR: ``` == Optimized Logical Plan == Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5) +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB) ``` ### Why are the changes needed? Improve statistics estimation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30991 from wangyum/SPARK-33959. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-03 10:59:12 +09:00
Yuming Wang	4cd680581a	[SPARK-33956][SQL] Add rowCount for Range operator ### What changes were proposed in this pull request? This PR adds rowCount for the `Range` operator: ```scala spark.sql("set spark.sql.cbo.enabled=true") spark.sql("select id from range(100)").explain("cost") ``` Before this PR: ``` == Optimized Logical Plan == Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B) ``` After this PR: ``` == Optimized Logical Plan == Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100) ``` ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) needs the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30989 from wangyum/SPARK-33956. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-02 08:58:48 -08:00
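The `rowCount=100` and `sizeInBytes=800.0 B` figures above are consistent with a range of 100 eight-byte longs. The following is a sketch of that arithmetic, not the `Range` operator's code: `RangeStatsSketch`, `rowCount`, and `sizeInBytes` are hypothetical names, and the 8-bytes-per-row figure is an assumption inferred from the printed statistics.

```scala
object RangeStatsSketch {
  // Number of elements in start, start + step, ... strictly before `end`
  // (or strictly after `end` when step is negative). BigInt avoids overflow.
  def rowCount(start: Long, end: Long, step: Long): BigInt = {
    require(step != 0, "step must be non-zero")
    val span = BigInt(end) - BigInt(start)
    val stepAbs = BigInt(step).abs
    if (span.signum != BigInt(step).signum) BigInt(0) // empty range, also covers span == 0
    else (span.abs + stepAbs - 1) / stepAbs           // ceiling division
  }

  // Assumes one 8-byte long column per row.
  def sizeInBytes(rows: BigInt): BigInt = rows * 8
}
```

`rowCount(0, 100, 1)` gives 100 and `sizeInBytes` of that gives 800, matching the plan output; a step that points away from `end` yields an empty range.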
Kent Yao	ed9f728801	[SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options ### What changes were proposed in this pull request? While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in SharedState, the log should be placed in the code block where the warehouse keys are handled. ### Why are the changes needed? Bug fix: remove the ambiguous log emitted when setting spark.sql.warehouse.dir in SparkSession.builder.config, and warn only when setting hive.metastore.warehouse.dir. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30978 from yaooqinn/SPARK-33944. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-31 13:20:31 -08:00
angerszhu	771c538620	[SPARK-33084][SQL][TESTS][FOLLOW-UP] Fix Scala 2.13 UT failure ### What changes were proposed in this pull request? Fix UT according to https://github.com/apache/spark/pull/29966#issuecomment-752830046 Change StructType construct from ``` def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil) ``` to ``` def inputSchema: StructType = new StructType().add("inputColumn", LongType) ``` The whole udf class is : ``` package org.apache.spark.examples.sql import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row class Spark33084 extends UserDefinedAggregateFunction { // Data types of input arguments of this aggregate function def inputSchema: StructType = new StructType().add("inputColumn", LongType) // Data types of values in the aggregation buffer def bufferSchema: StructType = new StructType().add("sum", LongType).add("count", LongType) // The data type of the returned value def dataType: DataType = DoubleType // Whether this function always returns the same output on the identical input def deterministic: Boolean = true // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides // the opportunity to update its values. Note that arrays and maps inside the buffer are still // immutable. 
def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0L buffer(1) = 0L } // Updates the given aggregation buffer `buffer` with new input data from `input` def update(buffer: MutableAggregationBuffer, input: Row): Unit = { if (!input.isNullAt(0)) { buffer(0) = buffer.getLong(0) + input.getLong(0) buffer(1) = buffer.getLong(1) + 1 } } // Merges two aggregation buffers and stores the updated buffer values back to `buffer1` def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0) buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1) } // Calculates the final result def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1) } ``` ### Why are the changes needed? Fix UT for scala 2.13 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #30980 from AngersZhuuuu/spark-33084-followup. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-31 13:18:31 -08:00
Liang-Chi Hsieh	f38265ddda	[SPARK-33907][SQL] Only prune columns of from_json if parsing options is empty ### What changes were proposed in this pull request? As a follow-up task to SPARK-32958, this patch takes safer approach to only prune columns from JsonToStructs if the parsing option is empty. It is to avoid unexpected behavior change regarding parsing. This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed. ### Why are the changes needed? It is to avoid unexpected behavior change regarding parsing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30970 from viirya/SPARK-33907-3.2. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-12-30 09:57:15 -08:00
gengjiaan	ba974ea8e4	[SPARK-30789][SQL] Support (IGNORE \| RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE ### What changes were proposed in this pull request? All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS \| RESPECT NULLS. For example: ``` LEAD (value_expr [, offset ]) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ) ``` ``` LAG (value_expr [, offset ]) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ) ``` ``` NTH_VALUE (expr, offset) [ IGNORE NULLS \| RESPECT NULLS ] OVER ( [ PARTITION BY window_partition ] [ ORDER BY window_ordering frame_clause ] ) ``` The mainstream database or engine supports this syntax contains: Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0 Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html Presto https://prestodb.io/docs/current/functions/window.html DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm Teradata https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w Snowflake https://docs.snowflake.com/en/sql-reference/functions/lead.html https://docs.snowflake.com/en/sql-reference/functions/lag.html https://docs.snowflake.com/en/sql-reference/functions/nth_value.html https://docs.snowflake.com/en/sql-reference/functions/first_value.html https://docs.snowflake.com/en/sql-reference/functions/last_value.html Exasol https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm 
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm ### Why are the changes needed? Support `(IGNORE \| RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE `is very useful. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Jenkins test Closes #30943 from beliefer/SPARK-30789. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 13:14:31 +00:00
Max Gekk	2afd1fb492	[SPARK-33904][SQL] Recognize `spark_catalog` in `saveAsTable()` and `insertInto()` ### What changes were proposed in this pull request? In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names. ### Why are the changes needed? 1. To simplify writing of unified v1 and v2 tests 2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` sql command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`: ```scala scala> sql("CREATE NAMESPACE spark_catalog.ns") scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl") org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl. at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629) ... 47 elided scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl") org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl. at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498) ... 47 elided ``` but `INSERT INTO` succeed: ```sql spark-sql> create table spark_catalog.ns.tbl (c int); spark-sql> insert into spark_catalog.ns.tbl select 0; spark-sql> select * from spark_catalog.ns.tbl; 0 ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```scala scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl") scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl") scala> spark.table("spark_catalog.ns.tbl").show(false) +-----+ \|value\| +-----+ \|0 \| \|1 \| +-----+ ``` ### How was this patch tested? 
By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .FileFormatWriterSuite" ``` Closes #30919 from MaxGekk/insert-into-spark_catalog. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 07:56:34 +00:00
Max Gekk	0eb4961ca8	[SPARK-33926][SQL] Improve the error message from resolving of v1 database name ### What changes were proposed in this pull request? 1. Replace `SessionCatalogAndNamespace` by `DatabaseInSessionCatalog` in resolving database name from v1 session catalog. 2. Throw more precise errors from `DatabaseInSessionCatalog` 3. Fix expected error messages in `v1.ShowTablesSuiteBase` Closes #30947 ### Why are the changes needed? Current error message "multi-part identifier cannot be empty" may confuse users. And this error message is just a consequence of "incorrectly" applied an implicit class. For example, `SHOW TABLES IN spark_catalog`: 1. Spark cuts off `spark_catalog` from namespaces in `SessionCatalogAndNamespace`, so, `ns == Seq.empty` here: `0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L365)` 2. Then `ns.length != 1` is `true` and Spark tries to raise the exception at `0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L367)` 3. ... but `ns.quoted` triggers implicit wrapping `Seq.empty` by `MultipartIdentifierHelper`, and hit to the second check `if (parts.isEmpty)` at `156704ba0d/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala (L120-L122)` So, Spark throws the exception at third step instead of `new AnalysisException(s"The database name is not valid: $quoted")` on the second step. And even on the second step, the exception doesn't show actual reason as it is pretty generic. ### Does this PR introduce _any_ user-facing change? Yes in the case of v1 DDL commands when a database is not specified or nested databases is set. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly DDLSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly ShowTablesSuite" ``` Closes #30963 from MaxGekk/database-in-session-catalog. 
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 07:52:34 +00:00
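The three-step failure described in SPARK-33926 above can be reproduced in miniature. The sketch below is a simplified stand-in, not Spark's code: `DemoAnalysisException`, `DemoMultipartIdentifierHelper`, and `checkDatabase` are hypothetical names chosen for illustration. It only assumes what the commit message states, namely that the implicit wrapper rejects an empty sequence as soon as it is applied.

```scala
object EmptyNamespaceDemo {
  // Stand-in for Spark's AnalysisException; hypothetical, for illustration only.
  final class DemoAnalysisException(msg: String) extends Exception(msg)

  // Simplified stand-in for the implicit helper named in the commit message:
  // the emptiness check runs as soon as the wrapper is constructed.
  implicit class DemoMultipartIdentifierHelper(parts: Seq[String]) {
    if (parts.isEmpty) {
      throw new DemoAnalysisException("multi-part identifier cannot be empty.")
    }
    def quoted: String = parts.mkString(".")
  }

  // Mirrors steps 2 and 3: building the intended message calls `ns.quoted`,
  // which throws first when `ns` is empty.
  def checkDatabase(ns: Seq[String]): String = {
    if (ns.length != 1) {
      throw new DemoAnalysisException(s"The database name is not valid: ${ns.quoted}")
    }
    ns.head
  }

  def main(args: Array[String]): Unit = {
    val message =
      try checkDatabase(Seq.empty)
      catch { case e: DemoAnalysisException => e.getMessage }
    println(message)
  }
}
```

With an empty namespace the caller sees the wrapper's message rather than the intended "database name is not valid" text, which is the confusion the commit sets out to remove.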
gengjiaan	687f465244	[SPARK-33890][SQL] Improve the implement of trim/trimleft/trimright ### What changes were proposed in this pull request? The current implementation of trim/trimleft/trimright is somewhat redundant. ### Why are the changes needed? To improve the implementation of trim/trimleft/trimright. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins test Closes #30905 from beliefer/SPARK-33890. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 06:06:17 +00:00
angerszhu	49aa6ebef1	[SPARK-32684][SQL][TESTS] Add a test case to check if null value is same as Hive's '\\N' in script transformation ### What changes were proposed in this pull request? In Hive's script transform serde mode, the default NULL format is `\\N`: ``` String nullString = tbl.getProperty( serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); nullSequence = new Text(nullString); ``` I made a mistake here in Spark's code, which needs to be fixed to stay consistent with Hive. This PR adds test cases that show the issue. ### Why are the changes needed? To add unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #30946 from AngersZhuuuu/SPARK-32684. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-30 05:28:01 +00:00
Max Gekk	2b6836cdc2	[SPARK-33936][SQL] Add the version when connector's methods and interfaces were updated ### What changes were proposed in this pull request? Add the `since` tag to methods and interfaces added recently. ### Why are the changes needed? 1. To follow the existing convention for Spark API. 2. To inform devs when Spark API was changed. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? `dev/scalastyle` Closes #30966 from MaxGekk/spark-23889-interfaces-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-29 12:26:25 -08:00
Yuming Wang	c42502493a	[SPARK-33847][SQL][FOLLOWUP] Remove the CaseWhen should consider deterministic ### What changes were proposed in this pull request? This PR fixes the rule that removes a `CaseWhen` when elseValue is empty and all other outputs are null: the rule must also take determinism into account. ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30960 from wangyum/SPARK-33847-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 14:35:01 +00:00
Max Gekk	16c594de79	[SPARK-33859][SQL][FOLLOWUP] Add version to `SupportsPartitionManagement.renamePartition()` ### What changes were proposed in this pull request? Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface. ### Why are the changes needed? To inform Spark devs when the method appears in the interface. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `./dev/scalastyle` Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 14:30:37 +00:00
angerszhu	aadda4b561	[SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde ### What changes were proposed in this pull request? For same SQL ``` SELECT TRANSFORM(a, b, c, null) ROW FORMAT DELIMITED USING 'cat' ROW FORMAT DELIMITED FIELDS TERMINATED BY '&' FROM (select 1 as a, 2 as b, 3 as c) t ``` In hive: ``` hive> SELECT TRANSFORM(a, b, c, null) > ROW FORMAT DELIMITED > USING 'cat' > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '&' > FROM (select 1 as a, 2 as b, 3 as c) t; OK 123\N NULL Time taken: 14.519 seconds, Fetched: 1 row(s) hive> packet_write_wait: Connection to 10.191.58.100 port 32200: Broken pipe ``` In Spark ``` Spark master: local[*], Application Id: local-1609225830376 spark-sql> SELECT TRANSFORM(a, b, c, null) > ROW FORMAT DELIMITED > USING 'cat' > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '&' > FROM (select 1 as a, 2 as b, 3 as c) t; 1 2 3 null NULL Time taken: 4.297 seconds, Fetched 1 row(s) spark-sql> ``` We should keep same. Change default ROW FORMAT FIELD DELIMIT to `\u0001` In hive default value is '1' to char is '\u0001' ``` bucket_count -1 column.name.delimiter , columns columns.comments columns.types file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat ``` ### Why are the changes needed? Keep same behavior with hive ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #30958 from AngersZhuuuu/SPARK-33930. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-29 23:26:27 +09:00
Yuming Wang	872107f67f	[SPARK-33848][SQL][FOLLOWUP] Introduce allowList for push into (if / case) branches ### What changes were proposed in this pull request? Introduce an allowList for the push into (if / case) branches to fix a potential bug. ### Why are the changes needed? To fix a potential bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30955 from wangyum/SPARK-33848-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:34:43 +00:00
ulysses-you	3b1b209e90	[SPARK-33909][SQL] Check rand functions seed is legal at analyer side ### What changes were proposed in this pull request? Move the check that the seed is legal to `CheckAnalysis`. ### Why are the changes needed? It is better to check that the seed expression is legal on the analyzer side rather than at execution, so that users get the exception as early as possible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test. Closes #30923 from ulysses-you/SPARK-33909. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:33:06 +00:00
Max Gekk	e0d2ffec31	[SPARK-33859][SQL] Support V2 ALTER TABLE .. RENAME PARTITION ### What changes were proposed in this pull request? 1. Add `renamePartition()` to the `SupportsPartitionManagement` 2. Implement `renamePartition()` in `InMemoryPartitionTable` 3. Add v2 execution node `AlterTableRenamePartitionExec` 4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement` 5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs. ### Why are the changes needed? To have feature parity with Datasource V1. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running the unified tests: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #30935 from MaxGekk/alter-table-rename-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 13:29:48 +00:00
Liang-Chi Hsieh	f9fe742442	[SPARK-32968][SQL] Prune unnecessary columns from CsvToStructs ### What changes were proposed in this pull request? This patch proposes to do column pruning for CsvToStructs expression if we only require some fields from it. ### Why are the changes needed? `CsvToStructs` takes a schema parameter used to tell CSV Parser what fields are needed to parse. If `CsvToStructs` is followed by GetStructField. We can prune the schema to only parse certain field. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #30912 from viirya/SPARK-32968. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-29 21:37:17 +09:00
Yuming Wang	f7bdea334a	[SPARK-33884][SQL] Simplify CaseWhenclauses with (true and false) and (false and true) ### What changes were proposed in this pull request? This PR simplifies `CaseWhen` clauses with (true and false) and (false and true):

Expression | cond.nullable | After simplify
-- | -- | --
case when cond then true else false end | true | cond <=> true
case when cond then true else false end | false | cond
case when cond then false else true end | true | !(cond <=> true)
case when cond then false else true end | false | !cond

### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30898 from wangyum/SPARK-33884. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 07:09:11 +00:00
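The four rewrites listed in SPARK-33884 above can be checked by brute force over SQL's three truth values. This is a minimal sketch in plain Scala, not Spark's optimizer rule: it models a nullable boolean as `Option[Boolean]` (with `None` standing for NULL), and `caseWhen` and `nullSafeEqTrue` are hypothetical helper names.

```scala
object CaseWhenSimplification {
  // Three-valued SQL boolean: None stands for NULL.
  type SqlBool = Option[Boolean]

  // CASE WHEN cond THEN t ELSE e END: the ELSE branch is taken unless cond is exactly TRUE.
  def caseWhen(cond: SqlBool, t: Boolean, e: Boolean): Boolean =
    if (cond.contains(true)) t else e

  // Null-safe equality `cond <=> true`: never NULL, true only when cond is TRUE.
  def nullSafeEqTrue(cond: SqlBool): Boolean = cond.contains(true)

  def main(args: Array[String]): Unit = {
    val nullable: Seq[SqlBool] = Seq(Some(true), Some(false), None)
    val nonNullable: Seq[SqlBool] = Seq(Some(true), Some(false))

    // Rows 1 and 3: cond may be NULL, so the rewrite needs `<=>`.
    for (c <- nullable) {
      assert(caseWhen(c, t = true, e = false) == nullSafeEqTrue(c))
      assert(caseWhen(c, t = false, e = true) == !nullSafeEqTrue(c))
    }
    // Rows 2 and 4: cond is never NULL, so it can be used (or negated) directly.
    for (c <- nonNullable) {
      assert(caseWhen(c, t = true, e = false) == c.get)
      assert(caseWhen(c, t = false, e = true) == !c.get)
    }
    println("all four rewrites agree with the original CASE WHEN")
  }
}
```

The NULL case is what separates the nullable rows from the non-nullable ones: a NULL condition falls through to the ELSE branch, so a plain `cond` or `!cond` would yield NULL where the original expression yields a definite value.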
Max Gekk	379afcd2ce	[SPARK-33924][SQL][TESTS] Preserve partition metadata by INSERT INTO in v2 table catalog ### What changes were proposed in this pull request? For `InMemoryPartitionTable` used in tests, set empty partition metadata only when a partition doesn't exists. ### Why are the changes needed? This bug fix is needed to use `INSERT INTO .. PARTITION` in other tests. ### Does this PR introduce _any_ user-facing change? No. It affects only the v2 table catalog used in tests. ### How was this patch tested? Added new UT to `DataSourceV2SQLSuite`, and run the affected test suite by: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.connector.DataSourceV2SQLSuite" ``` Closes #30952 from MaxGekk/fix-insert-into-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-29 06:49:26 +00:00
HyukjinKwon	b33fa53385	[SPARK-33925][CORE] Remove unused SecurityManager in Utils.fetchFile ### What changes were proposed in this pull request? This is kind of a followup of https://github.com/apache/spark/pull/24033. The first and last usage of the `SecurityManager` argument was removed in https://github.com/apache/spark/pull/24033. Since then, there has been no need to pass `SecurityManager` to `Utils.fetchFile` and related code paths. This PR proposes to remove it. ### Why are the changes needed? For better readability of the code. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually compiled. GitHub Actions and Jenkins build should test it out as well. Closes #30945 from HyukjinKwon/SPARK-33925. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-28 16:58:42 -08:00
Wenchen Fan	c2eac1de02	[SPARK-33845][SQL][FOLLOWUP] fix SimplifyConditionals ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30849, to fix a correctness issue caused by null value handling. ### Why are the changes needed? Fix a correctness issue. `If(null, true, false)` should return false, not true. ### Does this PR introduce _any_ user-facing change? Yes, but the bug only exist in the master branch. ### How was this patch tested? updated tests. Closes #30953 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-28 16:44:57 -08:00
Max Gekk	0617dfce7b	[SPARK-33899][SQL] Fix assert failure in v1 SHOW TABLES/VIEWS on `spark_catalog` ### What changes were proposed in this pull request? Remove `assert(ns.nonEmpty)` in `ResolveSessionCatalog` for: - `SHOW TABLES` - `SHOW TABLE EXTENDED` - `SHOW VIEWS` ### Why are the changes needed? Spark SQL shouldn't fail with internal assert failures even for invalid user inputs. For instance: ```sql spark-sql> show tables in spark_catalog; 20/12/24 11:19:46 ERROR SparkSQLDriver: Failed in [show tables in spark_catalog] java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:366) at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, for the example above: ```sql spark-sql> show tables in spark_catalog; Error in query: multi-part identifier cannot be empty. ``` ### How was this patch tested? Added new UT to `v1/ShowTablesSuite`. Closes #30915 from MaxGekk/remove-assert-ns-nonempty. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 09:07:21 +00:00
angerszhu	fc508d1898	[SPARK-32685][SQL] When specify serde, default filed.delim is '\t' ### What changes were proposed in this pull request? In Hive script transform, when a serde is specified, the default `field.delim` is '\t': ![image](https://user-images.githubusercontent.com/46485123/103187960-7dd77800-4901-11eb-8241-f4636e66fbc8.png) Switching to another serde and explaining the query plan shows the same `field.delim`. With Spark's current code, the result is as below: ![image](https://user-images.githubusercontent.com/46485123/103187999-95aefc00-4901-11eb-9850-5c385000b78c.png) Spark should behave the same as Hive. Note: the difference in the result's NULL value is a separate issue, tracked at https://issues.apache.org/jira/browse/SPARK-32684 ### Why are the changes needed? To keep the same behavior as the Hive serde. ### Does this PR introduce _any_ user-facing change? In script transform, when `field.delim` is not specified, it defaults to `\t`, the same as Hive. ### How was this patch tested? UT added Closes #30942 from AngersZhuuuu/SPARK-32685. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 08:23:01 +00:00
yi.wu	00fa49aeaa	[SPARK-33923][SQL][TESTS] Fix some tests with AQE enabled ### What changes were proposed in this pull request? * Remove the explicit AQE disable confs * Use `AdaptiveSparkPlanHelper` to check plans * No longer extend `DisableAdaptiveExecutionSuite` in `BucketedReadSuite`; instead, disable AQE only for two specific tests there. ### Why are the changes needed? Some tests that were fixed in https://github.com/apache/spark/pull/30655 don't really require AQE to be off. Instead, they can use `AdaptiveSparkPlanHelper` to pass when AQE is on. It's better to run tests with AQE on since we've turned it on by default. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass all tests and the updated tests. Closes #30941 from Ngone51/SPARK-33680-follow-up. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-28 00:03:45 -08:00
Liang-Chi Hsieh	c75f779fd7	[SPARK-33827][SS] Unload inactive state store as soon as possible ### What changes were proposed in this pull request? This patch proposes to unload inactive state store as soon as possible. The timing of unload inactive state stores, happens when we get to load active state store provider at executors. At the time, state store coordinator will return back the state store provider list including loaded stores that are already loaded by other executors in new batch. Each state store provider in the list will go to unload. ### Why are the changes needed? Per the discussion at #30770, it makes sense to me we should unload inactive state store asap. Now we run a maintenance task periodically to unload inactive state stores. So there will be some delays between a state store becomes inactive and it is unloaded. However, we can force Spark to always allocate a state store to same executor, by using task locality configuration. This can reduce the possibility to have inactive state store. Normally, with locality configuration, we might not able to see inactive state store generally. There is still chance an executor can be failed and reallocated, but in this case, inactive state store is also lost too. So it is not an issue. Making driver-executor bi-directional for unloading inactive state store looks non-trivial, and seems to me, it is not worth, after considering what we can do with locality. This proposes a simpler but effective approach. We can check if loaded state store is already loaded at other executor during reporting active state store to the coordinator. If so, it means the loaded store is inactive now, and it is going to be unload by the next maintenance task. Then we unload that store immediately. How do we make sure the loaded state store in previous batch is loaded at other executor in this batch before reporting in this executor? 
With task locality and preferred location, once an executor is ready to be scheduled, Spark should assign the state store provider previously loaded at the executor. So when this executor gets a new assignment other than previously loaded state store, it means the previously loaded one is already assigned to other executor. There is still a delay between the state store is loaded at other executor, and unloading it when reporting active state store at this executor. But it should be minimized now. And there won't be multiple state store belonging to same operator are loaded at the same time at one single executor, because once the executor reports any active store, it will unload all inactive stores. This should not be an issue IMHO. This is a minimal change to unload inactive state store asap without significant change. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30827 from viirya/SPARK-33827. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2020-12-28 16:52:56 +09:00
Max Gekk	4a61fc1a92	[SPARK-33914][SQL][DOCS] Describe the structure of unified DS v1 and v2 tests ### What changes were proposed in this pull request? Add comments to the unified datasource tests that describe what kind of tests they contain, and add references to other test suites. ### Why are the changes needed? To improve code maintenance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #30929 from MaxGekk/doc-unified-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 07:03:29 +00:00
angerszhu	0a3f3d609d	[SPARK-33908][CORE] Refactor SparkSubmitUtils.resolveMavenCoordinates() 's return parameter ### What changes were proposed in this pull request? Per the discussion in https://github.com/apache/spark/pull/29966#discussion_r531917374, change the return value of `SparkSubmitUtils.resolveMavenCoordinates()` to `Seq[String]`. ### Why are the changes needed? To refactor the code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #30922 from AngersZhuuuu/SPARK-33908. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-28 16:00:24 +09:00
Kent Yao	3fdbc48373	[SPARK-33901][SQL] Fix Char and Varchar display error after DDLs ### What changes were proposed in this pull request? After CTAS / CREATE TABLE LIKE / CVAS/ alter table add columns, the target tables will display string instead of char/varchar ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30918 from yaooqinn/SPARK-33901. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 06:48:27 +00:00
yangjie01	1be9e7e40b	[SPAKR-33801][CORE][SQL] Fix compilation warnings about 'Unicode escapes in triple quoted strings are deprecated' ### What changes were proposed in this pull request? There are total 15 compilation warnings about `Unicode escapes in triple quoted strings are deprecated` in Spark code now: ``` [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2930: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2931: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2932: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2933: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2934: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2935: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2936: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2937: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala:82: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] 
/spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:32: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:79: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:97: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:101: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:76: Unicode escapes in triple quoted strings are deprecated, use the literal character instead [WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:83: Unicode escapes in triple quoted strings are deprecated, use the literal character instead ``` This pr try to fix these warnnings. ### Why are the changes needed? Cleanup compilation warnings about `Unicode escapes in triple quoted strings are deprecated` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30926 from LuciferYang/SPARK-33801. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-28 15:29:09 +09:00
Terry Kim	fe33262c91	[SPARK-33918][SQL] UnresolvedView should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect: ``` scala> sql("DROP VIEW unknown") org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be `10`. This PR proposes to fix this issue for commands using `UnresolvedTable`: ``` DROP VIEW v ALTER VIEW v SET TBLPROPERTIES ('k'='v') ALTER VIEW v UNSET TBLPROPERTIES ('k') ALTER VIEW v AS SELECT 1 ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 10; ``` ### How was this patch tested? Add a new suite of tests. Closes #30936 from imback82/position_view_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-28 05:45:40 +00:00
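The expected `pos 10` in SPARK-33918 above is the 0-based column at which the view name starts in the statement text. A minimal sketch of that arithmetic follows; `startPosition` is a hypothetical helper that uses a plain string search, whereas Spark itself takes the position from the parser, so this only illustrates where the number comes from.

```scala
object ViewNamePosition {
  // Hypothetical helper: 0-based column where `name` first appears in a single-line statement.
  def startPosition(sqlText: String, name: String): Int = sqlText.indexOf(name)

  def main(args: Array[String]): Unit = {
    // "DROP VIEW " occupies columns 0 to 9, so the identifier starts at column 10.
    val pos = startPosition("DROP VIEW unknown", "unknown")
    println(s"line 1 pos $pos")
  }
}
```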
yangjie01	e6f019836c	[SPARK-33532][SQL] Add comments to a unreachable branch in SpecificParquetRecordReaderBase.initialize method ### What changes were proposed in this pull request? This PR mainly adds a comment to the `rowGroupOffsets != null` branch in `SpecificParquetRecordReaderBase.initialize(InputSplit, TaskAttemptContext)` to indicate that Spark's Parquet read path no longer enters this branch after SPARK-13883 and SPARK-13989. The branch is not deleted because PARQUET-131 wants to move `SpecificParquetRecordReaderBase` into the parquet-mr project. ### Why are the changes needed? Add a useful comment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30484 from LuciferYang/SPARK-33532. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-28 14:07:50 +09:00
yangjie01	37ae0a6086	[SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process ### What changes were proposed in this pull request? Similar to SPARK-33441, this PR adds an `unused-import` check to the Maven compilation process. After this PR, an unused import triggers a Maven compilation error. For the Scala 2.13 profile, this PR also leaves a TODO(SPARK-33499), as SPARK-33441 did, because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1. ### Why are the changes needed? To make the Maven build also treat unused imports as compilation errors. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Local manual test: add an unused import intentionally to trigger a Maven compilation error. Closes #30784 from LuciferYang/SPARK-33560. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-26 17:40:19 -06:00
kozakana	2553d53dc8	[SPARK-33897][SQL] Can't set option 'cross' in join method ### What changes were proposed in this pull request? [The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti." However, I get the following error when I set the cross option. ``` scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show() java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106) at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) ... 53 elided ``` ### Why are the changes needed? The documentation says the cross option can be set, but setting it throws a java.lang.IllegalArgumentException. ### Does this PR introduce _any_ user-facing change? With this fix, `join` accepts the `cross` option, as the documentation describes. ### How was this patch tested? There is already a test for [JoinTypes](`1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala`), but I can't find a test for the join option itself. Closes #30803 from kozakana/allow_cross_option. Authored-by: kozakana <goki727@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-26 16:30:50 +09:00
angerszhu	10b6466e91	[SPARK-33084][CORE][SQL] Add jar support ivy path ### What changes were proposed in this pull request? Support `ADD JAR` with an ivy path. ### Why are the changes needed? Since application submission already supports ivy, `ADD JAR` can support it too. ### Does this PR introduce _any_ user-facing change? Users can add a jar with SQL like ``` add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false ``` or through the core API ``` sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=true") sparkContext.addJar("ivy://group:artifact:version?exclude=xxx,xxx&transitive=false") ``` #### Doc Update snapshot ![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png) ### How was this patch tested? Added UT Closes #29966 from AngersZhuuuu/support-add-jar-ivy. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-25 09:07:48 +09:00
Takeshi Yamamuro	65a9ac2ff4	[SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec ### What changes were proposed in this pull request? This PR adds code generation support for `HashAggregateExec` with filters. Quick benchmark results:
```
$ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v
scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t")
scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save()

>> Before this PR
Elapsed time: 16.170697619s

>> After this PR
Elapsed time: 6.7825313s
```
The query above is compiled into the code below:
```
...
private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException {
  // evaluate aggregate function for avg
  boolean agg_isNull_10 = true;
  double agg_value_12 = -1.0;
  boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0);
  double agg_value_13 = agg_isNull_11 ?
    -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0));
  if (!agg_isNull_11) {
    agg_agg_isNull_12_0 = true;
    double agg_value_14 = -1.0;
    do {
      if (!agg_exprIsNull_2_0) {
        agg_agg_isNull_12_0 = false;
        agg_value_14 = agg_expr_2_0;
        continue;
      }

      if (!false) {
        agg_agg_isNull_12_0 = false;
        agg_value_14 = 0.0D;
        continue;
      }

    } while (false);

    agg_isNull_10 = false; // resultCode could change nullability.

    agg_value_12 = agg_value_13 + agg_value_14;

  }
  boolean agg_isNull_15 = false;
  long agg_value_17 = -1L;
  if (!false && agg_exprIsNull_2_0) {
    boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1);
    long agg_value_20 = agg_isNull_18 ?
      -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
    agg_isNull_15 = agg_isNull_18;
    agg_value_17 = agg_value_20;
  } else {
    boolean agg_isNull_19 = true;
    long agg_value_21 = -1L;
    boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1);
    long agg_value_22 = agg_isNull_20 ?
      -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
    if (!agg_isNull_20) {
      agg_isNull_19 = false; // resultCode could change nullability.

      agg_value_21 = agg_value_22 + 1L;

    }
    agg_isNull_15 = agg_isNull_19;
    agg_value_17 = agg_value_21;
  }
  // update unsafe row buffer
  if (!agg_isNull_10) {
    agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12);
  } else {
    agg_unsafeRowAggBuffer_0.setNullAt(0);
  }

  if (!agg_isNull_15) {
    agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17);
  } else {
    agg_unsafeRowAggBuffer_0.setNullAt(1);
  }
}
...
```
### Why are the changes needed? For high performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27019 from maropu/AggregateFilterCodegen. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:44:16 -08:00
ulysses-you	9c30116fb4	[SPARK-33857][SQL] Unify the default seed of random functions ### What changes were proposed in this pull request? Unify the seed of random functions 1. Add a placeholder expression `UnresolvedSeed` as the default seed. 2. Change the default seed of `Rand`, `Randn`, `Uuid`, `Shuffle` to `UnresolvedSeed`. 3. Replace `UnresolvedSeed` with a real seed in the `ResolveRandomSeed` rule. ### Why are the changes needed? `Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if the user doesn't give a seed value. `Rand` and `Randn` do this at construction time. It's better to unify the default seed on the Analyzer side since we already use `ExpressionWithRandomSeed` in streaming queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests and add a new test. Closes #30864 from ulysses-you/SPARK-33857. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:30:34 -08:00
Kent Yao	29cca68e9e	[SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE ### What changes were proposed in this pull request? Display char/varchar in - DESC table - DESC column - SHOW CREATE TABLE ### Why are the changes needed? To show the correct definition to users. ### Does this PR introduce _any_ user-facing change? Yes, char/varchar columns will print char/varchar instead of string. ### How was this patch tested? new tests Closes #30908 from yaooqinn/SPARK-33892. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:56:02 +00:00
Max Gekk	54a67842e6	[SPARK-33881][SQL][TESTS] Check null and empty string as partition values in DS v1 and v2 tests ### What changes were proposed in this pull request? Add tests to check handling `null` and `''` (empty string) as partition values in commands `SHOW PARTITIONS`, `ALTER TABLE .. ADD PARTITION`, `ALTER TABLE .. DROP PARTITION`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite" ``` Closes #30893 from MaxGekk/partition-value-empty-string. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:54:53 +00:00
gengjiaan	3e9821edfd	[SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS \| RESPECT NULLS ] ### What changes were proposed in this pull request? Mainstream databases support `[ IGNORE NULLS \| RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`, but the current implementation of `LEAD`/`LAG` does not support this syntax. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77 Presto https://prestodb.io/docs/current/functions/window.html Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm Teradata https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w Snowflake https://docs.snowflake.com/en/sql-reference/functions/lead.html https://docs.snowflake.com/en/sql-reference/functions/lag.html ### Why are the changes needed? Supporting `[ IGNORE NULLS \| RESPECT NULLS ]` for `LEAD`/`LAG` is very useful. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? Jenkins test. Closes #30387 from beliefer/SPARK-33443. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:13:48 +00:00
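The difference between the two clauses in the SPARK-33443 entry above can be sketched in plain Python, outside Spark. This is only an illustration of the usual reading of `IGNORE NULLS` (null rows are skipped when counting back `offset` rows); the function name `lag` and its parameters are hypothetical, a single unpartitioned window is assumed, and `offset` is assumed to be at least 1.

```python
from typing import List, Optional


def lag(values: List[Optional[int]], offset: int = 1,
        ignore_nulls: bool = False,
        default: Optional[int] = None) -> List[Optional[int]]:
    """Toy LAG over one ordered window (assumes offset >= 1)."""
    result = []
    for i in range(len(values)):
        preceding = values[:i]
        if ignore_nulls:
            # IGNORE NULLS: null rows do not count toward the offset.
            preceding = [v for v in preceding if v is not None]
        # Take the offset-th row counting backwards, or the default
        # when there are not enough preceding rows.
        result.append(preceding[-offset] if len(preceding) >= offset else default)
    return result


rows = [1, None, 3, None, 5]
print(lag(rows))                     # RESPECT NULLS (the default behaviour)
print(lag(rows, ignore_nulls=True))  # IGNORE NULLS
```

With `RESPECT NULLS` a null in the previous row is returned as-is; with `IGNORE NULLS` the sketch looks further back until it finds a non-null value.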
Yuming Wang	32d4a2b062	[SPARK-33861][SQL] Simplify conditional in predicate ### What changes were proposed in this pull request? This PR simplifies conditionals in predicates; after this change, the filter can be pushed down to the data source: Expression \| After simplify -- \| -- IF(cond, trueVal, false) \| AND(cond, trueVal) IF(cond, trueVal, true) \| OR(NOT(cond), trueVal) IF(cond, false, falseVal) \| AND(NOT(cond), falseVal) IF(cond, true, falseVal) \| OR(cond, falseVal) CASE WHEN cond THEN trueVal ELSE false END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal ELSE null END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal ELSE true END \| OR(NOT(cond), trueVal) CASE WHEN cond THEN false ELSE elseVal END \| AND(NOT(cond), elseVal) CASE WHEN cond THEN false END \| false CASE WHEN cond THEN true ELSE elseVal END \| OR(cond, elseVal) CASE WHEN cond THEN true END \| cond ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30865 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:10:28 +00:00
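The four `IF` rewrites listed in the SPARK-33861 entry above can be checked by brute force. The following is a minimal Python sketch, not Spark code: it assumes `cond` and the value operand are plain non-NULL booleans, so SQL three-valued logic is deliberately not modelled, and the names `REWRITES` and `check_rewrites` are made up for illustration.

```python
from itertools import product

# Each entry: (description, original expression, simplified expression).
# Both sides are modelled as functions of two plain Python booleans.
REWRITES = [
    ("IF(cond, v, false) -> AND(cond, v)",
     lambda c, v: v if c else False, lambda c, v: c and v),
    ("IF(cond, v, true) -> OR(NOT(cond), v)",
     lambda c, v: v if c else True, lambda c, v: (not c) or v),
    ("IF(cond, false, v) -> AND(NOT(cond), v)",
     lambda c, v: False if c else v, lambda c, v: (not c) and v),
    ("IF(cond, true, v) -> OR(cond, v)",
     lambda c, v: True if c else v, lambda c, v: c or v),
]


def check_rewrites():
    """Return the descriptions of rewrites that disagree on some input."""
    failures = []
    for name, original, simplified in REWRITES:
        for c, v in product([False, True], repeat=2):
            if original(c, v) != simplified(c, v):
                failures.append(name)
                break
    return failures


print(check_rewrites())  # an empty list means every rewrite agreed
```

Under these assumptions every rewrite agrees with its original on all four input combinations. A nullable `cond` would need a separate three-valued check, which this sketch does not attempt.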
Kent Yao	d7dc42d5f6	[SPARK-33895][SQL] Char and Varchar fail in MetaOperation of ThriftServer ### What changes were proposed in this pull request? ``` Caused by: java.lang.IllegalArgumentException: Unrecognized type name: CHAR(10) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:187) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:203) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:195) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4(SparkGetColumnsOperation.scala:99) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4$adapted(SparkGetColumnsOperation.scala:98) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ``` meta operation is targeting raw table schema, we need to handle these types there. ### Why are the changes needed? bugfix, see the above case ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests locally ![image](https://user-images.githubusercontent.com/8326978/103069196-cdfcc480-45f9-11eb-9c6a-d4c42123c6e3.png) Closes #30914 from yaooqinn/SPARK-33895. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 07:40:38 +00:00
Terry Kim	f1d3797291	[SPARK-33886][SQL] UnresolvedTable should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect: ``` scala> sql("MSCK REPAIR TABLE unknown") org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be 18. This PR proposes to fix this issue for commands using `UnresolvedTable`: ``` MSCK REPAIR TABLE t LOAD DATA LOCAL INPATH 'filepath' INTO TABLE t TRUNCATE TABLE t SHOW PARTITIONS t ALTER TABLE t RECOVER PARTITIONS ALTER TABLE t ADD PARTITION (p=1) ALTER TABLE t PARTITION (p=1) RENAME TO PARTITION (p=2) ALTER TABLE t DROP PARTITION (p=1) ALTER TABLE t SET SERDEPROPERTIES ('a'='b') COMMENT ON TABLE t IS 'hello'" ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 18; ``` ### How was this patch tested? Add a new suite of tests. Closes #30900 from imback82/position_Fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 05:21:39 +00:00
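The `pos 18` in the SPARK-33886 entry above is the zero-based character offset of the identifier within the SQL text. A minimal Python sketch of that arithmetic follows; the helper `identifier_pos` is hypothetical and uses a plain substring search as a stand-in for however the parser records positions.

```python
def identifier_pos(sql_text: str, identifier: str) -> int:
    """Zero-based offset of the first occurrence of `identifier`.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    return sql_text.index(identifier)


pos = identifier_pos("MSCK REPAIR TABLE unknown", "unknown")
print(f"line 1 pos {pos}")  # prints "line 1 pos 18"
```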
Yuanjian Li	86c1cfc579	[SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API ### What changes were proposed in this pull request? Follow up work for #30521, document the following behaviors in the API doc: - Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table. - Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created. ### Why are the changes needed? We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Document only. Closes #30885 from xuanyuanking/SPARK-33659. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-24 12:44:37 +09:00
Takuya UESHIN	5c9b421c37	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? This is a retry of #30177. This is not a complete fix, but it would take long time to complete (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases. As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-23 14:48:01 -08:00
Yuming Wang	7ffcfcf7db	[SPARK-33847][SQL] Simplify CaseWhen if elseValue is None ### What changes were proposed in this pull request? 1. Enhance `ReplaceNullWithFalseInPredicate` to replace None of elseValue inside `CaseWhen` with `FalseLiteral` if all branches are `FalseLiteral` . The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN 'a' WHEN id = 3 THEN 'b' end) = 'c'; ``` Before this pr: ``` == Physical Plan == (1) Filter CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` 2. Enhance `SimplifyConditionals` if elseValue is None and all outputs are null. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30852 from wangyum/SPARK-33847. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 14:35:46 +00:00
Max Gekk	303df64b46	[SPARK-33889][SQL] Fix NPE from `SHOW PARTITIONS` on V2 tables ### What changes were proposed in this pull request? At `ShowPartitionsExec.run()`, check that a row returned by `listPartitionIdentifiers()` contains a `null` field, and convert it to `"null"`. ### Why are the changes needed? Because `SHOW PARTITIONS` throws NPE on V2 table with `null` partition values. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added new UT to `v2.ShowPartitionsSuite`. Closes #30904 from MaxGekk/fix-npe-show-partitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 14:34:01 +00:00
Max Gekk	cc23581e26	[SPARK-33858][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. RENAME PARTITION` parsing tests to `AlterTableRenamePartitionParserSuite` 2. Move the v1 tests for `ALTER TABLE .. RENAME PARTITION` from `DDLSuite` to `v1.AlterTableRenamePartitionSuite` and the v2 tests from `AlterTablePartitionV2SQLSuite` to `v2.AlterTableRenamePartitionSuite`, so that the tests run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification allows running common `ALTER TABLE .. RENAME PARTITION` tests for DSv1, Hive DSv1 and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #30863 from MaxGekk/unify-rename-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 12:19:07 +00:00
ulysses-you	f421c172d9	[SPARK-33497][SQL] Override maxRows in some LogicalPlan ### What changes were proposed in this pull request? This PR aims to override maxRows method in these follow `LogicalPlan`: * `ReturnAnswer` * `Join` * `Range` * `Sample` * `RepartitionOperation` * `Deduplicate` * `LocalRelation` * `Window` ### Why are the changes needed? 1. Logically, we know the max rows info with these `LogicalPlan`. 2. Before this PR, we already have some max rows with `LogicalPlan`, so we can eliminate limit with more case if we expand more. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30443 from ulysses-you/SPARK-33497. Lead-authored-by: ulysses-you <youxiduo@weidian.com> Co-authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:20:49 +00:00
Max Gekk	34bfb3a31d	[SPARK-33787][SQL] Allow partition purge for v2 tables ### What changes were proposed in this pull request? 1. Add new methods `purgePartition()`/`purgePartitions()` to the interfaces `SupportsPartitionManagement`/`SupportsAtomicPartitionManagement`. 2. Default implementation of new methods throw the exception `UnsupportedOperationException`. 3. Add tests for new methods to `SupportsPartitionManagementSuite`/`SupportsAtomicPartitionManagementSuite`. 4. Add `ALTER TABLE .. DROP PARTITION` tests for DS v1 and v2. Closes #30776 Closes #30821 ### Why are the changes needed? Currently, the `PURGE` option that user can set in `ALTER TABLE .. DROP PARTITION` is completely ignored. We should pass this flag to the catalog implementation, so, the catalog should decide how to handle the flag. ### Does this PR introduce _any_ user-facing change? The changes can impact on behavior of `ALTER TABLE .. DROP PARTITION` for v2 tables. ### How was this patch tested? By running the affected test suites, for instance: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30886 from MaxGekk/purge-partition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:09:48 +00:00
Kent Yao	2287f56a3e	[SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns ### What changes were proposed in this pull request? ```sql spark-sql> select * from t10 where c0='abcd'; 20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd'] scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815) at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:102) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156) ``` c0 is a partition column, and the query fails in the partition pruning rule. In this PR, we replace char/varchar with string type before the CAST happens. ### Why are the changes needed? bugfix, see the case above ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? yes, new tests Closes #30887 from yaooqinn/SPARK-33879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 16:14:27 +09:00
ulysses-you	e853f068f6	[SPARK-33526][SQL][FOLLOWUP] Fix flaky test due to timeout and fix docs ### What changes were proposed in this pull request? Make test stable and fix docs. ### Why are the changes needed? The query sometimes times out because we set another config after setting the query timeout. ``` sbt.ForkMain$ForkError: java.sql.SQLTimeoutException: Query timed out after 0 seconds at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13(ThriftServerWithSparkContextSuite.scala:107) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13$adapted(ThriftServerWithSparkContextSuite.scala:106) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12(ThriftServerWithSparkContextSuite.scala:106) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12$adapted(ThriftServerWithSparkContextSuite.scala:89) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4(SharedThriftServer.scala:95) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4$adapted(SharedThriftServer.scala:95) ``` The reason is: 1. We execute `set spark.sql.thriftServer.queryTimeout = 1`, after which every operation is limited to 1s. 2. We execute `set spark.sql.thriftServer.interruptOnCancel = false/true`. This statement gets a timeout exception if anything hangs for longer than 1s, which is not what we expect. Resetting the timeout before step 2 avoids this problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fix test. Closes #30897 from ulysses-you/SPARK-33526-followup. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 22:43:03 -08:00
Takeshi Yamamuro	ea37717f7c	[SPARK-32106][SQL][FOLLOWUP] Fix flaky tests in transform.sql ### What changes were proposed in this pull request? This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests): - https://github.com/apache/spark/runs/1592987501 - https://github.com/apache/spark/runs/1593196242 - https://github.com/apache/spark/runs/1595496305 - https://github.com/apache/spark/runs/1596309555 This is because the error message is different between test runs in GA (the error message seems to be truncated indeterministically) ,e.g., ``` # https://github.com/apache/spark/runs/1592987501 Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2 # https://github.com/apache/spark/runs/1593196242 Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2 ``` The root cause of this indeterministic behaviour happening only in GA is not clear though, this test throws SparkException consistently even in GA. So, this PR proposes to make the test just check if it will be thrown when running it. This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30896 from maropu/SPARK-32106-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 13:50:05 +09:00
Wenchen Fan	ec1560af25	[SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30267 Inspired by https://github.com/apache/spark/pull/30886, it's better to have 2 methods `def dropTable` and `def purgeTable`, than `def dropTable(ident)` and `def dropTable(ident, purge)`. ### Why are the changes needed? 1. make the APIs orthogonal. Previously, `def dropTable(ident, purge)` calls `def dropTable(ident)` and is a superset. 2. simplifies the catalog implementation a little bit. Now the `if (purge) ... else ...` check is done at the Spark side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? existing tests Closes #30890 from cloud-fan/purgeTable. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 11:47:13 +09:00
Erik Krogen	303b8c8773	[SPARK-23862][SQL] Support Java enums from Scala Dataset API ### What changes were proposed in this pull request? Add support for Java Enums (`java.lang.Enum`) from the Scala typed Dataset APIs. This involves adding an implicit for `Encoder` creation in `SQLImplicits`, and updating `ScalaReflection` to handle Java Enums on the serialization and deserialization pathways. Enums are mapped to a `StringType` which is just the name of the Enum value. ### Why are the changes needed? In [SPARK-21255](https://issues.apache.org/jira/browse/SPARK-21255), support for (de)serialization of Java Enums was added, but only when called from Java code. It is common for Scala code to rely on Java libraries that are out of control of the Scala developer. Today, if there is a dependency on some Java code which defines an Enum, it would be necessary to define a corresponding Scala class. This change brings closer feature parity between Scala and Java APIs. ### Does this PR introduce _any_ user-facing change? Yes, previously something like: ``` val ds = Seq(MyJavaEnum.VALUE1, MyJavaEnum.VALUE2).toDS // or val ds = Seq(CaseClass(MyJavaEnum.VALUE1), CaseClass(MyJavaEnum.VALUE2)).toDS ``` would fail. Now, it will succeed. ### How was this patch tested? Additional unit tests are added in `DatasetSuite`. Tests include validating top-level enums, enums inside of case classes, enums inside of arrays, and validating that the Enum is stored as the expected string. Closes #30877 from xkrogen/xkrogen-SPARK-23862-scalareflection-java-enums. Lead-authored-by: Erik Krogen <xkrogen@apache.org> Co-authored-by: Fangshi Li <fli@linkedin.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 09:55:33 -08:00
Kent Yao	6da5cdf1db	[SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location ### What changes were proposed in this pull request? This PR adds the length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over length values we should FAIL ON READ. ### Why are the changes needed? ```sql spark-sql> INSERT INTO t2 VALUES ('1', 'b12345'); Time taken: 0.141 seconds spark-sql> alter table t set location '/tmp/hive_one/t2'; Time taken: 0.095 seconds spark-sql> select * from t; 1 b1234 ``` the above case should fail rather than implicitly applying truncation ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30882 from yaooqinn/SPARK-33876. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 14:24:12 +00:00
Max Gekk	84bf07bbd7	[SPARK-33878][SQL][TESTS] Fix resolving of `spark_catalog` in v1 Hive catalog tests ### What changes were proposed in this pull request? 1. Recognize `spark_catalog` as the default session catalog in the checks of `TestHiveQueryExecution`. 2. Move v2 and v1 in-memory catalog test `"SPARK-33305: DROP TABLE should also invalidate cache"` to the common trait `command/DropTableSuiteBase`, and run it with v1 Hive external catalog. ### Why are the changes needed? To run In-memory catalog tests in Hive catalog. ### Does this PR introduce _any_ user-facing change? No, the changes influence only on tests. ### How was this patch tested? By running the affected test suites for `DROP TABLE`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite" ``` Closes #30883 from MaxGekk/fix-spark_catalog-hive-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 12:37:16 +00:00
Jacob Kim	43a562035c	[SPARK-33846][SQL] Include Comments for a nested schema in StructType.toDDL ### What changes were proposed in this pull request? ```scala val nestedStruct = new StructType() .add(StructField("b", StringType).withComment("Nested comment")) val struct = new StructType() .add(StructField("a", nestedStruct).withComment("comment")) struct.toDDL ``` Currently, returns: ``` `a` STRUCT<`b`: STRING> COMMENT 'comment'` ``` With this PR, the code above returns: ``` `a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment'` ``` ### Why are the changes needed? My team is using nested columns as first citizens, and I thought it would be nice to have comments for nested columns. ### Does this PR introduce _any_ user-facing change? Now, when users call something like this, ```scala spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ") ``` they will get comments for the nested columns. ### How was this patch tested? I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They test if nested StructType's comment is included in the DDL string. Closes #30851 from jacobhjkim/structtype-toddl. Authored-by: Jacob Kim <me@jacobkim.io> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 17:55:16 +09:00
Anton Okolnychyi	7bbcbb84c2	[SPARK-33784][SQL] Rename dataSourceRewriteRules batch ### What changes were proposed in this pull request? This PR tries to rename `dataSourceRewriteRules` into something more generic. ### Why are the changes needed? These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837). ### Does this PR introduce _any_ user-facing change? Yes but the changes haven't been released yet. ### How was this patch tested? Existing tests. Closes #30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:29:22 +00:00
Anton Okolnychyi	2562183987	[SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer ### What changes were proposed in this pull request? This PR adds logic to build logical writes introduced in SPARK-33779. Note: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30806 from aokolnychyi/spark-33808. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:23:56 +00:00
ulysses-you	1dd63dccd8	[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value ### What changes were proposed in this pull request? Add some case to match Array whose element type is primitive. ### Why are the changes needed? We will get exception when use `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))` . ``` Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215) at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140) ``` And same problem with other array whose element is primitive. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #30868 from ulysses-you/SPARK-33860. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 15:10:46 +09:00
yangjie01	b88745565b	[SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Orc ### What changes were proposed in this pull request? ORC supports filter pushdown, but this optimization reads file metadata from external storage even when `filters` is empty. This PR adds an extra `filters.nonEmpty` check when `spark.sql.orc.filterPushdown` is true. ### Why are the changes needed? The ORC filter pushdown operation should only be triggered when `filters.nonEmpty` is true. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 20:24:23 -08:00
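The guard described in SPARK-33700 can be illustrated with a small Python model. This is only a sketch of the idea, not Spark's ORC reader: `plan_orc_read` and `read_file_meta` are hypothetical names, and the returned dictionary stands in for whatever the real reader builds from the file metadata.

```python
from typing import Callable, List, Optional


def plan_orc_read(
    pushdown_enabled: bool,
    filters: List[str],
    read_file_meta: Callable[[], dict],
) -> Optional[dict]:
    """Return pushdown info, touching file metadata only when it is needed.

    `read_file_meta` is a hypothetical stand-in for the expensive call that
    fetches ORC file metadata from external storage.
    """
    # The extra non-empty check is the point of the change: with no filters
    # there is nothing to push down, so the metadata read is skipped.
    if pushdown_enabled and len(filters) > 0:
        meta = read_file_meta()
        return {"schema": meta["schema"], "pushed": list(filters)}
    return None
```

With the flag on and an empty filter list, the function returns `None` without calling `read_file_meta` at all, which is the saving the commit message describes.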
Kent Yao	f5fd10b1bc	[SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar ### What changes were proposed in this pull request? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. For v1 tables, changing the type is not allowed; we fix a regression that uses the replaced string instead of the original char/varchar type when altering char/varchar columns. For v2 tables, char/varchar to string, char(x) to char(x), and char(x)/varchar(x) to varchar(y) if x <= y are valid cases; other changes are invalid. ### Why are the changes needed? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test. Closes #30833 from yaooqinn/SPARK-33834. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 03:07:26 +00:00
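The v2 rules listed in SPARK-33834 can be written down as a small predicate. The Python sketch below only restates the valid cases from the commit message; `parse_type` and `is_valid_v2_change` are hypothetical helpers, not part of Spark, and the type strings are assumed to look like `char(5)`, `varchar(10)`, or `string`.

```python
import re
from typing import Optional, Tuple


def parse_type(type_str: str) -> Tuple[str, Optional[int]]:
    """Split 'char(5)' / 'varchar(10)' into (kind, length); others get no length."""
    text = type_str.strip().lower()
    match = re.fullmatch(r"(char|varchar)\((\d+)\)", text)
    if match:
        return match.group(1), int(match.group(2))
    return text, None


def is_valid_v2_change(old: str, new: str) -> bool:
    """Apply the v2 rules from the commit message to a char/varchar column."""
    old_kind, old_len = parse_type(old)
    new_kind, new_len = parse_type(new)
    if old_kind not in ("char", "varchar") or old_len is None:
        return False  # this sketch only models char/varchar source columns
    if new_kind == "string":
        return True  # char/varchar -> string
    if new_len is None:
        return False  # any other target needs an explicit length
    if old_kind == "char" and new_kind == "char":
        return old_len == new_len  # char(x) -> char(x)
    if new_kind == "varchar":
        return old_len <= new_len  # char(x)/varchar(x) -> varchar(y) if x <= y
    return False  # everything else, e.g. varchar(x) -> char(y)
```

For instance, `is_valid_v2_change("char(5)", "varchar(8)")` is accepted, while `is_valid_v2_change("varchar(8)", "varchar(5)")` is rejected because it would shrink the column.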
angerszhu	7466031632	[SPARK-32106][SQL] Implement script transform in sql/core ### What changes were proposed in this pull request? * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec` * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` for writing data * Add rule `SparkScripts` to support converting a script LogicalPlan to a SparkPlan in Spark SQL (without Hive mode) * Add `SparkScriptTransformationSuite` to test Spark-specific cases * Add tests in `SQLQueryTestSuite` And we will close #29085 . ### Why are the changes needed? To let users use Script Transform without Hive. ### Does this PR introduce _any_ user-facing change? Users can use Script Transformation without Hive in no-serde mode. For example, default no-serde: ``` SELECT TRANSFORM(a, b, c) USING 'cat' AS (a int, b string, c long) FROM testData ``` No-serde with a specified ROW FORMAT DELIMITED: ``` SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0002' MAP KEYS TERMINATED BY '\u0003' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0004' MAP KEYS TERMINATED BY '\u0005' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData ``` ### How was this patch tested? Added UT Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-22 11:37:59 +09:00
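To make the no-serde mode in SPARK-32106 more concrete, here is a rough Python model of what a delimited script transform does: rows are written to a child process as delimiter-separated text, and the process output is split back into typed columns. Everything here is an assumption for illustration (`run_script_transform`, `CAT`, and the parameter names are hypothetical, and the defaults simply mirror the `'\t'`, `'\n'`, and `'null'` values from the second query above); it is not how `SparkScriptTransformationExec` is implemented.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence, Tuple

# Portable stand-in for the 'cat' script used in the queries above.
CAT = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def run_script_transform(
    rows: Sequence[Sequence[object]],
    command: Sequence[str],
    output_casts: Sequence[Callable[[str], object]],
    field_sep: str = "\t",
    line_sep: str = "\n",
    null_text: str = "null",
) -> List[Tuple[Optional[object], ...]]:
    """Pipe rows through `command` as delimited text and parse the output."""

    def encode(value: object) -> str:
        return null_text if value is None else str(value)

    # Serialize: one line per row, fields joined by the field separator.
    payload = "".join(
        field_sep.join(encode(v) for v in row) + line_sep for row in rows
    )
    completed = subprocess.run(
        list(command), input=payload, capture_output=True, text=True, check=True
    )
    # Deserialize: split lines and fields, then cast to the declared types.
    parsed = []
    for line in completed.stdout.split(line_sep):
        if line == "":
            continue  # skip the empty tail after the final line separator
        fields = line.split(field_sep)
        parsed.append(
            tuple(
                None if field == null_text else cast(field)
                for field, cast in zip(fields, output_casts)
            )
        )
    return parsed
```

Calling `run_script_transform([(1, "a", 10)], CAT, (int, str, int))` round-trips the row through the child process, loosely analogous to `AS (a int, b string, c long)` in the first query.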
Yuming Wang	1c77605682	[SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches ### What changes were proposed in this pull request? This PR pushes the `UnaryExpression` into (if / case) branches. The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3; ``` Before this PR: ``` == Physical Plan == (1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3) +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this PR: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` This change can also improve this case: `a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30853 from wangyum/SPARK-33848. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 10:25:23 -08:00
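The rewrite in SPARK-33848 can be pictured on a toy expression tree. The Python sketch below is an illustration only: `Literal`, `CaseWhen`, `Unary`, `push_unary_into_case`, and `fold_constants` are hypothetical names, not Catalyst classes, and the exact conditions under which Spark applies its rule are not stated in the commit message. It assumes a null-preserving unary operator, so a missing ELSE stays missing.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class CaseWhen:
    branches: Tuple[Tuple[str, object], ...]  # (condition text, value expression)
    else_value: Optional[object] = None


@dataclass(frozen=True)
class Unary:
    name: str
    fn: Callable[[object], object]
    child: object


def push_unary_into_case(expr: object) -> object:
    """Rewrite op(CASE ... THEN v ...) into CASE ... THEN op(v) ..."""
    if isinstance(expr, Unary) and isinstance(expr.child, CaseWhen):
        case = expr.child
        branches = tuple(
            (cond, Unary(expr.name, expr.fn, value)) for cond, value in case.branches
        )
        else_value = (
            None if case.else_value is None
            else Unary(expr.name, expr.fn, case.else_value)
        )
        return CaseWhen(branches, else_value)
    return expr


def fold_constants(expr: object) -> object:
    """Evaluate unary operators whose child is a literal."""
    if isinstance(expr, Unary):
        child = fold_constants(expr.child)
        if isinstance(child, Literal):
            return Literal(expr.fn(child.value))
        return Unary(expr.name, expr.fn, child)
    if isinstance(expr, CaseWhen):
        branches = tuple((cond, fold_constants(v)) for cond, v in expr.branches)
        else_value = (
            None if expr.else_value is None else fold_constants(expr.else_value)
        )
        return CaseWhen(branches, else_value)
    return expr
```

Applied to `Unary("cast_int", int, CaseWhen((("id = 1", Literal("1")), ("id = 3", Literal("2")))))`, pushing and then folding leaves only the integer literals 1 and 2 in the branches. Once the branch values are plain constants, a comparison such as `> 3` can be decided without scanning data, which is consistent with the empty `LocalTableScan` in the "after" plan.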
Max Gekk	661ac10901	[SPARK-33838][SQL][DOCS] Comment the `PURGE` option in the DropTable and in AlterTableDropPartition commands ### What changes were proposed in this pull request? Add comments for the `PURGE` option to the logical nodes `DropTable` and `AlterTableDropPartition`. ### Why are the changes needed? To improve code maintenance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle` Closes #30837 from MaxGekk/comment-purge-logical-node. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 14:06:31 +00:00
Takeshi Yamamuro	69aa727ff4	[SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions ### What changes were proposed in this pull request? This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions. New groups below are added in this PR: - binary_funcs - bitwise_funcs - collection_funcs - predicate_funcs - conditional_funcs - conversion_funcs - csv_funcs - generator_funcs - hash_funcs - lambda_funcs - math_funcs - misc_funcs - string_funcs - struct_funcs - xml_funcs A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in `hash.scala` are categorized into `hash_funcs`. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes: - All the aggregate functions are categorized into `agg_funcs`. - `array_funcs` and `map_funcs` are sub-groups of `collection_funcs`. For example, `array_contains` is used only for arrays, so it is assigned to `array_funcs`. On the other hand, `reverse` is used for both arrays and strings, so it is assigned to `collection_funcs`. - Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, `schema_of_csv` can be grouped into both `csv_funcs` and `struct_funcs` in terms of input types, but it is assigned to `csv_funcs` because it belongs to the `csvExpressions.scala` file that holds the other CSV-related functions. - Functions in `nullExpressions.scala`, `complexTypeCreator.scala`, `randomExpressions.scala`, and `regexExpressions.scala` are categorized based on their functionalities. For example: - `isnull` in `nullExpressions` is assigned to `predicate_funcs` because this is a predicate function. 
- `array` in `complexTypeCreator.scala` is assigned to `array_funcs` based on its output type (The other functions in `array_funcs` are categorized based on their input types though). A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master): \|group\|name\|class\| \|-----\|----\|-----\| \|agg_funcs\|any\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|approx_count_distinct\|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus\| \|agg_funcs\|approx_percentile\|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile\| \|agg_funcs\|avg\|org.apache.spark.sql.catalyst.expressions.aggregate.Average\| \|agg_funcs\|bit_and\|org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg\| \|agg_funcs\|bit_or\|org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg\| \|agg_funcs\|bit_xor\|org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg\| \|agg_funcs\|bool_and\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd\| \|agg_funcs\|bool_or\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|collect_list\|org.apache.spark.sql.catalyst.expressions.aggregate.CollectList\| \|agg_funcs\|collect_set\|org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet\| \|agg_funcs\|corr\|org.apache.spark.sql.catalyst.expressions.aggregate.Corr\| \|agg_funcs\|count_if\|org.apache.spark.sql.catalyst.expressions.aggregate.CountIf\| \|agg_funcs\|count_min_sketch\|org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg\| \|agg_funcs\|count\|org.apache.spark.sql.catalyst.expressions.aggregate.Count\| \|agg_funcs\|covar_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation\| \|agg_funcs\|covar_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.CovSample\| \|agg_funcs\|cube\|org.apache.spark.sql.catalyst.expressions.Cube\| 
\|agg_funcs\|every\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd\| \|agg_funcs\|first_value\|org.apache.spark.sql.catalyst.expressions.aggregate.First\| \|agg_funcs\|first\|org.apache.spark.sql.catalyst.expressions.aggregate.First\| \|agg_funcs\|grouping_id\|org.apache.spark.sql.catalyst.expressions.GroupingID\| \|agg_funcs\|grouping\|org.apache.spark.sql.catalyst.expressions.Grouping\| \|agg_funcs\|kurtosis\|org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis\| \|agg_funcs\|last_value\|org.apache.spark.sql.catalyst.expressions.aggregate.Last\| \|agg_funcs\|last\|org.apache.spark.sql.catalyst.expressions.aggregate.Last\| \|agg_funcs\|max_by\|org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy\| \|agg_funcs\|max\|org.apache.spark.sql.catalyst.expressions.aggregate.Max\| \|agg_funcs\|mean\|org.apache.spark.sql.catalyst.expressions.aggregate.Average\| \|agg_funcs\|min_by\|org.apache.spark.sql.catalyst.expressions.aggregate.MinBy\| \|agg_funcs\|min\|org.apache.spark.sql.catalyst.expressions.aggregate.Min\| \|agg_funcs\|percentile_approx\|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile\| \|agg_funcs\|percentile\|org.apache.spark.sql.catalyst.expressions.aggregate.Percentile\| \|agg_funcs\|rollup\|org.apache.spark.sql.catalyst.expressions.Rollup\| \|agg_funcs\|skewness\|org.apache.spark.sql.catalyst.expressions.aggregate.Skewness\| \|agg_funcs\|some\|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr\| \|agg_funcs\|stddev_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop\| \|agg_funcs\|stddev_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|stddev\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|std\|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp\| \|agg_funcs\|sum\|org.apache.spark.sql.catalyst.expressions.aggregate.Sum\| \|agg_funcs\|var_pop\|org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop\| 
\|agg_funcs\|var_samp\|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp\| \|agg_funcs\|variance\|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp\| \|array_funcs\|array_contains\|org.apache.spark.sql.catalyst.expressions.ArrayContains\| \|array_funcs\|array_distinct\|org.apache.spark.sql.catalyst.expressions.ArrayDistinct\| \|array_funcs\|array_except\|org.apache.spark.sql.catalyst.expressions.ArrayExcept\| \|array_funcs\|array_intersect\|org.apache.spark.sql.catalyst.expressions.ArrayIntersect\| \|array_funcs\|array_join\|org.apache.spark.sql.catalyst.expressions.ArrayJoin\| \|array_funcs\|array_max\|org.apache.spark.sql.catalyst.expressions.ArrayMax\| \|array_funcs\|array_min\|org.apache.spark.sql.catalyst.expressions.ArrayMin\| \|array_funcs\|array_position\|org.apache.spark.sql.catalyst.expressions.ArrayPosition\| \|array_funcs\|array_remove\|org.apache.spark.sql.catalyst.expressions.ArrayRemove\| \|array_funcs\|array_repeat\|org.apache.spark.sql.catalyst.expressions.ArrayRepeat\| \|array_funcs\|array_union\|org.apache.spark.sql.catalyst.expressions.ArrayUnion\| \|array_funcs\|arrays_overlap\|org.apache.spark.sql.catalyst.expressions.ArraysOverlap\| \|array_funcs\|arrays_zip\|org.apache.spark.sql.catalyst.expressions.ArraysZip\| \|array_funcs\|array\|org.apache.spark.sql.catalyst.expressions.CreateArray\| \|array_funcs\|flatten\|org.apache.spark.sql.catalyst.expressions.Flatten\| \|array_funcs\|sequence\|org.apache.spark.sql.catalyst.expressions.Sequence\| \|array_funcs\|shuffle\|org.apache.spark.sql.catalyst.expressions.Shuffle\| \|array_funcs\|slice\|org.apache.spark.sql.catalyst.expressions.Slice\| \|array_funcs\|sort_array\|org.apache.spark.sql.catalyst.expressions.SortArray\| \|bitwise_funcs\|&\|org.apache.spark.sql.catalyst.expressions.BitwiseAnd\| \|bitwise_funcs\|^\|org.apache.spark.sql.catalyst.expressions.BitwiseXor\| \|bitwise_funcs\|bit_count\|org.apache.spark.sql.catalyst.expressions.BitwiseCount\| 
\|bitwise_funcs\|shiftrightunsigned\|org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned\| \|bitwise_funcs\|shiftright\|org.apache.spark.sql.catalyst.expressions.ShiftRight\| \|bitwise_funcs\|~\|org.apache.spark.sql.catalyst.expressions.BitwiseNot\| \|collection_funcs\|cardinality\|org.apache.spark.sql.catalyst.expressions.Size\| \|collection_funcs\|concat\|org.apache.spark.sql.catalyst.expressions.Concat\| \|collection_funcs\|reverse\|org.apache.spark.sql.catalyst.expressions.Reverse\| \|collection_funcs\|size\|org.apache.spark.sql.catalyst.expressions.Size\| \|conditional_funcs\|coalesce\|org.apache.spark.sql.catalyst.expressions.Coalesce\| \|conditional_funcs\|ifnull\|org.apache.spark.sql.catalyst.expressions.IfNull\| \|conditional_funcs\|if\|org.apache.spark.sql.catalyst.expressions.If\| \|conditional_funcs\|nanvl\|org.apache.spark.sql.catalyst.expressions.NaNvl\| \|conditional_funcs\|nullif\|org.apache.spark.sql.catalyst.expressions.NullIf\| \|conditional_funcs\|nvl2\|org.apache.spark.sql.catalyst.expressions.Nvl2\| \|conditional_funcs\|nvl\|org.apache.spark.sql.catalyst.expressions.Nvl\| \|conditional_funcs\|when\|org.apache.spark.sql.catalyst.expressions.CaseWhen\| \|conversion_funcs\|bigint\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|binary\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|boolean\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|cast\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|date\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|decimal\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|double\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|float\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|int\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|smallint\|org.apache.spark.sql.catalyst.expressions.Cast\| 
\|conversion_funcs\|string\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|timestamp\|org.apache.spark.sql.catalyst.expressions.Cast\| \|conversion_funcs\|tinyint\|org.apache.spark.sql.catalyst.expressions.Cast\| \|csv_funcs\|from_csv\|org.apache.spark.sql.catalyst.expressions.CsvToStructs\| \|csv_funcs\|schema_of_csv\|org.apache.spark.sql.catalyst.expressions.SchemaOfCsv\| \|csv_funcs\|to_csv\|org.apache.spark.sql.catalyst.expressions.StructsToCsv\| \|datetime_funcs\|add_months\|org.apache.spark.sql.catalyst.expressions.AddMonths\| \|datetime_funcs\|current_date\|org.apache.spark.sql.catalyst.expressions.CurrentDate\| \|datetime_funcs\|current_timestamp\|org.apache.spark.sql.catalyst.expressions.CurrentTimestamp\| \|datetime_funcs\|current_timezone\|org.apache.spark.sql.catalyst.expressions.CurrentTimeZone\| \|datetime_funcs\|date_add\|org.apache.spark.sql.catalyst.expressions.DateAdd\| \|datetime_funcs\|date_format\|org.apache.spark.sql.catalyst.expressions.DateFormatClass\| \|datetime_funcs\|date_from_unix_date\|org.apache.spark.sql.catalyst.expressions.DateFromUnixDate\| \|datetime_funcs\|date_part\|org.apache.spark.sql.catalyst.expressions.DatePart\| \|datetime_funcs\|date_sub\|org.apache.spark.sql.catalyst.expressions.DateSub\| \|datetime_funcs\|date_trunc\|org.apache.spark.sql.catalyst.expressions.TruncTimestamp\| \|datetime_funcs\|datediff\|org.apache.spark.sql.catalyst.expressions.DateDiff\| \|datetime_funcs\|dayofmonth\|org.apache.spark.sql.catalyst.expressions.DayOfMonth\| \|datetime_funcs\|dayofweek\|org.apache.spark.sql.catalyst.expressions.DayOfWeek\| \|datetime_funcs\|dayofyear\|org.apache.spark.sql.catalyst.expressions.DayOfYear\| \|datetime_funcs\|day\|org.apache.spark.sql.catalyst.expressions.DayOfMonth\| \|datetime_funcs\|extract\|org.apache.spark.sql.catalyst.expressions.Extract\| \|datetime_funcs\|from_unixtime\|org.apache.spark.sql.catalyst.expressions.FromUnixTime\| 
\|datetime_funcs\|from_utc_timestamp\|org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp\| \|datetime_funcs\|hour\|org.apache.spark.sql.catalyst.expressions.Hour\| \|datetime_funcs\|last_day\|org.apache.spark.sql.catalyst.expressions.LastDay\| \|datetime_funcs\|make_date\|org.apache.spark.sql.catalyst.expressions.MakeDate\| \|datetime_funcs\|make_interval\|org.apache.spark.sql.catalyst.expressions.MakeInterval\| \|datetime_funcs\|make_timestamp\|org.apache.spark.sql.catalyst.expressions.MakeTimestamp\| \|datetime_funcs\|minute\|org.apache.spark.sql.catalyst.expressions.Minute\| \|datetime_funcs\|months_between\|org.apache.spark.sql.catalyst.expressions.MonthsBetween\| \|datetime_funcs\|month\|org.apache.spark.sql.catalyst.expressions.Month\| \|datetime_funcs\|next_day\|org.apache.spark.sql.catalyst.expressions.NextDay\| \|datetime_funcs\|now\|org.apache.spark.sql.catalyst.expressions.Now\| \|datetime_funcs\|quarter\|org.apache.spark.sql.catalyst.expressions.Quarter\| \|datetime_funcs\|second\|org.apache.spark.sql.catalyst.expressions.Second\| \|datetime_funcs\|timestamp_micros\|org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp\| \|datetime_funcs\|timestamp_millis\|org.apache.spark.sql.catalyst.expressions.MillisToTimestamp\| \|datetime_funcs\|timestamp_seconds\|org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp\| \|datetime_funcs\|to_date\|org.apache.spark.sql.catalyst.expressions.ParseToDate\| \|datetime_funcs\|to_timestamp\|org.apache.spark.sql.catalyst.expressions.ParseToTimestamp\| \|datetime_funcs\|to_unix_timestamp\|org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp\| \|datetime_funcs\|to_utc_timestamp\|org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp\| \|datetime_funcs\|trunc\|org.apache.spark.sql.catalyst.expressions.TruncDate\| \|datetime_funcs\|unix_date\|org.apache.spark.sql.catalyst.expressions.UnixDate\| \|datetime_funcs\|unix_micros\|org.apache.spark.sql.catalyst.expressions.UnixMicros\| 
\|datetime_funcs\|unix_millis\|org.apache.spark.sql.catalyst.expressions.UnixMillis\| \|datetime_funcs\|unix_seconds\|org.apache.spark.sql.catalyst.expressions.UnixSeconds\| \|datetime_funcs\|unix_timestamp\|org.apache.spark.sql.catalyst.expressions.UnixTimestamp\| \|datetime_funcs\|weekday\|org.apache.spark.sql.catalyst.expressions.WeekDay\| \|datetime_funcs\|weekofyear\|org.apache.spark.sql.catalyst.expressions.WeekOfYear\| \|datetime_funcs\|year\|org.apache.spark.sql.catalyst.expressions.Year\| \|generator_funcs\|explode_outer\|org.apache.spark.sql.catalyst.expressions.Explode\| \|generator_funcs\|explode\|org.apache.spark.sql.catalyst.expressions.Explode\| \|generator_funcs\|inline_outer\|org.apache.spark.sql.catalyst.expressions.Inline\| \|generator_funcs\|inline\|org.apache.spark.sql.catalyst.expressions.Inline\| \|generator_funcs\|posexplode_outer\|org.apache.spark.sql.catalyst.expressions.PosExplode\| \|generator_funcs\|posexplode\|org.apache.spark.sql.catalyst.expressions.PosExplode\| \|generator_funcs\|stack\|org.apache.spark.sql.catalyst.expressions.Stack\| \|hash_funcs\|crc32\|org.apache.spark.sql.catalyst.expressions.Crc32\| \|hash_funcs\|hash\|org.apache.spark.sql.catalyst.expressions.Murmur3Hash\| \|hash_funcs\|md5\|org.apache.spark.sql.catalyst.expressions.Md5\| \|hash_funcs\|sha1\|org.apache.spark.sql.catalyst.expressions.Sha1\| \|hash_funcs\|sha2\|org.apache.spark.sql.catalyst.expressions.Sha2\| \|hash_funcs\|sha\|org.apache.spark.sql.catalyst.expressions.Sha1\| \|hash_funcs\|xxhash64\|org.apache.spark.sql.catalyst.expressions.XxHash64\| \|json_funcs\|from_json\|org.apache.spark.sql.catalyst.expressions.JsonToStructs\| \|json_funcs\|get_json_object\|org.apache.spark.sql.catalyst.expressions.GetJsonObject\| \|json_funcs\|json_array_length\|org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray\| \|json_funcs\|json_object_keys\|org.apache.spark.sql.catalyst.expressions.JsonObjectKeys\| 
\|json_funcs\|json_tuple\|org.apache.spark.sql.catalyst.expressions.JsonTuple\| \|json_funcs\|schema_of_json\|org.apache.spark.sql.catalyst.expressions.SchemaOfJson\| \|json_funcs\|to_json\|org.apache.spark.sql.catalyst.expressions.StructsToJson\| \|lambda_funcs\|aggregate\|org.apache.spark.sql.catalyst.expressions.ArrayAggregate\| \|lambda_funcs\|array_sort\|org.apache.spark.sql.catalyst.expressions.ArraySort\| \|lambda_funcs\|exists\|org.apache.spark.sql.catalyst.expressions.ArrayExists\| \|lambda_funcs\|filter\|org.apache.spark.sql.catalyst.expressions.ArrayFilter\| \|lambda_funcs\|forall\|org.apache.spark.sql.catalyst.expressions.ArrayForAll\| \|lambda_funcs\|map_filter\|org.apache.spark.sql.catalyst.expressions.MapFilter\| \|lambda_funcs\|map_zip_with\|org.apache.spark.sql.catalyst.expressions.MapZipWith\| \|lambda_funcs\|transform_keys\|org.apache.spark.sql.catalyst.expressions.TransformKeys\| \|lambda_funcs\|transform_values\|org.apache.spark.sql.catalyst.expressions.TransformValues\| \|lambda_funcs\|transform\|org.apache.spark.sql.catalyst.expressions.ArrayTransform\| \|lambda_funcs\|zip_with\|org.apache.spark.sql.catalyst.expressions.ZipWith\| \|map_funcs\|element_at\|org.apache.spark.sql.catalyst.expressions.ElementAt\| \|map_funcs\|map_concat\|org.apache.spark.sql.catalyst.expressions.MapConcat\| \|map_funcs\|map_entries\|org.apache.spark.sql.catalyst.expressions.MapEntries\| \|map_funcs\|map_from_arrays\|org.apache.spark.sql.catalyst.expressions.MapFromArrays\| \|map_funcs\|map_from_entries\|org.apache.spark.sql.catalyst.expressions.MapFromEntries\| \|map_funcs\|map_keys\|org.apache.spark.sql.catalyst.expressions.MapKeys\| \|map_funcs\|map_values\|org.apache.spark.sql.catalyst.expressions.MapValues\| \|map_funcs\|map\|org.apache.spark.sql.catalyst.expressions.CreateMap\| \|map_funcs\|str_to_map\|org.apache.spark.sql.catalyst.expressions.StringToMap\| \|math_funcs\|%\|org.apache.spark.sql.catalyst.expressions.Remainder\| 
\|math_funcs\|*\|org.apache.spark.sql.catalyst.expressions.Multiply\| \|math_funcs\|+\|org.apache.spark.sql.catalyst.expressions.Add\| \|math_funcs\|-\|org.apache.spark.sql.catalyst.expressions.Subtract\| \|math_funcs\|/\|org.apache.spark.sql.catalyst.expressions.Divide\| \|math_funcs\|abs\|org.apache.spark.sql.catalyst.expressions.Abs\| \|math_funcs\|acosh\|org.apache.spark.sql.catalyst.expressions.Acosh\| \|math_funcs\|acos\|org.apache.spark.sql.catalyst.expressions.Acos\| \|math_funcs\|asinh\|org.apache.spark.sql.catalyst.expressions.Asinh\| \|math_funcs\|asin\|org.apache.spark.sql.catalyst.expressions.Asin\| \|math_funcs\|atan2\|org.apache.spark.sql.catalyst.expressions.Atan2\| \|math_funcs\|atanh\|org.apache.spark.sql.catalyst.expressions.Atanh\| \|math_funcs\|atan\|org.apache.spark.sql.catalyst.expressions.Atan\| \|math_funcs\|bin\|org.apache.spark.sql.catalyst.expressions.Bin\| \|math_funcs\|bround\|org.apache.spark.sql.catalyst.expressions.BRound\| \|math_funcs\|cbrt\|org.apache.spark.sql.catalyst.expressions.Cbrt\| \|math_funcs\|ceiling\|org.apache.spark.sql.catalyst.expressions.Ceil\| \|math_funcs\|ceil\|org.apache.spark.sql.catalyst.expressions.Ceil\| \|math_funcs\|conv\|org.apache.spark.sql.catalyst.expressions.Conv\| \|math_funcs\|cosh\|org.apache.spark.sql.catalyst.expressions.Cosh\| \|math_funcs\|cos\|org.apache.spark.sql.catalyst.expressions.Cos\| \|math_funcs\|cot\|org.apache.spark.sql.catalyst.expressions.Cot\| \|math_funcs\|degrees\|org.apache.spark.sql.catalyst.expressions.ToDegrees\| \|math_funcs\|div\|org.apache.spark.sql.catalyst.expressions.IntegralDivide\| \|math_funcs\|expm1\|org.apache.spark.sql.catalyst.expressions.Expm1\| \|math_funcs\|exp\|org.apache.spark.sql.catalyst.expressions.Exp\| \|math_funcs\|e\|org.apache.spark.sql.catalyst.expressions.EulerNumber\| \|math_funcs\|factorial\|org.apache.spark.sql.catalyst.expressions.Factorial\| \|math_funcs\|floor\|org.apache.spark.sql.catalyst.expressions.Floor\| 
\|math_funcs\|greatest\|org.apache.spark.sql.catalyst.expressions.Greatest\| \|math_funcs\|hex\|org.apache.spark.sql.catalyst.expressions.Hex\| \|math_funcs\|hypot\|org.apache.spark.sql.catalyst.expressions.Hypot\| \|math_funcs\|least\|org.apache.spark.sql.catalyst.expressions.Least\| \|math_funcs\|ln\|org.apache.spark.sql.catalyst.expressions.Log\| \|math_funcs\|log10\|org.apache.spark.sql.catalyst.expressions.Log10\| \|math_funcs\|log1p\|org.apache.spark.sql.catalyst.expressions.Log1p\| \|math_funcs\|log2\|org.apache.spark.sql.catalyst.expressions.Log2\| \|math_funcs\|log\|org.apache.spark.sql.catalyst.expressions.Logarithm\| \|math_funcs\|mod\|org.apache.spark.sql.catalyst.expressions.Remainder\| \|math_funcs\|negative\|org.apache.spark.sql.catalyst.expressions.UnaryMinus\| \|math_funcs\|pi\|org.apache.spark.sql.catalyst.expressions.Pi\| \|math_funcs\|pmod\|org.apache.spark.sql.catalyst.expressions.Pmod\| \|math_funcs\|positive\|org.apache.spark.sql.catalyst.expressions.UnaryPositive\| \|math_funcs\|power\|org.apache.spark.sql.catalyst.expressions.Pow\| \|math_funcs\|pow\|org.apache.spark.sql.catalyst.expressions.Pow\| \|math_funcs\|radians\|org.apache.spark.sql.catalyst.expressions.ToRadians\| \|math_funcs\|randn\|org.apache.spark.sql.catalyst.expressions.Randn\| \|math_funcs\|random\|org.apache.spark.sql.catalyst.expressions.Rand\| \|math_funcs\|rand\|org.apache.spark.sql.catalyst.expressions.Rand\| \|math_funcs\|rint\|org.apache.spark.sql.catalyst.expressions.Rint\| \|math_funcs\|round\|org.apache.spark.sql.catalyst.expressions.Round\| \|math_funcs\|shiftleft\|org.apache.spark.sql.catalyst.expressions.ShiftLeft\| \|math_funcs\|signum\|org.apache.spark.sql.catalyst.expressions.Signum\| \|math_funcs\|sign\|org.apache.spark.sql.catalyst.expressions.Signum\| \|math_funcs\|sinh\|org.apache.spark.sql.catalyst.expressions.Sinh\| \|math_funcs\|sin\|org.apache.spark.sql.catalyst.expressions.Sin\| \|math_funcs\|sqrt\|org.apache.spark.sql.catalyst.expressions.Sqrt\| 
\|math_funcs\|tanh\|org.apache.spark.sql.catalyst.expressions.Tanh\| \|math_funcs\|tan\|org.apache.spark.sql.catalyst.expressions.Tan\| \|math_funcs\|unhex\|org.apache.spark.sql.catalyst.expressions.Unhex\| \|math_funcs\|width_bucket\|org.apache.spark.sql.catalyst.expressions.WidthBucket\| \|misc_funcs\|assert_true\|org.apache.spark.sql.catalyst.expressions.AssertTrue\| \|misc_funcs\|current_catalog\|org.apache.spark.sql.catalyst.expressions.CurrentCatalog\| \|misc_funcs\|current_database\|org.apache.spark.sql.catalyst.expressions.CurrentDatabase\| \|misc_funcs\|input_file_block_length\|org.apache.spark.sql.catalyst.expressions.InputFileBlockLength\| \|misc_funcs\|input_file_block_start\|org.apache.spark.sql.catalyst.expressions.InputFileBlockStart\| \|misc_funcs\|input_file_name\|org.apache.spark.sql.catalyst.expressions.InputFileName\| \|misc_funcs\|java_method\|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection\| \|misc_funcs\|monotonically_increasing_id\|org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID\| \|misc_funcs\|raise_error\|org.apache.spark.sql.catalyst.expressions.RaiseError\| \|misc_funcs\|reflect\|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection\| \|misc_funcs\|spark_partition_id\|org.apache.spark.sql.catalyst.expressions.SparkPartitionID\| \|misc_funcs\|typeof\|org.apache.spark.sql.catalyst.expressions.TypeOf\| \|misc_funcs\|uuid\|org.apache.spark.sql.catalyst.expressions.Uuid\| \|misc_funcs\|version\|org.apache.spark.sql.catalyst.expressions.SparkVersion\| \|predicate_funcs\|!\|org.apache.spark.sql.catalyst.expressions.Not\| \|predicate_funcs\|<=>\|org.apache.spark.sql.catalyst.expressions.EqualNullSafe\| \|predicate_funcs\|<=\|org.apache.spark.sql.catalyst.expressions.LessThanOrEqual\| \|predicate_funcs\|<\|org.apache.spark.sql.catalyst.expressions.LessThan\| \|predicate_funcs\|==\|org.apache.spark.sql.catalyst.expressions.EqualTo\| 
\|predicate_funcs\|=\|org.apache.spark.sql.catalyst.expressions.EqualTo\| \|predicate_funcs\|>=\|org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual\| \|predicate_funcs\|>\|org.apache.spark.sql.catalyst.expressions.GreaterThan\| \|predicate_funcs\|and\|org.apache.spark.sql.catalyst.expressions.And\| \|predicate_funcs\|in\|org.apache.spark.sql.catalyst.expressions.In\| \|predicate_funcs\|isnan\|org.apache.spark.sql.catalyst.expressions.IsNaN\| \|predicate_funcs\|isnotnull\|org.apache.spark.sql.catalyst.expressions.IsNotNull\| \|predicate_funcs\|isnull\|org.apache.spark.sql.catalyst.expressions.IsNull\| \|predicate_funcs\|like\|org.apache.spark.sql.catalyst.expressions.Like\| \|predicate_funcs\|not\|org.apache.spark.sql.catalyst.expressions.Not\| \|predicate_funcs\|or\|org.apache.spark.sql.catalyst.expressions.Or\| \|predicate_funcs\|regexp_like\|org.apache.spark.sql.catalyst.expressions.RLike\| \|predicate_funcs\|rlike\|org.apache.spark.sql.catalyst.expressions.RLike\| \|string_funcs\|ascii\|org.apache.spark.sql.catalyst.expressions.Ascii\| \|string_funcs\|base64\|org.apache.spark.sql.catalyst.expressions.Base64\| \|string_funcs\|bit_length\|org.apache.spark.sql.catalyst.expressions.BitLength\| \|string_funcs\|char_length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|character_length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|char\|org.apache.spark.sql.catalyst.expressions.Chr\| \|string_funcs\|chr\|org.apache.spark.sql.catalyst.expressions.Chr\| \|string_funcs\|concat_ws\|org.apache.spark.sql.catalyst.expressions.ConcatWs\| \|string_funcs\|decode\|org.apache.spark.sql.catalyst.expressions.Decode\| \|string_funcs\|elt\|org.apache.spark.sql.catalyst.expressions.Elt\| \|string_funcs\|encode\|org.apache.spark.sql.catalyst.expressions.Encode\| \|string_funcs\|find_in_set\|org.apache.spark.sql.catalyst.expressions.FindInSet\| \|string_funcs\|format_number\|org.apache.spark.sql.catalyst.expressions.FormatNumber\| 
\|string_funcs\|format_string\|org.apache.spark.sql.catalyst.expressions.FormatString\| \|string_funcs\|initcap\|org.apache.spark.sql.catalyst.expressions.InitCap\| \|string_funcs\|instr\|org.apache.spark.sql.catalyst.expressions.StringInstr\| \|string_funcs\|lcase\|org.apache.spark.sql.catalyst.expressions.Lower\| \|string_funcs\|left\|org.apache.spark.sql.catalyst.expressions.Left\| \|string_funcs\|length\|org.apache.spark.sql.catalyst.expressions.Length\| \|string_funcs\|levenshtein\|org.apache.spark.sql.catalyst.expressions.Levenshtein\| \|string_funcs\|locate\|org.apache.spark.sql.catalyst.expressions.StringLocate\| \|string_funcs\|lower\|org.apache.spark.sql.catalyst.expressions.Lower\| \|string_funcs\|lpad\|org.apache.spark.sql.catalyst.expressions.StringLPad\| \|string_funcs\|ltrim\|org.apache.spark.sql.catalyst.expressions.StringTrimLeft\| \|string_funcs\|octet_length\|org.apache.spark.sql.catalyst.expressions.OctetLength\| \|string_funcs\|overlay\|org.apache.spark.sql.catalyst.expressions.Overlay\| \|string_funcs\|parse_url\|org.apache.spark.sql.catalyst.expressions.ParseUrl\| \|string_funcs\|position\|org.apache.spark.sql.catalyst.expressions.StringLocate\| \|string_funcs\|printf\|org.apache.spark.sql.catalyst.expressions.FormatString\| \|string_funcs\|regexp_extract_all\|org.apache.spark.sql.catalyst.expressions.RegExpExtractAll\| \|string_funcs\|regexp_extract\|org.apache.spark.sql.catalyst.expressions.RegExpExtract\| \|string_funcs\|regexp_replace\|org.apache.spark.sql.catalyst.expressions.RegExpReplace\| \|string_funcs\|repeat\|org.apache.spark.sql.catalyst.expressions.StringRepeat\| \|string_funcs\|replace\|org.apache.spark.sql.catalyst.expressions.StringReplace\| \|string_funcs\|right\|org.apache.spark.sql.catalyst.expressions.Right\| \|string_funcs\|rpad\|org.apache.spark.sql.catalyst.expressions.StringRPad\| \|string_funcs\|rtrim\|org.apache.spark.sql.catalyst.expressions.StringTrimRight\| 
\|string_funcs\|sentences\|org.apache.spark.sql.catalyst.expressions.Sentences\| \|string_funcs\|soundex\|org.apache.spark.sql.catalyst.expressions.SoundEx\| \|string_funcs\|space\|org.apache.spark.sql.catalyst.expressions.StringSpace\| \|string_funcs\|split\|org.apache.spark.sql.catalyst.expressions.StringSplit\| \|string_funcs\|substring_index\|org.apache.spark.sql.catalyst.expressions.SubstringIndex\| \|string_funcs\|substring\|org.apache.spark.sql.catalyst.expressions.Substring\| \|string_funcs\|substr\|org.apache.spark.sql.catalyst.expressions.Substring\| \|string_funcs\|translate\|org.apache.spark.sql.catalyst.expressions.StringTranslate\| \|string_funcs\|trim\|org.apache.spark.sql.catalyst.expressions.StringTrim\| \|string_funcs\|ucase\|org.apache.spark.sql.catalyst.expressions.Upper\| \|string_funcs\|unbase64\|org.apache.spark.sql.catalyst.expressions.UnBase64\| \|string_funcs\|upper\|org.apache.spark.sql.catalyst.expressions.Upper\| \|struct_funcs\|named_struct\|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct\| \|struct_funcs\|struct\|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct\| \|window_funcs\|cume_dist\|org.apache.spark.sql.catalyst.expressions.CumeDist\| \|window_funcs\|dense_rank\|org.apache.spark.sql.catalyst.expressions.DenseRank\| \|window_funcs\|lag\|org.apache.spark.sql.catalyst.expressions.Lag\| \|window_funcs\|lead\|org.apache.spark.sql.catalyst.expressions.Lead\| \|window_funcs\|nth_value\|org.apache.spark.sql.catalyst.expressions.NthValue\| \|window_funcs\|ntile\|org.apache.spark.sql.catalyst.expressions.NTile\| \|window_funcs\|percent_rank\|org.apache.spark.sql.catalyst.expressions.PercentRank\| \|window_funcs\|rank\|org.apache.spark.sql.catalyst.expressions.Rank\| \|window_funcs\|row_number\|org.apache.spark.sql.catalyst.expressions.RowNumber\| \|xml_funcs\|xpath_boolean\|org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean\| 
\|xml_funcs\|xpath_double\|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble\| \|xml_funcs\|xpath_float\|org.apache.spark.sql.catalyst.expressions.xml.XPathFloat\| \|xml_funcs\|xpath_int\|org.apache.spark.sql.catalyst.expressions.xml.XPathInt\| \|xml_funcs\|xpath_long\|org.apache.spark.sql.catalyst.expressions.xml.XPathLong\| \|xml_funcs\|xpath_number\|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble\| \|xml_funcs\|xpath_short\|org.apache.spark.sql.catalyst.expressions.xml.XPathShort\| \|xml_funcs\|xpath_string\|org.apache.spark.sql.catalyst.expressions.xml.XPathString\| \|xml_funcs\|xpath\|org.apache.spark.sql.catalyst.expressions.xml.XPathList\| Closes #30040 NOTE: An original author of this PR is tanelk, so the credit should be given to tanelk. ### Why are the changes needed? For better documents. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a test to check if exprs have a group tag in `ExpressionInfoSuite`. Closes #30867 from maropu/pr30040. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 04:24:04 -08:00
Yuming Wang	4b19f49dd0	[SPARK-33845][SQL] Remove unnecessary if when trueValue and falseValue are foldable boolean types ### What changes were proposed in this pull request? Improve `SimplifyConditionals`. Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`. Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`. The use case is: ```sql create table t1 using parquet as select id from range(10); select if (id > 2, false, true) from t1; ``` Before this pr: ``` == Physical Plan == (1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == (1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30849 from wangyum/SPARK-33798-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 04:15:29 -08:00
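The two rewrites described in the commit above can be illustrated on a toy expression tree. This is only a sketch: the types below are hypothetical stand-ins rather than Catalyst classes, and null semantics of the condition are ignored entirely.

```scala
object SimplifyIfSketch {
  // Toy expression tree; illustrative stand-ins, not Catalyst classes.
  sealed trait Expr
  case object TrueLit extends Expr
  case object FalseLit extends Expr
  final case class Attr(name: String) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr
  final case class Not(child: Expr) extends Expr
  final case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr

  // Applies the two patterns named in the commit message, recursing into children.
  def simplify(e: Expr): Expr = e match {
    case If(c, TrueLit, FalseLit) => simplify(c)       // If(cond, true, false) => cond
    case If(c, FalseLit, TrueLit) => Not(simplify(c))  // If(cond, false, true) => Not(cond)
    case If(c, t, f)              => If(simplify(c), simplify(t), simplify(f))
    case Not(c)                   => Not(simplify(c))
    case GreaterThan(l, r)        => GreaterThan(simplify(l), simplify(r))
    case other                    => other
  }
}
```

In the plan shown above, `Not(id > 2)` appears as `id <= 2`; that extra step presumably comes from a separate rewrite, which this toy omits.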
Wenchen Fan	b4bea1aa89	[SPARK-28863][SQL][FOLLOWUP] Make sure optimized plan will not be re-analyzed ### What changes were proposed in this pull request? It's a known issue that re-analyzing an optimized plan can lead to various issues. We made several attempts to prevent it, but the current solution `AlreadyOptimized` is still not 100% safe, as people can inject catalyst rules that call the analyzer directly. This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and the analyzer will skip processing plans whose `analyzed` flag is true. ### Why are the changes needed? make the code simpler and safer ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. Closes #30777 from cloud-fan/ds. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-21 20:59:33 +09:00
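The flag-based idea in the commit above can be sketched with a minimal stand-in. `Plan`, `analyze`, and `optimize` here are hypothetical names for illustration only, not Spark's actual classes or methods.

```scala
object AnalyzedFlagSketch {
  // Hypothetical stand-in for a logical plan that carries an `analyzed` flag.
  final class Plan(val description: String) {
    private var _analyzed = false
    def analyzed: Boolean = _analyzed
    def setAnalyzed(): Unit = _analyzed = true
  }

  // Counts how often real analysis work happens, for demonstration.
  var analysisRuns = 0

  // The analyzer skips any plan that is already marked as analyzed.
  def analyze(plan: Plan): Plan =
    if (plan.analyzed) plan
    else { analysisRuns += 1; plan.setAnalyzed(); plan }

  // The optimizer marks its output as analyzed so it is never re-analyzed.
  def optimize(plan: Plan): Plan = {
    val optimized = new Plan(s"optimized(${plan.description})")
    optimized.setAnalyzed()
    optimized
  }
}
```

Even if an injected rule calls `analyze` on the optimized plan, the flag check makes the call a no-op in this sketch.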
Max Gekk	cdd1752ad1	[SPARK-33862][SQL] Throw `PartitionAlreadyExistsException` if the target partition exists while renaming ### What changes were proposed in this pull request? Throw `PartitionAlreadyExistsException` from `ALTER TABLE .. RENAME TO PARTITION` for a table from Hive V1 External Catalog in the case when the target partition already exists. ### Why are the changes needed? 1. To have the same behavior of V1 In-Memory and Hive External Catalog. 2. To not propagate internal Hive's exceptions to users. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the partition renaming command throws `PartitionAlreadyExistsException` for tables from the Hive catalog. ### How was this patch tested? Added new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite" ``` Closes #30866 from MaxGekk/throw-PartitionAlreadyExistsException. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 03:37:30 -08:00
Kousuke Saruta	f4e1069bb8	[SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code ### What changes were proposed in this pull request? This PR fixes an issue that `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries. The following example is about `EXPLAIN CODEGEN`. ``` spark.conf.set("spark.sql.adaptive.enabled", "false") val df = spark.range(1, 100) df.createTempView("df") spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN") scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN") Found 1 WholeStageCodegen subtrees. == Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) == (1) Project [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L] : +- Subquery scalar-subquery#3, [id=#24] : +- (2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L]) : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#20] : +- (1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L]) : +- (1) Range (1, 100, step=1, splits=12) +- (1) Scan OneRowRelation[] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator rdd_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; /* 011 */ /* 012 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 013 */ this.references = references; /* 014 */ } /* 015 */ /* 016 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 017 */ partitionIndex = index; /* 
018 */ this.inputs = inputs; /* 019 */ rdd_input_0 = inputs[0]; /* 020 */ project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 021 */ /* 022 */ } /* 023 */ /* 024 */ private void project_doConsume_0() throws java.io.IOException { /* 025 */ // common sub-expressions /* 026 */ /* 027 */ project_mutableStateArray_0[0].reset(); /* 028 */ /* 029 */ if (false) { /* 030 */ project_mutableStateArray_0[0].setNullAt(0); /* 031 */ } else { /* 032 */ project_mutableStateArray_0[0].write(0, 1L); /* 033 */ } /* 034 */ append((project_mutableStateArray_0[0].getRow())); /* 035 */ /* 036 */ } /* 037 */ /* 038 */ protected void processNext() throws java.io.IOException { /* 039 */ while ( rdd_input_0.hasNext()) { /* 040 */ InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next(); /* 041 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 042 */ project_doConsume_0(); /* 043 */ if (shouldStop()) return; /* 044 */ } /* 045 */ } /* 046 */ /* 047 */ } ``` After this change, the corresponding code for subqueries is shown. ``` Found 3 WholeStageCodegen subtrees. 
== Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) == (1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L]) +- (1) Range (1, 100, step=1, splits=12) Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private boolean agg_initAgg_0; /* 010 */ private boolean agg_bufIsNull_0; /* 011 */ private long agg_bufValue_0; /* 012 */ private boolean range_initRange_0; /* 013 */ private long range_nextIndex_0; /* 014 */ private TaskContext range_taskContext_0; /* 015 */ private InputMetrics range_inputMetrics_0; /* 016 */ private long range_batchEnd_0; /* 017 */ private long range_numElementsTodo_0; /* 018 */ private boolean agg_agg_isNull_2_0; /* 019 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3]; /* 020 */ /* 021 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 022 */ this.references = references; /* 023 */ } /* 024 */ /* 025 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 026 */ partitionIndex = index; /* 027 */ this.inputs = inputs; /* 028 */ /* 029 */ range_taskContext_0 = TaskContext.get(); /* 030 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics(); /* 031 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 032 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 033 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 034 */ /* 035 */ } /* 036 
*/ /* 037 */ private void agg_doAggregateWithoutKey_0() throws java.io.IOException { /* 038 */ // initialize aggregation buffer /* 039 */ agg_bufIsNull_0 = true; /* 040 */ agg_bufValue_0 = -1L; /* 041 */ /* 042 */ // initialize Range /* 043 */ if (!range_initRange_0) { /* 044 */ range_initRange_0 = true; /* 045 */ initRange(partitionIndex); /* 046 */ } /* 047 */ /* 048 */ while (true) { /* 049 */ if (range_nextIndex_0 == range_batchEnd_0) { /* 050 */ long range_nextBatchTodo_0; /* 051 */ if (range_numElementsTodo_0 > 1000L) { /* 052 */ range_nextBatchTodo_0 = 1000L; /* 053 */ range_numElementsTodo_0 -= 1000L; /* 054 */ } else { /* 055 */ range_nextBatchTodo_0 = range_numElementsTodo_0; /* 056 */ range_numElementsTodo_0 = 0; /* 057 */ if (range_nextBatchTodo_0 == 0) break; /* 058 */ } /* 059 */ range_batchEnd_0 += range_nextBatchTodo_0 * 1L; /* 060 */ } /* 061 */ /* 062 */ int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L); /* 063 */ for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) { /* 064 */ long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0; /* 065 */ /* 066 */ agg_doConsume_0(range_value_0); /* 067 */ /* 068 */ // shouldStop check is eliminated /* 069 */ } /* 070 */ range_nextIndex_0 = range_batchEnd_0; /* 071 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0); /* 072 */ range_inputMetrics_0.incRecordsRead(range_localEnd_0); /* 073 */ range_taskContext_0.killTaskIfInterrupted(); /* 074 */ } /* 075 */ /* 076 */ } /* 077 */ /* 078 */ private void initRange(int idx) { /* 079 */ java.math.BigInteger index = java.math.BigInteger.valueOf(idx); /* 080 */ java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L); /* 081 */ java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L); /* 082 */ java.math.BigInteger step = java.math.BigInteger.valueOf(1L); /* 083 */ java.math.BigInteger start = java.math.BigInteger.valueOf(1L); /* 084 */ long partitionEnd; /* 085 */ /* 086 */ java.math.BigInteger st = 
index.multiply(numElement).divide(numSlice).multiply(step).add(start); /* 087 */ if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { /* 088 */ range_nextIndex_0 = Long.MAX_VALUE; /* 089 */ } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { /* 090 */ range_nextIndex_0 = Long.MIN_VALUE; /* 091 */ } else { /* 092 */ range_nextIndex_0 = st.longValue(); /* 093 */ } /* 094 */ range_batchEnd_0 = range_nextIndex_0; /* 095 */ /* 096 */ java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice) /* 097 */ .multiply(step).add(start); /* 098 */ if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { /* 099 */ partitionEnd = Long.MAX_VALUE; /* 100 */ } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { /* 101 */ partitionEnd = Long.MIN_VALUE; /* 102 */ } else { /* 103 */ partitionEnd = end.longValue(); /* 104 */ } /* 105 */ /* 106 */ java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract( /* 107 */ java.math.BigInteger.valueOf(range_nextIndex_0)); /* 108 */ range_numElementsTodo_0 = startToEnd.divide(step).longValue(); /* 109 */ if (range_numElementsTodo_0 < 0) { /* 110 */ range_numElementsTodo_0 = 0; /* 111 */ } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) { /* 112 */ range_numElementsTodo_0++; /* 113 */ } /* 114 */ } /* 115 */ /* 116 */ private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException { /* 117 */ // do aggregate /* 118 */ // common sub-expressions /* 119 */ /* 120 */ // evaluate aggregate functions and update aggregation buffers /* 121 */ /* 122 */ agg_agg_isNull_2_0 = true; /* 123 */ long agg_value_2 = -1L; /* 124 */ /* 125 */ if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 \|\| /* 126 */ agg_value_2 > agg_bufValue_0)) { /* 127 */ agg_agg_isNull_2_0 = false; /* 128 */ agg_value_2 = agg_bufValue_0; /* 129 */ } /* 130 */ /* 131 */ if (!false && (agg_agg_isNull_2_0 \|\| /* 132 */ agg_value_2 > agg_expr_0_0)) { /* 133 */ agg_agg_isNull_2_0 = false; /* 134 */ 
agg_value_2 = agg_expr_0_0; /* 135 */ } /* 136 */ /* 137 */ agg_bufIsNull_0 = agg_agg_isNull_2_0; /* 138 */ agg_bufValue_0 = agg_value_2; /* 139 */ /* 140 */ } /* 141 */ /* 142 */ protected void processNext() throws java.io.IOException { /* 143 */ while (!agg_initAgg_0) { /* 144 */ agg_initAgg_0 = true; /* 145 */ long agg_beforeAgg_0 = System.nanoTime(); /* 146 */ agg_doAggregateWithoutKey_0(); /* 147 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000); /* 148 */ /* 149 */ // output the result /* 150 */ /* 151 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1); /* 152 */ range_mutableStateArray_0[2].reset(); /* 153 */ /* 154 */ range_mutableStateArray_0[2].zeroOutNullBytes(); /* 155 */ /* 156 */ if (agg_bufIsNull_0) { /* 157 */ range_mutableStateArray_0[2].setNullAt(0); /* 158 */ } else { /* 159 */ range_mutableStateArray_0[2].write(0, agg_bufValue_0); /* 160 */ } /* 161 */ append((range_mutableStateArray_0[2].getRow())); /* 162 */ } /* 163 */ } /* 164 */ /* 165 */ } ``` ### Why are the changes needed? For better debuggability. ### Does this PR introduce _any_ user-facing change? Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`. ### How was this patch tested? New test. Closes #30859 from sarutak/explain-codegen-subqueries. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 03:29:00 -08:00
Max Gekk	b313a1e9e6	[SPARK-33849][SQL][TESTS] Unify v1 and v2 DROP TABLE tests ### What changes were proposed in this pull request? 1. Move the `DROP TABLE` parsing tests to `DropTableParserSuite` 2. Place the v1 tests for `DROP TABLE` from `DDLSuite` and v2 tests from `DataSourceV2SQLSuite` in the common trait `DropTableSuiteBase`, so that the tests run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification allows running common `DROP TABLE` tests for DSv1, Hive DSv1, and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly DropTableParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly DropTableSuite" ``` Closes #30854 from MaxGekk/unify-drop-table-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 08:34:12 +00:00
Terry Kim	1c7b79c057	[SPARK-33856][SQL] Migrate ALTER TABLE ... RENAME TO PARTITION to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... RENAME TO PARTITION` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... RENAME TO PARTITION` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, before this PR: ``` sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") // works fine assuming id=1 exists. ``` After this PR: ``` sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RENAME TO PARTITION' expects a table; line 1 pos 0 ``` This is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30862 from imback82/alter_table_rename_partition_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 04:58:56 +00:00
Kousuke Saruta	3c8be3983c	[SPARK-33850][SQL][FOLLOWUP] Improve and cleanup the test code ### What changes were proposed in this pull request? This PR mainly improves and cleans up the test code introduced in #30855 based on the comment. The test code is actually taken from another test `explain formatted - check presence of subquery in case of DPP` so this PR cleans the code too ( removed unnecessary `withTable`). ### Why are the changes needed? To keep the test code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `ExplainSuite` passes. Closes #30861 from sarutak/followup-SPARK-33850. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-21 09:40:42 +09:00
Terry Kim	df2314b63a	[SPARK-33852][SQL][TESTS] Use assertAnalysisError in HiveDDLSuite.scala ### What changes were proposed in this pull request? `HiveDDLSuite` has many of the following patterns: ```scala val e = intercept[AnalysisException] { sql(sqlString) } assert(e.message.contains(exceptionMessage)) ``` However, there already exists `assertAnalysisError` helper function which does exactly the same thing. ### Why are the changes needed? To refactor code to simplify. ### Does this PR introduce _any_ user-facing change? No, just refactoring the test code. ### How was this patch tested? Existing tests Closes #30857 from imback82/hive_ddl_suite_use_assertAnalysisError. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 14:37:15 -08:00
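The intercept-then-assert pattern from the commit above can be folded into one helper, roughly as sketched here. The signature below and the local `AnalysisException` class are assumptions for illustration; the actual helper in `HiveDDLSuite` may look different.

```scala
object AssertAnalysisErrorSketch {
  // Stand-in exception type; the real suite uses Spark's AnalysisException.
  final class AnalysisException(val message: String) extends Exception(message)

  // Hypothetical helper: runs `body` and expects an AnalysisException
  // whose message contains `expected`.
  def assertAnalysisError(expected: String)(body: => Unit): Unit = {
    val caught: Option[AnalysisException] =
      try { body; None } catch { case e: AnalysisException => Some(e) }
    caught match {
      case Some(e) =>
        assert(e.message.contains(expected), s"unexpected message: ${e.message}")
      case None =>
        throw new AssertionError("expected an AnalysisException, but none was thrown")
    }
  }
}
```

A call site then shrinks to a single expression, for example `assertAnalysisError("not found") { runStatement() }`, where `runStatement` is whatever code is expected to fail.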
Kousuke Saruta	70da86a085	[SPARK-33850][SQL] EXPLAIN FORMATTED doesn't show the plan for subqueries if AQE is enabled ### What changes were proposed in this pull request? This PR fixes an issue that when AQE is enabled, EXPLAIN FORMATTED doesn't show the plan for subqueries. ```scala val df = spark.range(1, 100) df.createTempView("df") spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED") == Physical Plan == AdaptiveSparkPlan (3) +- Project (2) +- Scan OneRowRelation (1) (1) Scan OneRowRelation Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L] Input: [] (3) AdaptiveSparkPlan Output [1]: [scalarsubquery()#5L] Arguments: isFinalPlan=false ``` After this change, the plan for the subquery is shown. ```scala == Physical Plan == * Project (2) +- * Scan OneRowRelation (1) (1) Scan OneRowRelation [codegen id : 1] Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project [codegen id : 1] Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L] Input: [] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24] * HashAggregate (6) +- Exchange (5) +- * HashAggregate (4) +- * Range (3) (3) Range [codegen id : 1] Output [1]: [id#0L] Arguments: Range (1, 100, step=1, splits=Some(12)) (4) HashAggregate [codegen id : 1] Input [1]: [id#0L] Keys: [] Functions [1]: [partial_min(id#0L)] Aggregate Attributes [1]: [min#7L] Results [1]: [min#8L] (5) Exchange Input [1]: [min#8L] Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20] (6) HashAggregate [codegen id : 2] Input [1]: [min#8L] Keys: [] Functions [1]: [min(id#0L)] Aggregate Attributes [1]: [min(id#0L)#4L] Results [1]: [min(id#0L)#4L AS v#2L] ``` ### Why are the changes needed? For better debuggability. 
### Does this PR introduce _any_ user-facing change? Yes. Users can see the formatted plan for subqueries. ### How was this patch tested? New test. Closes #30855 from sarutak/fix-aqe-explain. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 14:10:20 -08:00
Ammar Al-Batool	37c4cd8f05	[MINOR][DOCS] Fix typos in ScalaDocs for DataStreamWriter#foreachBatch The title is pretty self-explanatory. ### What changes were proposed in this pull request? Fixing typos in the docs for `foreachBatch` functions. ### Why are the changes needed? To fix typos in JavaDoc/ScalaDoc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Yes. Closes #30782 from ammar1x/patch-1. Lead-authored-by: Ammar Al-Batool <ammar.albatool@gmail.com> Co-authored-by: Ammar Al-Batool <ammar.al-batool@disneystreaming.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-19 14:53:40 -06:00
Terry Kim	06075d849e	[SPARK-33829][SQL] Renaming v2 tables should recreate the cache ### What changes were proposed in this pull request? Currently, renaming v2 tables does not invalidate/recreate the cache, leading to an incorrect behavior (cache not being used) when v2 tables are renamed. This PR fixes the behavior. ### Why are the changes needed? Fixing a bug since the cache associated with the renamed table is not being cleaned up/recreated. ### Does this PR introduce _any_ user-facing change? Yes, now when a v2 table is renamed, cache is correctly updated. ### How was this patch tested? Added a new test Closes #30825 from imback82/rename_recreate_cache_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 08:32:58 -08:00
Kent Yao	dd44ba5460	[SPARK-32976][SQL][FOLLOWUP] SET and RESTORE hive.exec.dynamic.partition.mode for HiveSQLInsertTestSuite to avoid flakiness ### What changes were proposed in this pull request? As https://github.com/apache/spark/pull/29893#discussion_r545303780 mentioned: > We need to set spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") before executing this suite; otherwise, test("insert with column list - follow table output order + partitioned table") will fail. The reason it does not fail is that some test cases [running before this suite] do not change the default value of hive.exec.dynamic.partition.mode back to strict. However, the order of test suite execution is not deterministic. ### Why are the changes needed? avoid flakiness in tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #30843 from yaooqinn/SPARK-32976-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-19 08:00:09 -08:00
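The set-and-restore pattern named in the commit above can be sketched as follows. The mutable map is a hypothetical in-memory stand-in for a session configuration, and `withConf` is an illustrative helper name, not the suite's actual code.

```scala
import scala.collection.mutable

object SetAndRestoreSketch {
  // Hypothetical in-memory stand-in for a session configuration.
  val conf: mutable.Map[String, String] =
    mutable.Map("hive.exec.dynamic.partition.mode" -> "strict")

  // Sets `key` to `value` for the duration of `body`, then restores the
  // previous state even if `body` throws.
  def withConf[T](key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf(key) = value
    try body
    finally previous match {
      case Some(old) => conf(key) = old
      case None      => conf.remove(key)
    }
  }
}
```

Because the previous value is restored in `finally`, the outcome of a suite written this way does not depend on which suites ran before it or on what they left behind.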
Wenchen Fan	de234eec8f	[SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property ### What changes were proposed in this pull request? The Hive metastore limits the length of a table property. To work around it, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, as they can grow very large. This PR refactors the table property splitting code so that we can share it between the schema JSON string and histogram column stats. ### Why are the changes needed? To be able to analyze a table when the histogram data is big. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test and new tests Closes #30809 from cloud-fan/cbo. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-19 14:35:28 +09:00
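The splitting idea described in the commit above can be sketched in plain Scala. The key layout (`<key>.numParts`, `<key>.part.<i>`) and the function names are assumptions chosen for illustration, not necessarily the names Spark uses.

```scala
object SplitPropertySketch {
  // Splits `value` into chunks of at most `threshold` characters and stores
  // them under hypothetical keys "<key>.part.<i>" plus a "<key>.numParts" count.
  def split(key: String, value: String, threshold: Int): Map[String, String] = {
    require(threshold > 0, "threshold must be positive")
    val parts = value.grouped(threshold).toIndexedSeq
    val partEntries = parts.zipWithIndex.map { case (p, i) => s"$key.part.$i" -> p }
    (partEntries :+ (s"$key.numParts" -> parts.size.toString)).toMap
  }

  // Reassembles the original value, or returns None if the key was never split.
  def join(key: String, props: Map[String, String]): Option[String] =
    props.get(s"$key.numParts").map { n =>
      (0 until n.toInt).map(i => props(s"$key.part.$i")).mkString
    }
}
```

Since the helper only deals with a key prefix and a string, the same pair of functions could serve both a schema JSON string and a histogram string, which is the kind of sharing the refactoring aims for.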
Kent Yao	c17c76dd16	[SPARK-33599][SQL][FOLLOWUP] FIX Github Action with unidoc ### What changes were proposed in this pull request? FIX Github Action with unidoc ### Why are the changes needed? FIX Github Action with unidoc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Pass GA Closes #30846 from yaooqinn/SPARK-33599. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-18 11:23:38 -08:00
gengjiaan	6dca2e5d35	[SPARK-33599][SQL] Group exception messages in catalyst/analysis ### What changes were proposed in this pull request? This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #30717 from beliefer/SPARK-33599. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 14:12:35 +00:00
gengjiaan	f239128802	[SPARK-33597][SQL] Support REGEXP_LIKE for consistent with mainstream databases ### What changes were proposed in this pull request? Many mainstream databases support the regex function `REGEXP_LIKE`. Currently, Spark supports `RLike`, so we just need to add a new alias `REGEXP_LIKE` for it. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19 Presto https://prestodb.io/docs/current/functions/regexp.html Vertica https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_____5 Snowflake https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html Additional modifications 1. Because the test case named `check outputs of expression examples` in ExpressionInfoSuite executes the example SQL of built-in functions, the SQL below would be executed: `SELECT '%SystemDrive%\Users\John' regexp_like '%SystemDrive%\\Users.'` But Spark SQL does not support this syntax yet. 2. Another reason: `SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.';` is SQL syntax, not a use case for the function `RLike`. For the above reasons, this PR changes the example SQL of `RLike`. ### Why are the changes needed? No ### Does this PR introduce _any_ user-facing change? Make the behavior of Spark SQL consistent with mainstream databases. ### How was this patch tested? Jenkins test Closes #30543 from beliefer/SPARK-33597. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 13:47:31 +00:00
Yuming Wang	06b1bbbbab	[SPARK-33798][SQL] Add new rule to push down the foldable expressions through CaseWhen/If ### What changes were proposed in this pull request? This PR adds a new rule (`PushFoldableIntoBranches`) to push down the foldable expressions through `CaseWhen/If`. This is a real case from production: ```sql create table t1 using parquet as select * from range(100); create table t2 using parquet as select * from range(200); create temp view v1 as select 'a' as event_type, * from t1 union all select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2 explain select * from v1 where event_type = 'a'; ``` Before this PR: ``` == Physical Plan == Union :- (1) Project [a AS event_type#30533, id#30535L] : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet +- (2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L] +- (2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a) +- (2) ColumnarToRow +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet ``` After this PR: ``` == Physical Plan == (1) Project [a AS event_type#8, id#4L] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30790 from wangyum/SPARK-33798. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 13:20:58 +00:00
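Why pushing the comparison into the branches helps can be seen on a toy expression tree. This is a sketch only: the types are hypothetical stand-ins rather than Catalyst classes, only the `CaseWhen = literal` shape is handled, and null semantics for a missing `ELSE` are ignored.

```scala
object PushFoldableSketch {
  // Toy expression tree; illustrative stand-ins, not Catalyst classes.
  sealed trait Expr
  final case class Lit(value: Any) extends Expr
  final case class Col(name: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

  // Constant-folds a comparison of two literals; leaves everything else alone.
  private def fold(e: Expr): Expr = e match {
    case EqualTo(Lit(a), Lit(b)) => Lit(a == b)
    case other                   => other
  }

  // Pushes `= literal` into each branch when every branch value is itself a literal.
  def push(e: Expr): Expr = e match {
    case EqualTo(CaseWhen(branches, elseValue), lit: Lit)
        if (branches.map(_._2) ++ elseValue).forall(_.isInstanceOf[Lit]) =>
      CaseWhen(
        branches.map { case (cond, value) => cond -> fold(EqualTo(value, lit)) },
        elseValue.map(v => fold(EqualTo(v, lit))))
    case other => other
  }
}
```

For the filter in the example above, `CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' END = 'a'` becomes a `CASE` whose branches are all `false`. A predicate of that form can never be true, which is consistent with the `t2` side of the union disappearing from the plan after the change.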
angerszhu	0603913c66	[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value ### What changes were proposed in this pull request? Currently, when the Parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as shown in the UT below ```scala test("Parquet vector reader incorrect with binary partition value") { Seq(false, true).foreach(tag => { withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { withTable("t1") { sql( """CREATE TABLE t1(name STRING, id BINARY, part BINARY) \| USING PARQUET PARTITIONED BY (part)""".stripMargin) sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')") if (tag) { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "")) } else { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "Spark SQL")) } } } }) } ``` ### Why are the changes needed? Fix a data correctness issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #30824 from AngersZhuuuu/SPARK-33593. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-18 00:01:13 -08:00
Terry Kim	0f1a18370a	[SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query to avoid creating a dataframe ### What changes were proposed in this pull request? This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190 For reference, `UNCACHE TABLE` also uses `LogicalPlan`: `0c12900120/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (L91-L98)` ### Why are the changes needed? To avoid creating an unnecessary dataframe and make it consistent with `uncacheQuery` used in `UNCACHE TABLE`. ### Does this PR introduce _any_ user-facing change? No, just internal changes. ### How was this patch tested? Existing tests since this is an internal refactoring change. Closes #30815 from imback82/cache_with_logical_plan. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-18 04:30:15 +00:00
Takeshi Yamamuro	51ef4430dc	[SPARK-33822][SQL] Use the `CastSupport.cast` method in HashJoin ### What changes were proposed in this pull request? This PR intends to fix the bug that throws an unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](`031c5ef280`)): ``` java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path. at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321) at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ... ``` I've checked the AQE code and I found `EnsureRequirements` wrongly puts `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` phase as follows: ``` +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183] +- BroadcastQueryStage 2 +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963] ``` A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. So, this difference can make the distribution requirement check fail in `EnsureRequirements`: `1e85707738/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (L47-L50)` The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked that q5 passed. Closes #30818 from maropu/BugfixInAQE. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-17 16:16:05 -08:00
allisonwang-db	1e85707738	[SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default ### What changes were proposed in this pull request? This PR changes the rule `RemoveRedundantProjects` from by default passing column ordering requirements from parent nodes to always require column orders regardless of the requirements from parent nodes unless otherwise specified. More specifically, instead of excluding a few nodes like GenerateExec, UnionExec that are known to require children columns to be ordered, the rule now includes a whitelist of nodes that allow passing through the ordering requirements from their parents. ### Why are the changes needed? Currently, this rule passes through ordering requirements from parents directly to children except for a few excluded nodes. This incorrectly removes the necessary project nodes below a UnionExec since it is not excluded. An earlier PR also fixed a similar issue for GenerateExec (SPARK-32861). In order to prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass through the requirements. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #30659 from allisonwang-db/spark-33697-remove-project-union. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-17 05:47:44 +00:00
Terry Kim	0c19497222	[SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES] to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t SET SERDE 'serdename'") // works fine ``` , but after this PR: ``` sql("ALTER TABLE t SET SERDE 'serdename'") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES\' expects a table; line 1 pos 0 ``` , which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `t` in the above example is resolved to a temp view first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30813 from imback82/alter_table_serde_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-17 05:25:51 +00:00
Terry Kim	e7e29fd0af	[SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case class ### What changes were proposed in this pull request? This PR removes unused `TruncateTableStatement`: https://github.com/apache/spark/pull/30457#discussion_r544433820 ### Why are the changes needed? To remove unused `TruncateTableStatement` from #30457. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #30811 from imback82/remove_truncate_table_stmt. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-16 14:13:02 -08:00
Kent Yao	728a1298af	[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions ### What changes were proposed in this pull request? It seems very common for people to use a DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing. For example ``` insert into table src select * from values (1), (2), (3) t(a) distribute by 1 ``` Users may want the final output to be one single data file, but that is not always what happens. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to a partition with index > 0, there will always be 2 files and part-00000 is empty. In addition, a lot of unnecessary empty tasks will be launched. When users repeat the insert statement daily, hourly, or minutely, it causes small file issues. ``` spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1; kentyaohulk ~/spark SPARK-33806 tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ ├── [ 0] _SUCCESS ├── [ 298] part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet └── [ 426] part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet ``` To avoid this, there are some options you can take. 1. use `distribute by null` to let the data go to partition 0 2. set spark.sql.adaptive.enabled to true for Spark to automatically coalesce 3. use hints instead of `distribute by` 4. set spark.sql.shuffle.partitions to 1 In this PR, we set the partition number to 1 in this particular case. ### Why are the changes needed? 1. avoid small file issues 2. avoid unnecessary empty tasks when no adaptive execution ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #30800 from yaooqinn/SPARK-33806. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-16 14:09:28 -08:00
Terry Kim	8666d1c39c	[SPARK-33800][SQL] Remove command name in AnalysisException message when a relation is not resolved ### What changes were proposed in this pull request? Based on the discussion https://github.com/apache/spark/pull/30743#discussion_r543124594, this PR proposes to remove the command name in AnalysisException message when a relation is not resolved. For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands. ### Why are the changes needed? To make the exception message consistent. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed from ``` Table or view not found for 'SHOW TBLPROPERTIES': badtable ``` to ``` Table or view not found: badtable ``` for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier. ### How was this patch tested? Updated existing tests. Closes #30794 from imback82/remove_cmd_from_exception_msg. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 15:56:50 +00:00
Kent Yao	205d8e40bc	[SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first ### What changes were proposed in this pull request? As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command here to respect the per-session initial configs first and then fall back to the `SharedState` conf, which lets each session maintain a different copy of initial configs for resetting. ### Why are the changes needed? To make the RESET command saner. ### Does this PR introduce _any_ user-facing change? Yes, RESET will respect the session initial configs first instead of always going to the system defaults ### How was this patch tested? Added new tests Closes #30642 from yaooqinn/SPARK-32991-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 14:36:38 +00:00
Max Gekk	9d9d4a8e12	[SPARK-33789][SQL][TESTS] Refactor unified V1 and V2 datasource tests ### What changes were proposed in this pull request? 1. Move common utility functions such as `test()`, `withNsTable()` and `checkPartitions()` to `DDLCommandTestUtils`. 2. Place common settings such as `version`, `catalog`, `defaultUsing`, `sparkConf` to `CommandSuiteBase`. ### Why are the changes needed? To improve code maintenance of the unified tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly ShowTablesSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" ``` Closes #30779 from MaxGekk/refactor-unified-tests. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 13:49:49 +00:00
HyukjinKwon	7845865b8d	[SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE command ### What changes were proposed in this pull request? This PR proposes to sort table properties in DESCRIBE TABLE command. This is consistent with DSv2 command as well: `e3058ba17c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala (L63)` This PR fixes the test case in Scala 2.13 build as well where the table properties have different order in the map. ### Why are the changes needed? To keep the deterministic and pretty output, and fix the tests in Scala 2.13 build. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/ ``` describe.sql Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29 DESC FORMATTED v ``` ### Does this PR introduce _any_ user-facing change? Yes, it will change the text output from `DESCRIBE [EXTENDED\|FORMATTED] table_name`. Now the table properties are sorted by its key. ### How was this patch tested? Related unittests were fixed accordingly. Closes #30799 from HyukjinKwon/SPARK-33803. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 13:42:30 +00:00
Terry Kim	ef7f6903b4	[SPARK-33786][SQL] The storage level for a cache should be respected when a table name is altered ### What changes were proposed in this pull request? This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`. ### Why are the changes needed? Currently, when a table name is altered, the table's cache is refreshed (if exists), but the storage level is not retained. For example: ```scala def getStorageLevel(tableName: String): StorageLevel = { val table = spark.table(tableName) val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get cachedData.cachedRepresentation.cacheBuilder.storageLevel } Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath) sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'") sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')") val oldStorageLevel = getStorageLevel("old") sql("ALTER TABLE old RENAME TO new") val newStorageLevel = getStorageLevel("new") ``` `oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level. ### Does this PR introduce _any_ user-facing change? Yes, now the storage level for the cache will be retained. ### How was this patch tested? Added a unit test. Closes #30774 from imback82/alter_table_rename_cache_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 05:45:44 +00:00
Terry Kim	62be2483d7	[SPARK-33765][SQL] Migrate UNCACHE TABLE to use UnresolvedRelation to resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `UNCACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022. ### Why are the changes needed? To resolve the table/view in the analyzer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated existing tests Closes #30743 from imback82/uncache_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-16 05:37:56 +00:00
Max Gekk	3dfdcf4f92	[SPARK-33788][SQL] Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions() ### What changes were proposed in this pull request? Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existent partitions of a table in V1 Hive external catalog. ### Why are the changes needed? The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception. ### Does this PR introduce _any_ user-facing change? Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`. ### How was this patch tested? By running tests for `ALTER TABLE .. DROP PARTITION`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30778 from MaxGekk/hive-drop-partition-exception. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-16 10:03:48 +09:00
Anton Okolnychyi	4d56d43838	[SPARK-33735][SQL] Handle UPDATE in ReplaceNullWithFalseInPredicate ### What changes were proposed in this pull request? This PR adds `UpdateTable` to supported plans in `ReplaceNullWithFalseInPredicate`. ### Why are the changes needed? This change allows Spark to optimize update conditions like we optimize filters. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR extends the existing test cases to also cover `UpdateTable`. Closes #30787 from aokolnychyi/spark-33735. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-15 13:50:58 -08:00
Wenchen Fan	40c37d69fd	[SPARK-33617][SQL][FOLLOWUP] refine the default parallelism SQL config ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30559 . The default parallelism config in Spark core is not good, as it's unclear where it applies. To not inherit this problem in Spark SQL, this PR refines the default parallelism SQL config, to make it clear that it only applies to leaf nodes. ### Why are the changes needed? Make the config clearer. ### Does this PR introduce _any_ user-facing change? It changes an unreleased config. ### How was this patch tested? existing tests Closes #30736 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 14:16:43 +00:00
Prakhar Jain	23083aa594	[SPARK-33758][SQL] Prune unrequired partitionings from AliasAwareOutputPartitionings when some columns are dropped from projection ### What changes were proposed in this pull request? This PR prunes the unrequired output partitionings in cases when columns are dropped from Project/Aggregates etc. ### Why are the changes needed? Consider this query: `select t1.id from t1 JOIN t2 on t1.id = t2.id`. This query will have a top-level Project node which just projects t1.id. But the outputPartitioning of this Project node will be: PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id)). Since we are not propagating the t2.id column, we can drop HashPartitioning(t2.id) from the output partitioning of the Project node. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UTs. Closes #30762 from prakharjain09/SPARK-33758-prune-partitioning. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 13:46:58 +00:00
gengjiaan	58cb2bae74	[SPARK-33752][SQL] Avoid the getSimpleMessage of AnalysisException adds semicolon repeatedly ### What changes were proposed in this pull request? The current `getSimpleMessage` of `AnalysisException` may add a semicolon repeatedly. For example: `select decode()` The output will be: ``` org.apache.spark.sql.AnalysisException Invalid number of arguments for function decode. Expected: 2; Found: 0;; line 1 pos 7 ``` ### Why are the changes needed? Fix a bug: the semicolon is added repeatedly. ### Does this PR introduce _any_ user-facing change? Yes. The message of AnalysisException will be correct. ### How was this patch tested? Jenkins test. Closes #30724 from beliefer/SPARK-33752. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 19:20:01 +09:00
Chongguang LIU	20f6d63bc1	[SPARK-33769][SQL] Improve the next-day function of the sql component to deal with Column type ### What changes were proposed in this pull request? The proposal of this pull request is described in this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-33769 It proposes to improve the next-day function of the sql component to deal with Column type for the parameter dayOfWeek. ### Why are the changes needed? It makes this functionality easier to use. Currently the signature of this function is: > def next_day(date: Column, dayOfWeek: String): Column. It accepts the dayOfWeek parameter as a String. However, in some cases the dayOfWeek is in a Column, so it has a different value for each row of the dataframe. A current workaround is to use the NextDay function like this: > NextDay(dateCol.expr, dayOfWeekCol.expr). The proposal is to add another signature for this function: > def next_day(date: Column, dayOfWeek: Column): Column In fact, this is already the case for some other functions in this Scala object, for example: > def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days)) > def date_sub(start: Column, days: Column): Column = withExpr { DateSub(start.expr, days.expr) } or > def add_months(startDate: Column, numMonths: Int): Column = add_months(startDate, lit(numMonths)) > def add_months(startDate: Column, numMonths: Column): Column = withExpr { > AddMonths(startDate.expr, numMonths.expr) > } This pull request applies the same idea to the function next_day. ### Does this PR introduce _any_ user-facing change? Yes. With this pull request, users of Spark will have a new signature of the function: > def next_day(date: Column, dayOfWeek: Column): Column But the existing function signature should still work: > def next_day(date: Column, dayOfWeek: String): Column So this change should be backward compatible. ### How was this patch tested? The unit tests of the next_day function have been enhanced. They test the dayOfWeek parameter both as String and Column. I also added a test case for the existing signature where the dayOfWeek is an invalid String. This should return null. Closes #30761 from chongguang/SPARK-33769. Authored-by: Chongguang LIU <chongguang.liu@laposte.fr> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 18:55:48 +09:00
Wenchen Fan	03042529e3	[SPARK-33273][SQL] Fix a race condition in subquery execution ### What changes were proposed in this pull request? If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time. Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs. Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear. This PR fixes the bug by reimplementing https://github.com/apache/spark/pull/30016 . We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called. ### Why are the changes needed? fix correctness bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? run `build/sbt "sql/testOnly *SQLQueryTestSuite -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes. Closes #30765 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-15 18:29:28 +09:00
Max Gekk	141e26d65b	[SPARK-33767][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. DROP PARTITION` parsing tests to `AlterTableDropPartitionParserSuite` 2. Place v1 tests for `ALTER TABLE .. DROP PARTITION` from `DDLSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to the common trait `AlterTableDropPartitionSuiteBase`, so, the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow to run common `ALTER TABLE .. DROP PARTITION` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly AlterTableDropPartitionParserSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" ``` Closes #30747 from MaxGekk/unify-alter-table-drop-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 05:36:57 +00:00
Terry Kim	366beda54a	[SPARK-33785][SQL] Migrate ALTER TABLE ... RECOVER PARTITIONS to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... RECOVER PARTITIONS` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ALTER TABLE ... RECOVER PARTITIONS` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("ALTER TABLE t RECOVER PARTITIONS") // works fine ``` , but after this PR: ``` sql("ALTER TABLE t RECOVER PARTITIONS") org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RECOVER PARTITIONS' expects a table; line 1 pos 0 ``` , which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE t RECOVER PARTITIONS` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30773 from imback82/alter_table_recover_part_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-15 05:23:39 +00:00
Chao Sun	49d3256497	[SPARK-33653][SQL] DSv2: REFRESH TABLE should recache the table itself ### What changes were proposed in this pull request? This changes DSv2 refresh table semantics to also recache the target table itself. ### Why are the changes needed? Currently "REFRESH TABLE" in DSv2 only invalidates all caches referencing the table. With #30403 merged, which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1. ### Does this PR introduce _any_ user-facing change? Yes, now refreshing a table in DSv2 also recaches the target table itself. ### How was this patch tested? Added coverage of this new behavior in the existing UT for the v2 refresh table command Closes #30742 from sunchao/SPARK-33653. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 15:18:50 -08:00
Max Gekk	f156718587	[SPARK-33777][SQL] Sort output of V2 SHOW PARTITIONS ### What changes were proposed in this pull request? List partitions returned by the V2 `SHOW PARTITIONS` command in alphabetical order. ### Why are the changes needed? To have the same behavior as: 1. V1 in-memory catalog, see `a28ed86a38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (L546)` 2. V1 Hive catalogs, see `fab2995972/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L715)` ### Does this PR introduce _any_ user-facing change? Yes, after the changes, V2 SHOW PARTITIONS sorts its output. ### How was this patch tested? Added new UT to the base trait `ShowPartitionsSuiteBase` which contains tests for V1 and V2. Closes #30764 from MaxGekk/sort-show-partitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 14:28:47 -08:00
Yuming Wang	412d86e711	[SPARK-33771][SQL][TESTS] Fix Invalid value for HourOfAmPm when testing on JDK 14 ### What changes were proposed in this pull request? This PR fixes the invalid value for HourOfAmPm when testing on JDK 14. ### Why are the changes needed? To run the tests on JDK 14. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30754 from wangyum/SPARK-33771. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 13:34:23 -08:00
Anton Okolnychyi	bb60fb1bbd	[SPARK-33779][SQL][FOLLOW-UP] Fix Java Linter error ### What changes were proposed in this pull request? This PR removes unused imports. ### Why are the changes needed? These changes are required to fix the build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Via `dev/lint-java`. Closes #30767 from aokolnychyi/fix-linter. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-14 11:39:42 -08:00
Anton Okolnychyi	82aca7eb8f	[SPARK-33779][SQL] DataSource V2: API to request distribution and ordering on write ### What changes were proposed in this pull request? This PR adds connector interfaces proposed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. Note: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful: - global sort - cluster data and sort within partitions - local sort within partitions - no sort Please see the design doc above for a more detailed explanation of requirements. ### Does this PR introduce _any_ user-facing change? This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info). The existing `Distribution` interface in `read` package is read-specific and not flexible enough like discussed in the design doc. The current proposal is to evolve these interfaces separately until they converge. ### How was this patch tested? This patch adds only interfaces. Closes #30706 from aokolnychyi/spark-23889-interfaces. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Ryan Blue <blue@apache.org>	2020-12-14 10:54:18 -08:00
ulysses-you	839d6899ad	[SPARK-33733][SQL] PullOutNondeterministic should check and collect deterministic field ### What changes were proposed in this pull request? The `deterministic` field covers more expressions than the `Nondeterministic` trait, so pull-out and check analysis should use the same range. ### Why are the changes needed? For example ``` select * from values(1), (4) as t(c1) order by java_method('java.lang.Math', 'abs', c1) ``` We get an exception because the `deterministic` field of `java_method` is false even though it is not a `Nondeterministic` expression ``` Exception in thread "main" org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found: java_method('java.lang.Math', 'abs', t.`c1`) ASC NULLS FIRST in operator Sort [java_method(java.lang.Math, abs, c1#1) ASC NULLS FIRST], true ;; ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #30703 from ulysses-you/SPARK-33733. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 14:35:24 +00:00
angerszhu	5f9a7fea06	[SPARK-33428][SQL] Conv UDF use BigInt to avoid Long value overflow ### What changes were proposed in this pull request? Storing the encoded value in a Long can overflow and return an unexpected result. Use BigInt instead of Long, which also makes the logic simpler. ### Why are the changes needed? Fix value overflow issue ### Does this PR introduce _any_ user-facing change? People can use the `conv` function to convert values bigger than Long.MAX_VALUE ### How was this patch tested? Added UT #### BenchMark ``` /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
/ package org.apache.spark.sql.execution.benchmark import scala.util.Random import org.apache.spark.benchmark.Benchmark import org.apache.spark.sql.functions._ object ConvFuncBenchMark extends SqlBasedBenchmark { val charset = Array[String]("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z") def constructString(from: Int, length: Int): String = { val chars = charset.slice(0, from) (0 to length).map(x => { val v = Random.nextInt(from) chars(v) }).mkString("") } private def doBenchmark(cardinality: Long, length: Int, from: Int, toBase: Int): Unit = { spark.range(cardinality) .withColumn("str", lit(constructString(from, length))) .select(conv(col("str"), from, toBase)) .noop() } /* * Main process of the whole benchmark. * Implementations of this method are supposed to use the wrapper method `runBenchmark` * for each benchmark scenario. */ override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val N = 1000000L val benchmark = new Benchmark("conv", N, output = output) benchmark.addCase("length 10 from 2 to 16") { _ => doBenchmark(N, 10, 2, 16) } benchmark.addCase("length 10 from 2 to 10") { _ => doBenchmark(N, 10, 2, 10) } benchmark.addCase("length 10 from 10 to 16") { _ => doBenchmark(N, 10, 10, 16) } benchmark.addCase("length 10 from 10 to 36") { _ => doBenchmark(N, 10, 10, 36) } benchmark.addCase("length 10 from 16 to 10") { _ => doBenchmark(N, 10, 10, 10) } benchmark.addCase("length 10 from 16 to 36") { _ => doBenchmark(N, 10, 16, 36) } benchmark.addCase("length 10 from 36 to 10") { _ => doBenchmark(N, 10, 36, 10) } benchmark.addCase("length 10 from 36 to 16") { _ => doBenchmark(N, 10, 36, 16) } // benchmark.addCase("length 20 from 10 to 16") { _ => doBenchmark(N, 20, 10, 16) } benchmark.addCase("length 20 from 10 to 36") { _ => doBenchmark(N, 20, 10, 36) } benchmark.addCase("length 30 from 10 to 16") { _ => doBenchmark(N, 30, 
10, 16) } benchmark.addCase("length 30 from 10 to 36") { _ => doBenchmark(N, 30, 10, 36) } // benchmark.addCase("length 20 from 16 to 10") { _ => doBenchmark(N, 20, 16, 10) } benchmark.addCase("length 20 from 16 to 36") { _ => doBenchmark(N, 20, 16, 36) } benchmark.addCase("length 30 from 16 to 10") { _ => doBenchmark(N, 30, 16, 10) } benchmark.addCase("length 30 from 16 to 36") { _ => doBenchmark(N, 30, 16, 36) } benchmark.run() } } ``` Result with patch : ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.6 Intel(R) Core(TM) i5-8259U CPU 2.30GHz conv: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ length 10 from 2 to 16 54 73 18 18.7 53.6 1.0X length 10 from 2 to 10 43 47 5 23.5 42.5 1.3X length 10 from 10 to 16 39 47 12 25.5 39.2 1.4X length 10 from 10 to 36 38 42 3 26.5 37.7 1.4X length 10 from 16 to 10 39 41 3 25.7 38.9 1.4X length 10 from 16 to 36 36 41 4 27.6 36.3 1.5X length 10 from 36 to 10 38 40 2 26.3 38.0 1.4X length 10 from 36 to 16 37 39 2 26.8 37.2 1.4X length 20 from 10 to 16 36 39 2 27.4 36.5 1.5X length 20 from 10 to 36 37 39 2 27.2 36.8 1.5X length 30 from 10 to 16 37 39 2 27.0 37.0 1.4X length 30 from 10 to 36 36 38 2 27.5 36.3 1.5X length 20 from 16 to 10 35 38 2 28.3 35.4 1.5X length 20 from 16 to 36 34 38 3 29.2 34.3 1.6X length 30 from 16 to 10 38 40 2 26.3 38.1 1.4X length 30 from 16 to 36 37 38 1 27.2 36.8 1.5X ``` Result without patch: ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.6 Intel(R) Core(TM) i5-8259U CPU 2.30GHz conv: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ length 10 from 2 to 16 66 101 29 15.1 66.1 1.0X length 10 from 2 to 10 50 55 5 20.2 49.5 1.3X length 10 from 10 to 16 46 51 5 21.8 45.9 1.4X length 
10 from 10 to 36 43 48 4 23.4 42.7 1.5X length 10 from 16 to 10 44 47 4 22.9 43.7 1.5X length 10 from 16 to 36 40 44 2 24.7 40.5 1.6X length 10 from 36 to 10 40 44 4 25.0 40.1 1.6X length 10 from 36 to 16 41 43 2 24.3 41.2 1.6X length 20 from 10 to 16 39 41 2 25.7 38.9 1.7X length 20 from 10 to 36 40 42 2 24.9 40.2 1.6X length 30 from 10 to 16 39 40 1 25.9 38.6 1.7X length 30 from 10 to 36 40 41 1 25.0 40.0 1.7X length 20 from 16 to 10 40 41 1 25.1 39.8 1.7X length 20 from 16 to 36 40 42 2 25.2 39.7 1.7X length 30 from 16 to 10 39 42 2 25.6 39.0 1.7X length 30 from 16 to 36 39 40 2 25.7 38.8 1.7X ``` Closes #30350 from AngersZhuuuu/SPARK-33428. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 14:32:08 +00:00
yangjie01	cd0356df9e	[SPARK-33673][SQL] Avoid push down partition filters to ParquetScan for DataSourceV2 ### What changes were proposed in this pull request? As described in SPARK-33673, some tests in `ParquetV2SchemaPruningSuite` fail when `parquet.version` is set to 1.11.1, because Parquet returns empty results for non-existent columns since PARQUET-1765. This PR uses `readDataSchema()` instead of `schema` to build `pushedParquetFilters` in `ParquetScanBuilder`, which avoids pushing down partition filters to `ParquetScan` for `DataSourceV2`. ### Why are the changes needed? To prepare for the upgrade to Parquet 1.11.1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test as follows: ``` mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite -Dparquet.version=1.11.1 test -pl sql/core -am ``` Before ``` Run completed in 3 minutes, 13 seconds. Total number of tests run: 134 Suites: completed 2, aborted 0 Tests: succeeded 120, failed 14, canceled 0, ignored 0, pending 0 * 14 TESTS FAILED * ``` After ``` Run completed in 3 minutes, 46 seconds. Total number of tests run: 134 Suites: completed 2, aborted 0 Tests: succeeded 134, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #30652 from LuciferYang/SPARK-33673. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-12-14 17:51:40 +08:00
Terry Kim	a84c8d842c	[SPARK-33751][SQL] Migrate ALTER VIEW ... AS command to use UnresolvedView to resolve the identifier ### What changes were proposed in this pull request? This PR migrates `ALTER VIEW ... AS` to use `UnresolvedView` to resolve the view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). The `TempViewOrV1Table` extractor in `ResolveSessionCatalog.scala` can now be removed as well. ### Why are the changes needed? To use `UnresolvedView` for view resolution. ### Does this PR introduce _any_ user-facing change? The exception message changes if a table is found instead of view: ``` // OLD `tab1` is not a view" ``` ``` // NEW "tab1 is a table. 'ALTER VIEW ... AS' expects a view." ``` ### How was this patch tested? Updated existing tests. Closes #30723 from imback82/alter_view_as_statement. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 08:39:01 +00:00
Linhong Liu	b7c8210135	[SPARK-33142][SPARK-33647][SQL][FOLLOW-UP] Add docs and test cases ### What changes were proposed in this pull request? Addressed comments in PR #30567, including: 1. add test case for SPARK-33647 and SPARK-33142 2. add migration guide 3. add `getRawTempView` and `getRawGlobalTempView` to return the raw view info (i.e. TemporaryViewRelation) 4. other minor code clean ### Why are the changes needed? Code clean and more test cases ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing and newly added test cases Closes #30666 from linhongliu-db/SPARK-33142-followup. Lead-authored-by: Linhong Liu <linhong.liu@databricks.com> Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 08:31:50 +00:00
xuewei.linxuewei	e7fe92f129	[SPARK-33546][SQL] Enable row format file format validation in CREATE TABLE LIKE ### What changes were proposed in this pull request? [SPARK-33546] states that there are three inconsistent behaviors in CREATE TABLE LIKE. 1. CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., STORED AS PARQUET can't be used with ROW FORMAT SERDE. 2. CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified together, which is not necessary. 3. CREATE TABLE LIKE does not respect the default hive serde. This PR fixes No.1; after investigation, No.2 and No.3 turn out not to be issues. Within Hive, CREATE TABLE abc ... ROW FORMAT SERDE 'xxx.xxx.SerdeClass' (without STORED AS) has the following result: it uses the user-specified SerdeClass and takes the default input/output formats from the default textfile format. ``` SerDe Library: xxx.xxx.SerdeClass InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ``` But CREATE TABLE dst LIKE src ROW FORMAT SERDE 'xxx.xxx.SerdeClass' (without STORED AS) would just ignore the user-specified SerdeClass and use (input, output, serdeClass) from the src table. It's better to just throw an exception on such ambiguous behavior, so No.2 is not an issue, but this PR adds some comments. For No.3, CreateTableLikeCommand in fact uses the following logic to follow the src table's storageFormat if the current fileFormat.inputFormat is empty ``` val newStorage = if (fileFormat.inputFormat.isDefined) { fileFormat } else { sourceTableDesc.storage.copy(locationUri = fileFormat.locationUri) } ``` If we filled the new target table with HiveSerDe.getDefaultStorage when the file format and row format are not explicitly specified, it would break the CREATE TABLE LIKE semantics. ### Why are the changes needed? Bug Fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and Existing UT. 
Closes #30705 from leanken/leanken-SPARK-33546. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 08:27:18 +00:00
Max Gekk	817f58ddcb	[SPARK-33768][SQL] Remove `retainData` from `AlterTableDropPartition` ### What changes were proposed in this pull request? Remove the `retainData` parameter from the logical node `AlterTableDropPartition`. ### Why are the changes needed? The `AlterTableDropPartition` command reflects the sql statement (see SqlBase.g4): ``` | ALTER (TABLE | VIEW) multipartIdentifier DROP (IF EXISTS)? partitionSpec (',' partitionSpec)* PURGE? #dropTablePartitions ``` but Spark doesn't allow specifying data retention. So, the parameter can be removed to improve code maintenance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the test suite `DDLParserSuite`. Closes #30748 from MaxGekk/remove-retainData. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 08:16:33 +00:00
Max Gekk	9160d59ae3	[SPARK-33770][SQL][TESTS] Fix the `ALTER TABLE .. DROP PARTITION` tests that delete files out of partition path ### What changes were proposed in this pull request? Modify the tests that add partitions with `LOCATION` where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) a folder outside the "base" path in `LOCATION`. The problem lies in Hive's MetaStore method `drop_partition_common`: `8696c82d07/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java (L4876)` which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. When the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside the base path. ### Why are the changes needed? To fix test failures like https://github.com/apache/spark/pull/30643#issuecomment-743774733: ``` org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112) at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014) ... 
Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381) at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source) ``` The issue can be reproduced by the following steps: 1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location` 2. Create a sub-folder in the base folder and drop permissions for it: ``` $ mkdir /Users/maximgekk/tmp/part-location/aaa $ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa $ ls -al /Users/maximgekk/tmp/part-location total 0 drwxr-xr-x 3 maximgekk staff 96 Dec 13 18:42 . drwxr-xr-x 33 maximgekk staff 1056 Dec 13 18:32 .. d--------- 2 maximgekk staff 64 Dec 13 18:42 aaa ``` 3. Create a table with a partition folder in the base folder: ```sql spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int); spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl'; ``` 4. 
Try to drop this partition: ``` spark-sql> alter table tbl drop partition (part0=1,part1=2); 20/12/13 18:46:07 ERROR HiveClientImpl: ====================== Attempt to drop the partition specs in table 'tbl' database 'default': Map(part0 -> 1, part1 -> 2) In this attempt, the following partitions have been dropped successfully: The remaining partitions have not been dropped: [1, 2] ====================== Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa; org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa; ``` The command fails because it tries to access the sub-folder `aaa`, which is outside the partition path `/Users/maximgekk/tmp/part-location/tbl`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected tests from a local IDEA that does not have access to folders outside the partition paths. Closes #30752 from MaxGekk/fix-drop-partition-location. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-14 15:56:46 +09:00
Kent Yao	4d47ac4b4b	[SPARK-33705][SQL][TEST] Fix HiveThriftHttpServerSuite flakiness ### What changes were proposed in this pull request? To fix flaky tests: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132345/testReport/ ``` org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.Checks Hive version org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.SPARK-24829 Checks cast as float ``` The root cause here is a jar conflict issue. `NewCookie.isHttpOnly` is not defined in the conflicting `jsr311-api.jar`. The transitive artifact `jsr311-api.jar` of `hadoop-client` is excluded at the maven side. See https://issues.apache.org/jira/browse/SPARK-27179. The Jenkins PR builder and GitHub Actions use `SBT` as the compiler tool. First, the exclusion rule from maven is not followed by sbt, so I was able to see `jsr311-api.jar` from the maven cache being added to the classpath directly. This seems to be a bug in the `sbt-pom-reader` plugin, but I'm not that sure. Then I added an `ExcludeRule` for the `hive-thriftserver` module at the SBT side and did see the `jsr311-api.jar` gone, but the CI jobs still failed with the same error. I added a trace log in ThriftHttpServlet ```s ERROR ThriftHttpServlet: !!!!!!!!! Suspect???????? ---> file:/home/jenkins/workspace/SparkPullRequestBuilder/assembly/target/scala-2.12/jars/jsr311-api-1.1.1.jar ``` And the log pointed out that the assembly phase copied it to `assembly/target/scala-2.12/jars/`, which will be added to the classpath too. With the help of the SBT `dependencyTree` tool, I saw `jsr311-api` again as a transitive dependency of `jersey-core` from the `yarn` module with a `test` scope. So this seems to be another bug on the SBT side, in the `sbt-assembly` plugin: it copied a test-scope transitive artifact to the assembly output. In this PR, I defined some rules in SparkBuild.scala to bypass the potential bugs from the SBT side. First, exclude `jsr311` from the whole project and then add it back separately to the YARN module for SBT. Additionally, the HiveThriftServerSuites were refactored to reduce flakiness too, but this is not related to the bugs I have found so far. ### Why are the changes needed? To fix the flaky tests listed above. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Passing Jenkins and GitHub Actions. Closes #30643 from yaooqinn/HiveThriftHttpServerSuite. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-14 05:14:38 +00:00
Gengliang Wang	6e862792fb	[SPARK-33723][SQL] ANSI mode: Casting String to Date should throw exception on parse error ### What changes were proposed in this pull request? Currently, when casting a string to timestamp type in ANSI mode, Spark throws a runtime exception on a parsing error. However, the result of casting an unparsable string to date is always null. We should throw an exception on a parsing error there as well. ### Why are the changes needed? Add missing feature for ANSI mode ### Does this PR introduce _any_ user-facing change? Yes. In ANSI mode, casting a string to date throws an exception on a parsing error. ### How was this patch tested? Unit test Closes #30687 from gengliangwang/castDate. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-14 10:22:37 +09:00
Takeshi Yamamuro	8197ee3b15	[SPARK-33690][SQL] Escape meta-characters in showString ### What changes were proposed in this pull request? This PR intends to escape meta-characters (e.g., \n and \t) in `Dataset.showString`. Before this PR: ``` scala> Seq("aaa\nbbb\t\tccccc").toDF("value").show() +--------------+ | value| +--------------+ |aaa bbb ccccc| +--------------+ ``` After this PR: ``` +-----------------+ | value| +-----------------+ |aaa\nbbb\t\tccccc| +-----------------+ ``` ### Why are the changes needed? For better output. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test. Closes #30647 from maropu/EscapeMetaInShow. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-13 15:04:23 -08:00
Liang-Chi Hsieh	45af3c9688	[SPARK-33764][SS] Make state store maintenance interval as SQL config ### What changes were proposed in this pull request? Currently the maintenance interval is hard-coded in `StateStore`. This patch proposes to make it a SQL config. ### Why are the changes needed? Currently the maintenance interval is hard-coded in `StateStore`. For consistency, it should be placed together with the other SS configs. SQLConf also offers a better way to set the doc and default value. ### Does this PR introduce _any_ user-facing change? Yes. Previously users used a Spark config to set the maintenance interval. Now they can use a SQL config to set it. ### How was this patch tested? Unit test. Closes #30741 from viirya/maintenance-interval-sqlconfig. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-13 14:57:09 -08:00
Yuming Wang	94bc2d61a2	[SPARK-33589][SQL][FOLLOWUP] Replace Throwable with NonFatal ### What changes were proposed in this pull request? This PR replaces `Throwable` with `NonFatal`. ### Why are the changes needed? To improve the code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30744 from wangyum/SPARK-33589-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-13 14:52:26 -08:00
Chao Sun	be09d37398	[SPARK-33729][SQL] When refreshing cache, Spark should not use cached plan when recaching data ### What changes were proposed in this pull request? This fixes `CatalogImpl.refreshTable` by using a new logical plan when recaching the target table. ### Why are the changes needed? In `CatalogImpl.refreshTable`, we currently recache the target table via: ```scala sparkSession.sharedState.cacheManager.cacheQuery(table, cacheName, cacheLevel) ``` However, here `table` is generated before the `tableRelationCache` in `SessionCatalog` is invalidated, and therefore it still refers to the old, stale logical plan, which is incorrect. ### Does this PR introduce _any_ user-facing change? Yes, this fixes the behavior when a table is refreshed. ### How was this patch tested? Added a unit test. Closes #30699 from sunchao/SPARK-33729. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-11 14:43:51 -08:00
ulysses-you	5bab27e00b	[SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver ### What changes were proposed in this pull request? This PR adds a new config `spark.sql.thriftServer.forceCancel` that gives users a way to interrupt tasks when a statement is canceled. ### Why are the changes needed? After [#29933](https://github.com/apache/spark/pull/29933), we support canceling a query on timeout, but the default behavior of `SparkContext.cancelJobGroups` doesn't interrupt tasks and just lets them finish by themselves. In some cases this is dangerous, e.g., with data skew or a heavy shuffle: a task can keep running for a long time after the cancel, and its resources are not released. ### Does this PR introduce _any_ user-facing change? Yes, a new config. ### How was this patch tested? Add test. Closes #30481 from ulysses-you/SPARK-33526. Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com> Co-authored-by: ulysses-you <youxiduo@weidian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-12 00:52:33 +09:00
Max Gekk	8b97b19ffa	[SPARK-33706][SQL] Require fully specified partition identifier in partitionExists() ### What changes were proposed in this pull request? 1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields). 2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`. ### Why are the changes needed? The method is supposed to check the existence of one partition, but currently it can return `true` for a partially specified partition. This can lead to incorrect command behavior; for instance, the commands could modify or place data in the middle of a partition path. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running existing test suites: ``` $ build/sbt "test:testOnly AlterTablePartitionV2SQLSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly SupportsPartitionManagementSuite" ``` Closes #30667 from MaxGekk/check-len-partitionExists. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-11 12:48:40 +00:00
Terry Kim	8f5db716fa	[SPARK-33654][SQL] Migrate CACHE TABLE to use UnresolvedRelation to resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `CACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022. ### Why are the changes needed? To resolve the table in the analyzer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #30598 from imback82/cache_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-11 12:39:58 +00:00
Kousuke Saruta	8377aca60a	[SPARK-33527][SQL][FOLLOWUP] Fix the scala 2.13 build failure ### What changes were proposed in this pull request? This PR fixes the Scala 2.13 build failure brought by #30479. ### Why are the changes needed? To pass the Scala 2.13 build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Should be done by GitHub Actions. Closes #30727 from sarutak/fix-scala213-build-failure. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-11 01:53:41 -08:00
Josh Soref	c05f6f98b6	[MINOR][SQL] Spelling: enabled - legacy_setops_precedence_enbled ### What changes were proposed in this pull request? Replace `legacy_setops_precedence_enbled` with `legacy_setops_precedence_enabled` Alternatively, `legacy_setops_precedence_enabled` could be added, and `legacy_setops_precedence_enbled` retained, and if set the code could honor it and warn about the deprecated spelling. ### Why are the changes needed? `enabled` is misspelled in `legacy_setops_precedence_enbled` ### Does this PR introduce _any_ user-facing change? Yes. It would break current consumers. Examples include: * https://www.programmersought.com/article/87752082924/ * `125d873c38/fugue_sql/_antlr/fugue_sqlLexer.py` * https://github.com/search?q=legacy_setops_precedence_enbled&type=code ### How was this patch tested? It's been included in #30323 for a while (and is now split out here) Closes #30677 from jsoref/spelling-enabled. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-11 06:49:45 +00:00
Dongjoon Hyun	8ac86a4c31	[SPARK-33750][SQL][TESTS] Use `hadoop-3.2` distribution in HiveExternalCatalogVersionsSuite ### What changes were proposed in this pull request? This PR aims to use the `hadoop-3.2` distribution in HiveExternalCatalogVersionsSuite if available. ### Why are the changes needed? Apache Spark 3.1 is using Hadoop 3 by default. We need to focus more on Hadoop 3 to prepare for the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30722 from dongjoon-hyun/SPARK-33750. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-10 22:32:23 -08:00
gengjiaan	24d7e45d31	[SPARK-33527][SQL] Extend the function of decode so as consistent with mainstream databases ### What changes were proposed in this pull request? In Spark, decode(bin, charset) - Decodes the first argument using the second argument character set. Unfortunately this is NOT what any other SQL vendor understands `DECODE` to do. `DECODE` generally is a shorthand for a simple case expression: ``` SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS T(c1) => (Hello), (World), (!) ``` Several mainstream databases support this syntax: Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/DECODE.html#GUID-39341D91-3442-4730-BD34-D3CF5D4701CE Vertica https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/DECODE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____10 DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1447.htm Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_DECODE_expression.html Pig https://pig.apache.org/docs/latest/api/org/apache/pig/piggybank/evaluation/decode/Decode.html Teradata https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/jtCpCycpEaXESG4d63kMjg Snowflake https://docs.snowflake.com/en/sql-reference/functions/decode.html ### Why are the changes needed? It is very useful. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Jenkins test. Closes #30479 from beliefer/SPARK-33527. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-11 05:52:33 +00:00
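The "simple case expression" reading of `DECODE` in the commit above can be modeled in a few lines. This is only a sketch of the semantics shown in the message's example, not Spark's implementation; the function name `decode` and its argument layout (search/result pairs followed by an optional default) are assumptions made for illustration, and NULL-matching rules of the individual databases are not modeled.

```python
def decode(expr, *args):
    """Model of DECODE(expr, search1, result1, ..., [default]).

    Returns the result paired with the first search value equal to expr,
    or the default (None when no default is given).
    """
    pairs, default = args, None
    if len(args) % 2 == 1:
        # An odd number of trailing arguments means the last one is the default.
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default


# Mirrors: SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS T(c1)
rows = [decode(c1, 1, "Hello", 2, "World", "!") for c1 in (1, 2, 3)]
print(rows)
```

The third row falls through both search values and picks up the default `'!'`, which is the behavior the message contrasts with Spark's earlier charset-decoding `decode(bin, charset)`.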
Max Gekk	fab2995972	[SPARK-33742][SQL] Throw PartitionsAlreadyExistException from HiveExternalCatalog.createPartitions() ### What changes were proposed in this pull request? Throw `PartitionsAlreadyExistException` from `createPartitions()` in Hive external catalog when a partition exists. Currently, `HiveExternalCatalog.createPartitions()` throws `AlreadyExistsException` wrapped by `AnalysisException`. In the PR, I propose to catch `AlreadyExistsException` in `HiveClientImpl` and replace it by `PartitionsAlreadyExistException`. ### Why are the changes needed? The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `PartitionsAlreadyExistException`. To improve user experience with Spark SQL, it would be better to throw the same exception. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running existing test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite" ``` Closes #30711 from MaxGekk/hive-partition-exception. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-10 17:49:56 -08:00
Kent Yao	31e0baca30	[SPARK-33740][SQL] hadoop configs in hive-site.xml can overrides pre-existing hadoop ones ### What changes were proposed in this pull request? `org.apache.hadoop.conf.Configuration#setIfUnset` also ignores keys that have default values. ### Why are the changes needed? To fix a regression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #30709 from yaooqinn/SPARK-33740. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-10 16:32:24 -08:00
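The pitfall named in the commit above, that a "set if unset" call skips keys which only have a default value, can be illustrated with a small stand-in. The class `Conf` below is hypothetical and is not Hadoop's `Configuration`; it assumes that `get` falls back to built-in defaults and that `set_if_unset` tests `get(key) is None`. The key `fs.defaultFS` and the value `hdfs://cluster` are only sample data.

```python
class Conf:
    """Hypothetical stand-in for a configuration with built-in defaults."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def get(self, key):
        # Explicit settings win; otherwise fall back to the default, if any.
        return self._overrides.get(key, self._defaults.get(key))

    def set(self, key, value):
        self._overrides[key] = value

    def set_if_unset(self, key, value):
        # A key with a default is already visible through get(), so it counts as "set".
        if self.get(key) is None:
            self.set(key, value)


conf = Conf(defaults={"fs.defaultFS": "file:///"})
conf.set_if_unset("fs.defaultFS", "hdfs://cluster")   # silently ignored
conf.set_if_unset("custom.key", "value")              # applied, no default exists
print(conf.get("fs.defaultFS"), conf.get("custom.key"))
```

Under these assumptions a value read from a site file never reaches a key that ships with a default, which is the kind of silent drop the message describes.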
Linhong Liu	1554977670	[SPARK-33692][SQL] View should use captured catalog and namespace to lookup function ### What changes were proposed in this pull request? Use the catalog and namespace captured by the view to look up functions, so that functions referenced by the view won't be overridden by a newly created function with the same name but a different database or function type (i.e. temporary function). ### Why are the changes needed? Bug fix. Without this PR, changing the database or creating a temporary function with the same name may cause failures when querying a view. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Newly added and existing test cases. Closes #30662 from linhongliu-db/SPARK-33692. Lead-authored-by: Linhong Liu <linhong.liu@databricks.com> Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-10 09:14:07 +00:00
gengjiaan	cef28c2c51	[SPARK-32670][SQL][FOLLOWUP] Group exception messages in Catalyst Analyzer in one file ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/29497. That PR only gave an example of grouping all `AnalysisException`s in Analyzer into QueryCompilationErrors. This PR groups the other `AnalysisException`s into QueryCompilationErrors. ### Why are the changes needed? It will largely help with standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #30564 from beliefer/SPARK-32670-followup. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-10 08:38:24 +00:00
Terry Kim	b112e2bfa6	[SPARK-33714][SQL] Migrate ALTER VIEW ... SET/UNSET TBLPROPERTIES commands to use UnresolvedView to resolve the identifier ### What changes were proposed in this pull request? This PR adds `allowTemp` flag to `UnresolvedView` so that `Analyzer` can check whether to resolve temp views or not. This PR also migrates `ALTER VIEW ... SET/UNSET TBLPROPERTIES` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To use `UnresolvedView` for view resolution. One benefit is that the exception message is better for `ALTER VIEW ... SET/UNSET TBLPROPERTIES`. Before, if a temp view was passed, you would just get `NoSuchTableException` with `Table or view 'tmpView' not found in database 'default'`. With this PR, you get a more descriptive exception message: `tmpView is a temp view. ALTER VIEW ... SET TBLPROPERTIES expects a permanent view`. ### Does this PR introduce _any_ user-facing change? The exception message changes as described above. ### How was this patch tested? Updated existing tests. Closes #30676 from imback82/alter_view_set_unset_properties. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-10 05:18:34 +00:00
Max Gekk	af37c7f411	[SPARK-33558][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. ADD PARTITION` parsing tests to `AlterTableAddPartitionParserSuite` 2. Place v1 tests for `ALTER TABLE .. ADD PARTITION` from `DDLSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to the common trait `AlterTableAddPartitionSuiteBase`, so, the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow to run common `ALTER TABLE .. ADD PARTITION` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite" ``` Closes #30685 from MaxGekk/unify-alter-table-add-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-10 04:54:52 +00:00
Anton Okolnychyi	fa9ce1d4e8	[SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate ### What changes were proposed in this pull request? This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`. ### Why are the changes needed? This change allows Spark to optimize delete conditions like we optimize filters. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR extends the existing test cases to also cover `DeleteFromTable`. Closes #30688 from aokolnychyi/spark-33722. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-09 11:42:54 -08:00
HyukjinKwon	b5399d4ef1	[SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing. ### Why are the changes needed? To make it easier to maintain and read. ### Does this PR introduce _any_ user-facing change? No. This is rather a code cleanup. ### How was this patch tested? Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them. Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-09 20:26:18 +09:00
Dooyoung Hwang	a713a7eee3	[SPARK-33655][SQL] Improve performance of processing FETCH_PRIOR ### What changes were proposed in this pull request? Currently, when a client requests FETCH_PRIOR from Thriftserver, Thriftserver reiterates from the start position. Because Thriftserver caches a query result in an array when the THRIFTSERVER_INCREMENTAL_COLLECT feature is off, FETCH_PRIOR can be implemented without reiterating the result. A trait FetchIterator is added in order to separate the implementations for an iterator and an array. FetchIterator also supports moving the cursor to an absolute position, which will be useful for the implementation of FETCH_RELATIVE and FETCH_ABSOLUTE. ### Why are the changes needed? For better performance of Thriftserver. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? FetchIteratorSuite Closes #30600 from Dooyoung-Hwang/refactor_with_fetch_iterator. Authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-09 18:35:24 +09:00
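The idea in the commit above, that a result cached in an array lets the cursor jump backwards without replaying the query, can be sketched as follows. The class `ArrayFetchIterator` and its method names are hypothetical and chosen for illustration only; they are not the trait added by the PR.

```python
class ArrayFetchIterator:
    """Hypothetical cursor over a fully materialized result."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._fetch_start = 0   # where the most recent fetch began
        self._position = 0      # next row to hand out

    def fetch_absolute(self, pos):
        # Clamp to the valid range and move the cursor directly: O(1) on an array.
        self._position = max(0, min(pos, len(self._rows)))
        self._fetch_start = self._position

    def fetch_next(self):
        # The next batch starts where the previous one ended.
        self._fetch_start = self._position

    def fetch_prior(self, offset):
        # Step back relative to the start of the previous batch.
        self.fetch_absolute(self._fetch_start - offset)

    def take(self, n):
        end = min(self._position + n, len(self._rows))
        batch = self._rows[self._position:end]
        self._position = end
        return batch


it = ArrayFetchIterator(range(10))
it.fetch_next()
first = it.take(3)
it.fetch_next()
second = it.take(3)
it.fetch_prior(3)       # go back one batch of three rows
again = it.take(3)
print(first, second, again)
```

An iterator-backed variant would have to restart and skip rows to reach an earlier position, which is the cost the array-backed path avoids.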
Terry Kim	29fed23ba1	[SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `MSCK REPAIR TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `MSCK REPAIR TABLE` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution behavior consistent. For example, before this PR: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("MSCK REPAIR TABLE t") // works fine ``` but after this PR: ``` sql("MSCK REPAIR TABLE t") org.apache.spark.sql.AnalysisException: t is a temp view. 'MSCK REPAIR TABLE' expects a table; line 1 pos 0 ``` which is consistent with the behavior of other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `MSCK REPAIR TABLE t` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30664 from imback82/repair_table_V2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-09 05:06:37 +00:00
Wenchen Fan	6fd234503c	[SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++ ### What changes were proposed in this pull request? Currently, Spark treats 0.0 and -0.0 as semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set. The comparison expressions in Spark take care of the special floating numbers and implement the correct semantics. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places: 1. GROUP BY 2. join keys 3. window partition keys This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++ ### Why are the changes needed? Fix the query result ### Does this PR introduce _any_ user-facing change? Yes, the result of HyperLogLog++ becomes correct now. ### How was this patch tested? A new test case, and a few more test cases that pass before this PR to improve test coverage. Closes #30673 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-08 11:41:35 -08:00
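Why normalization matters for a sketch that compares values by their raw representation, as described in the commit above, can be shown with bit patterns. The code below is a model only: an exact set of 8-byte encodings stands in for the hash-based HyperLogLog++ sketch, the helper names `raw_bits` and `normalize` are hypothetical, and folding NaN to a single canonical value is an assumption beyond the 0.0/-0.0 case spelled out in the message.

```python
import math
import struct


def raw_bits(x: float) -> bytes:
    """Big-endian IEEE 754 encoding, i.e. what a byte-level hash would see."""
    return struct.pack(">d", x)


def normalize(x: float) -> float:
    if math.isnan(x):
        return math.nan   # assumption: fold every NaN to one canonical NaN
    if x == 0.0:
        return 0.0        # -0.0 == 0.0 is True, so this folds -0.0 into 0.0
    return x


values = [0.0, -0.0, 1.5]

# Without normalization, 0.0 and -0.0 differ in the sign bit and count twice.
distinct_raw = {raw_bits(v) for v in values}
# With normalization, they collapse into one value before being encoded.
distinct_norm = {raw_bits(normalize(v)) for v in values}

print(len(distinct_raw), len(distinct_norm))
```

A distinct count built on the raw encodings overstates the number of semantically distinct values by one here; normalizing first brings it in line with what the comparison expressions would report.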
Josh Soref	a093d6feef	[MINOR] Spelling sql/core ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `sql/core` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30531 from jsoref/spelling-sql-core. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-08 08:57:13 -06:00
Terry Kim	c05ee06f5b	[SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier ### What changes were proposed in this pull request? This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier. This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30636 from imback82/drop_view_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 14:07:58 +00:00
Max Gekk	2b30dde249	[SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework ### What changes were proposed in this pull request? 1. Remove old statement `ShowTableStatement` 2. Introduce new command `ShowTableExtended` for `SHOW TABLE EXTENDED`. This PR is the first step of new V2 implementation of `SHOW TABLE EXTENDED`, see SPARK-33393. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: SPARK-29900. ### Does this PR introduce _any_ user-facing change? The changes should not affect V1 tables. For V2, Spark outputs the error: ``` SHOW TABLE EXTENDED is not supported for v2 tables. ``` ### How was this patch tested? By running `SHOW TABLE EXTENDED` tests: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite" ``` Closes #30645 from MaxGekk/show-table-extended-statement. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 12:08:22 +00:00
luluorta	99613cd581	[SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar ### What changes were proposed in this pull request? The `LikeSimplification` rule does not work correctly for many patterns that contain escape characters, for example: `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t` `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t` For simplicity, this PR skips the rule if `pattern` contains any `escapeChar`. ### Why are the changes needed? To avoid incorrect results. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test. Closes #30625 from luluorta/SPARK-33677. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-08 20:45:25 +09:00
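The guard described in the commit above can be sketched as a rewrite function that gives up as soon as the pattern contains the escape character. This is a reduced, hypothetical version for illustration: the function `simplify_like`, its return shape, and the four rewrites it knows are assumptions, and Spark's actual rule is not reproduced here.

```python
def simplify_like(pattern: str, escape_char: str = "\\"):
    """Return (operation, literal) for trivially simple LIKE patterns, else None.

    None means "leave the LIKE expression as it is".
    """
    if escape_char in pattern:
        # The safeguard: any escape character disables the rewrite entirely.
        return None
    lead = pattern.startswith("%")
    trail = pattern.endswith("%") and len(pattern) > 1
    start = 1 if lead else 0
    stop = len(pattern) - 1 if trail else len(pattern)
    literal = pattern[start:stop]
    if "%" in literal or "_" in literal:
        return None  # wildcards in the middle are out of scope for this sketch
    if lead and trail:
        return ("contains", literal)
    if trail:
        return ("startswith", literal)
    if lead:
        return ("endswith", literal)
    return ("equals", literal)


print(simplify_like("abc%"))                    # no escape char: rewritten
print(simplify_like("m%aca", escape_char="%"))  # from the commit message: skipped
print(simplify_like("maacaa", escape_char="a")) # from the commit message: skipped
```

Skipping is conservative: some escaped patterns could still be simplified safely, but bailing out keeps the rule easy to reason about, which matches the "for simplicity" wording of the message.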
Dongjoon Hyun	031c5ef280	[SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default ### What changes were proposed in this pull request? This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark 3.2.0. ### Why are the changes needed? By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature in various situations. ### Does this PR introduce _any_ user-facing change? Yes, but this is an improvement and it's supposed to have no bugs. ### How was this patch tested? Pass the CIs. Closes #30628 from dongjoon-hyun/SPARK-33679. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 23:10:35 -08:00
Terry Kim	5aefc49b0f	[SPARK-33664][SQL] Migrate ALTER TABLE ... RENAME TO to use UnresolvedTableOrView to resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER [TABLE|VIEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30610 from imback82/rename_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 03:54:16 +00:00
Dongjoon Hyun	b2a79306ef	[SPARK-33680][SQL][TESTS][FOLLOWUP] Fix more test suites to have explicit confs ### What changes were proposed in this pull request? This is a follow-up for SPARK-33680 to remove the assumption on the default value of `spark.sql.adaptive.enabled` . ### Why are the changes needed? According to the test result https://github.com/apache/spark/pull/30628#issuecomment-739866168, the [previous run](https://github.com/apache/spark/pull/30628#issuecomment-739641105) didn't run all tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30655 from dongjoon-hyun/SPARK-33680. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 18:59:15 -08:00
Anton Okolnychyi	02508b68ec	[SPARK-33621][SQL] Add a way to inject data source rewrite rules ### What changes were proposed in this pull request? This PR adds a way to inject data source rewrite rules. ### Why are the changes needed? Right now `SparkSessionExtensions` allow us to inject optimization rules but they are added to operator optimization batch. There are cases when users need to run rules after the operator optimization batch (e.g. cases when a rule relies on the fact that expressions have been optimized). Currently, this is not possible. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR comes with a new test. Closes #30577 from aokolnychyi/spark-33621-v3. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 15:32:10 -08:00
Wenchen Fan	c0874ba9f1	[SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of the char/varchar table insertion length check so that it does not expose user data. ### Why are the changes needed? It is risky to expose user data in the error message, especially string data, as it may contain sensitive information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated tests. Closes #30653 from cloud-fan/minor2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 13:35:37 -08:00
Wenchen Fan	6aff215077	[SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30554 . Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS. ### Why are the changes needed? It's confusing to have two configs when one completely covers the other. ### Does this PR introduce _any_ user-facing change? No, it's deprecating, not removing. ### How was this patch tested? N/A Closes #30651 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 10:50:31 -08:00
Josh Soref	c62b84a043	[MINOR] Spelling sql not core ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `sql/catalyst` * `sql/hive-thriftserver` * `sql/hive` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30532 from jsoref/spelling-sql-not-core. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-07 08:40:29 -06:00
Kent Yao	da72b87374	[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results ### What changes were proposed in this pull request? In this PR, we propose to narrow the use cases of the char/varchar data types, some of which are invalid now or will be later. ### Why are the changes needed? 1. udf ```scala scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2)) scala> spark.sql("select abcd()").show scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212) at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245) at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606) ... 47 elided ``` 2. spark.createDataFrame ``` scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show +--------------------+ | c| +--------------------+ | # Apache Spark| | | |Spark is a unifie...| |high-level APIs i...| |supports general ...| |rich set of highe...| |MLlib for machine...| |and Structured St...| | | |<https://spark.ap...| | | |[![Jenkins Build]...| |[![AppVeyor Build...| |[![PySpark Covera...| | | | | ``` 3. reader.schema ``` scala> spark.read.schema("a varchar(2)").text("./README.md").show(100) +--------------------+ | a| +--------------------+ | # Apache Spark| | | |Spark is a unifie...| |high-level APIs i...| |supports general ...| ``` 4. etc ### Does this PR introduce _any_ user-facing change? No, we intend to avoid potential breaking changes. ### How was this patch tested? New tests. Closes #30586 from yaooqinn/SPARK-33641. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 13:40:15 +00:00
Linhong Liu	d730b6bdaa	[SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query ### What changes were proposed in this pull request? The analyzer rule `PreprocessTableCreation` preprocesses logical plans related to table creation. But for CTAS, if the sub-query can't be resolved, preprocessing it causes "Invalid call to toAttribute on unresolved object" (instead of a user-friendly error message: "table or view not found"). This PR fixes this incorrect preprocessing for CTAS using a V2 catalog. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? The error message for CTAS with a non-existent table changed from: `UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to `AnalysisException: Table or view not found: xxx` ### How was this patch tested? Added test. Closes #30637 from linhongliu-db/fix-ctas. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 13:25:43 +00:00
Yuming Wang	1e0c006748	[SPARK-33617][SQL] Add default parallelism configuration for Spark SQL queries ### What changes were proposed in this pull request? This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`. ### Why are the changes needed? To avoid generating small files for INSERT INTO TABLE from VALUES, for example: ```sql CREATE TABLE t1(id int) USING parquet; INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8); ``` Before this PR: ``` -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 0 Dec 1 01:54 _SUCCESS ``` After this PR, with `spark.sql.files.minPartitionNum` set to 1: ``` -rw-r--r-- 1 root root 452 Dec 1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet -rw-r--r-- 1 root root 0 Dec 1 01:59 _SUCCESS ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30559 from wangyum/SPARK-33617. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <yumwang@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 21:36:52 +09:00
Max Gekk	26c0493318	[SPARK-33676][SQL] Require exact matching of partition spec to the schema in V2 `ALTER TABLE .. ADD/DROP PARTITION` ### What changes were proposed in this pull request? Check that partition specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match the partition schema (all partition fields from the schema are specified in the partition specs). ### Why are the changes needed? 1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error: ```sql spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b); spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9'); Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`'; ``` 2. To prevent future errors caused by partition specs that are not fully specified. ### Does this PR introduce _any_ user-facing change? Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` outputs the same error as V1 commands. ### How was this patch tested? By running the test suite with new UT: ``` $ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite" ``` Closes #30624 from MaxGekk/add-partition-full-spec. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 08:14:36 +00:00
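The exact-match requirement in the commit above amounts to comparing the set of column names in the spec with the set of partition columns. The sketch below is a model under stated assumptions, not the check added by the PR: the function `check_partition_spec` is hypothetical, the error text only imitates the V1 message quoted above, and names are compared case-insensitively because the quoted example reports `A='9'` as spec `(a)`.

```python
def check_partition_spec(spec: dict, partition_columns: list) -> None:
    """Raise ValueError unless spec names exactly the table's partition columns."""
    spec_cols = [name.lower() for name in spec]
    if set(spec_cols) != set(partition_columns):
        raise ValueError(
            "Partition spec is invalid. The spec ({}) must match the partition spec ({})".format(
                ", ".join(spec_cols), ", ".join(partition_columns)
            )
        )


partition_columns = ["a", "b"]   # as in: partitioned by (a, b)

check_partition_spec({"a": "9", "b": "1"}, partition_columns)   # fully specified: accepted

try:
    check_partition_spec({"A": "9"}, partition_columns)          # b is missing: rejected
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

Rejecting partial specs up front is what guards against the "not fully specified" cases the message lists as its second motivation.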
Max Gekk	87c056088e	[SPARK-33671][SQL] Remove VIEW checks from V1 table commands ### What changes were proposed in this pull request? Remove VIEW checks from the following V1 commands: - `SHOW PARTITIONS` - `TRUNCATE TABLE` - `LOAD DATA` The checks are performed earlier at: `acc211d2cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L885-L889)` ### Why are the changes needed? To improve code maintenance, and remove dead codes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites like `v1/ShowPartitionsSuite`. 1. LOAD DATA: `acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L176-L179)` 2. TRUNCATE TABLE: `acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L180-L183)` 3. SHOW PARTITIONS: - v1/ShowPartitionsSuite Closes #30620 from MaxGekk/show-table-check-view. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 23:22:52 -08:00
Dongjoon Hyun	73412ffb3a	[SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite not to depend on the default conf ### What changes were proposed in this pull request? This PR updates `PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite` to set the required conf explicitly. ### Why are the changes needed? The unit tests should not depend on the default configurations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? According to https://github.com/apache/spark/pull/30628 , these seem to be the only ones. Pass the CIs. Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 19:34:54 -08:00
Max Gekk	29096a8869	[SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED ### What changes were proposed in this pull request? Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from the V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified. This PR is a follow-up to https://github.com/apache/spark/pull/16373 and https://github.com/apache/spark/pull/15515. ### Why are the changes needed? To output a user-friendly error with a recommendation like " ... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName` " instead of silently outputting an empty result. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the affected test suites, in particular: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite" ``` Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 10:21:04 +09:00
Terry Kim	119539fd49	[SPARK-33663][SQL] Uncaching should not be called on non-existing temp views ### What changes were proposed in this pull request? This PR proposes to fix a misleading logs in the following scenario when uncaching is called on non-existing views: ``` scala> sql("CREATE TABLE table USING parquet AS SELECT 2") res0: org.apache.spark.sql.DataFrame = [] scala> val df = spark.table("table") df: org.apache.spark.sql.DataFrame = [2: int] scala> df.createOrReplaceTempView("t2") 20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name org.apache.spark.sql.AnalysisException: Table or view not found: t2;; 'UnresolvedRelation [t2], [], false at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589) at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476) at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124) ``` Since `t2` does not exist yet, it shouldn't try to uncache. ### Why are the changes needed? To fix misleading message. ### Does this PR introduce _any_ user-facing change? Yes, the above message will not be displayed if the view doesn't exist yet. ### How was this patch tested? Manually tested since this is a log message printed. Closes #30608 from imback82/fix_cache_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 09:48:16 +09:00
Max Gekk	48297818f3	[SPARK-33667][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW PARTITIONS` ### What changes were proposed in this pull request? Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag spark.sql.caseSensitive. ### Why are the changes needed? V1 SHOW PARTITIONS is case sensitive in fact, and doesn't respect the SQL config spark.sql.caseSensitive which is false by default, for instance: ```sql spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > PARTITIONED BY (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS; ``` The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the command above works as expected: ```sql spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); year=2015/month=1 ``` ### How was this patch tested? By running the affected test suites: - `v1/ShowPartitionsSuite` - `v2/ShowPartitionsSuite` Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 02:56:08 -08:00
Chao Sun	e857e06452	[SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache ### What changes were proposed in this pull request? This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`. ### Why are the changes needed? Currently DSv2 delete from table doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later. ### Does this PR introduce _any_ user-facing change? Yes. Now delete from table in v2 also refreshes the cache. ### How was this patch tested? Added a test case. Closes #30597 from sunchao/SPARK-33652. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 01:14:22 -08:00
Terry Kim	154f604403	[MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataConsumer.scala ### What changes were proposed in this pull request? This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`. ### Why are the changes needed? To fix a string interpolation bug. ### Does this PR introduce _any_ user-facing change? Yes, the string will be correctly constructed. ### How was this patch tested? Existing tests since they were used in exception/log messages. Closes #30609 from imback82/fix_cache_str_interporlation. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-06 12:03:14 +09:00
Wenchen Fan	1b4e35d1a8	[SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data source tables ### What changes were proposed in this pull request? This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it. ### Why are the changes needed? It's an unnecessary behavior difference that Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while data source table doesn't allow `CREATE EXTERNAL TABLE` at all. ### Does this PR introduce _any_ user-facing change? Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed. ### How was this patch tested? new tests Closes #30595 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 16:48:31 -08:00
allisonwang-db	960d6af75d	[SPARK-33472][SQL][FOLLOW-UP] Update RemoveRedundantSorts comment ### What changes were proposed in this pull request? This PR is a follow-up for #30373 that updates the comment for RemoveRedundantSorts in QueryExecution. ### Why are the changes needed? To update an incorrect comment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30584 from allisonwang-db/spark-33472-followup. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 15:15:19 -08:00
Dongjoon Hyun	b6b45bc695	[SPARK-33141][SQL][FOLLOW-UP] Fix Scala 2.13 compilation ### What changes were proposed in this pull request? This PR aims to fix Scala 2.13 compilation. ### Why are the changes needed? To recover Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Action Scala 2.13 build job. Closes #30611 from dongjoon-hyun/SPARK-33141. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 15:04:18 -08:00
Dongjoon Hyun	de9818f043	[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT ### What changes were proposed in this pull request? This PR aims to update `master` branch version to 3.2.0-SNAPSHOT. ### Why are the changes needed? Start to prepare Apache Spark 3.2.0. ### Does this PR introduce _any_ user-facing change? N/A. ### How was this patch tested? Pass the CIs. Closes #30606 from dongjoon-hyun/SPARK-3.2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 14:10:42 -08:00
Wenchen Fan	acc211d2cf	[SPARK-33141][SQL][FOLLOW-UP] Store the max nested view depth in AnalysisContext ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30289. It removes the hack in `View.effectiveSQLConf`, by putting the max nested view depth in `AnalysisContext`. Then we don't get the max nested view depth from the active SQLConf, which keeps changing during nested view resolution. ### Why are the changes needed? remove hacks. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? If I just remove the hack, `SimpleSQLViewSuite.restrict the nested level of a view` fails. With this fix, it passes again. Closes #30575 from cloud-fan/view. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-04 14:01:15 +00:00
Jungtaek Lim (HeartSaVioR)	233a8494c8	[SPARK-27237][SS] Introduce State schema validation among query restart ## What changes were proposed in this pull request? Please refer to the description of [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) for the rationale of this patch. This patch proposes to introduce state schema validation by storing the key schema and value schema in a `schema` file (the first time) and verifying that the new key schema and value schema for state are compatible with the existing ones. To be clear about the definition of "compatible": a state schema is "compatible" when the number of fields is the same and the data type of each field is the same; Spark has been allowing fields to be renamed. This patch will prevent running a query that has an incompatible state schema, which reduces the chance of nondeterministic behavior (actually, renaming a field also smells of semantic incompatibility, but end users could just modify its name, so we can't say) and provides a more informative error message. ## How was this patch tested? Added UTs. Closes #24173 from HeartSaVioR/SPARK-27237. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-04 19:33:11 +09:00
Yuanjian Li	325abf7957	[SPARK-33577][SS] Add support for V1Table in stream writer table API and create table if not exist by default ### What changes were proposed in this pull request? After SPARK-32896, we have a table API for the stream writer, but it only supports DataSource v2 tables. Here we add the following enhancements: - Create non-existing tables by default - Support both managed and external V1Tables ### Why are the changes needed? To make the API cover more use cases, especially file-provider-based tables. ### Does this PR introduce _any_ user-facing change? Yes, new features added. ### How was this patch tested? Add new UTs. Closes #30521 from xuanyuanking/SPARK-33577. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-04 16:45:55 +09:00
Max Gekk	94c144bdd0	[SPARK-33571][SQL][DOCS] Add a ref to INT96 config from the doc for `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` ### What changes were proposed in this pull request? For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by: 1. Explicitly documenting which parquet types those configs influence 2. Referring to the corresponding configs for `INT96` ### Why are the changes needed? To avoid user confusion like that reported in SPARK-33571, and to make the config descriptions more precise. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #30596 from MaxGekk/clarify-rebase-docs. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-04 16:26:07 +09:00
Gengliang Wang	e8380665c7	[SPARK-33658][SQL] Suggest using Datetime conversion functions for invalid ANSI casting ### What changes were proposed in this pull request? Suggest that users use Datetime conversion functions in the error message for invalid ANSI explicit casting. ### Why are the changes needed? In ANSI mode, explicit casts between DateTime types and Numeric types are not allowed. Now that we have introduced the new functions `UNIX_SECONDS`/`UNIX_MILLIS`/`UNIX_MICROS`/`UNIX_DATE`/`DATE_FROM_UNIX_DATE`, we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode. ### Does this PR introduce _any_ user-facing change? Yes, better error messages ### How was this patch tested? Unit test Closes #30603 from gengliangwang/improveErrorMsgOfExplicitCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-04 16:24:41 +09:00
Huaxin Gao	15579ba1f8	[SPARK-33430][SQL] Support namespaces in JDBC v2 Table Catalog ### What changes were proposed in this pull request? Add namespace support to the JDBC v2 Table Catalog by making ```JDBCTableCatalog``` extend ```SupportsNamespaces``` ### Why are the changes needed? To make the v2 JDBC implementation complete ### Does this PR introduce _any_ user-facing change? Yes. Add the following to ```JDBCTableCatalog``` - listNamespaces - listNamespaces(String[] namespace) - namespaceExists(String[] namespace) - loadNamespaceMetadata(String[] namespace) - createNamespace - alterNamespace - dropNamespace ### How was this patch tested? Add new docker tests Closes #30473 from huaxingao/name_space. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-04 07:23:35 +00:00
Linhong Liu	e02324f2dd	[SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view ### What changes were proposed in this pull request? Currently, in Spark, a temp view is saved as its analyzed logical plan, while a permanent view is kept in HMS with its original SQL text. As a result, permanent and temporary views behave differently in some cases. In this PR we store the SQL text for temporary views in order to unify the behavior between permanent and temporary views. ### Why are the changes needed? To unify the behavior between permanent and temporary views ### Does this PR introduce _any_ user-facing change? Yes, with this PR, a temporary view will be re-analyzed when it is referenced. So if the underlying datasource has changed, the view will also be updated. ### How was this patch tested? existing and newly added test cases Closes #30567 from linhongliu-db/SPARK-33142. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-04 06:48:49 +00:00
Gengliang Wang	29e415deac	[SPARK-33649][SQL][DOC] Improve the doc of spark.sql.ansi.enabled ### What changes were proposed in this pull request? Improve the documentation of SQL configuration `spark.sql.ansi.enabled` ### Why are the changes needed? As there are more and more new features under the SQL configuration `spark.sql.ansi.enabled`, we should make it more clear about: 1. what exactly it is 2. where can users find all the features of the ANSI mode 3. whether all the features are exactly from the SQL standard ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change. Closes #30593 from gengliangwang/reviseAnsiDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-12-04 10:58:41 +08:00
Max Gekk	85949588b7	[SPARK-33650][SQL] Fix the error from ALTER TABLE .. ADD/DROP PARTITION for non-supported partition management table ### What changes were proposed in this pull request? In the PR, I propose to change the order of post-analysis checks for the `ALTER TABLE .. ADD/DROP PARTITION` command, and perform the general check (does the table support partition management at all) before specific checks. ### Why are the changes needed? The error message for the table which doesn't support partition management can mislead users: ```java PartitionSpecs are not resolved;; 'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false +- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable5d3ff859 ``` because it says nothing about the root cause of the issue. ### Does this PR introduce _any_ user-facing change? Yes. After the change, the error message will be: ``` Table ns1.ns2.tbl can not alter partitions ``` ### How was this patch tested? By running the affected test suite `AlterTablePartitionV2SQLSuite`. Closes #30594 from MaxGekk/check-order-AlterTablePartition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-03 16:43:15 -08:00
Wenchen Fan	63f9d474b9	[SPARK-33634][SQL][TESTS] Use Analyzer in PlanResolutionSuite ### What changes were proposed in this pull request? Instead of using several analyzer rules, this PR uses the actual analyzer to run tests in `PlanResolutionSuite`. ### Why are the changes needed? To make the test suite match reality. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test-only Closes #30574 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-03 09:22:53 -08:00
Anton Okolnychyi	aa13e207c9	[SPARK-33623][SQL] Add canDeleteWhere to SupportsDelete ### What changes were proposed in this pull request? This PR provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time. ### Why are the changes needed? The only way to support delete statements right now is to implement ``SupportsDelete``. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. like deleting a complete partition in a Hive table). This PR actually provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time instead of just getting an exception during execution. In the future, we can use this functionality to decide whether Spark should rewrite this delete and execute a distributed query or it can just pass a set of filters. Consider an example of a partitioned Hive table. If we have a delete predicate like `part_col = '2020'`, we can just drop the matching partition to satisfy this delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this as a delete without significant effort. At the same time, if we have a delete predicate like `id = 10`, Hive tables would not be able to execute this delete using a metadata only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere` and we should use a more sophisticated row-level API to find out which records should be removed (the API is yet to be discussed, but we need this PR as a basis). If we decide to support subqueries and all delete use cases by simply extending the existing API, this will mean all data sources will have to implement a lot of Spark logic to determine which records changed. 
I don't think we want to go that way as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that Spark will execute a plan to find which records must be deleted for data sources that return `false` from `canDeleteWhere`. ### Does this PR introduce _any_ user-facing change? Yes but it is backward compatible. ### How was this patch tested? This PR comes with a new test. Closes #30562 from aokolnychyi/spark-33623. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-03 09:12:30 -08:00
Wenchen Fan	0706e64c49	[SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command ### What changes were proposed in this pull request? For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false. This is a retry after we unify the CREATE TABLE syntax. It partially reverts `d2bec5e265` This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which is an unnecessary behavior difference from Hive tables. ### Why are the changes needed? Changing from Hive text table to native Parquet table has many benefits: 1. be consistent with `DataFrameWriter.saveAsTable`. 2. better performance 3. better support for nested types (Hive text table doesn't work well with nested types, e.g. `insert into t values struct(null)` actually inserts a null value not `struct(null)` if `t` is a Hive text table, which leads to wrong result) 4. better interoperability as Parquet is a more popular open file format. ### Does this PR introduce _any_ user-facing change? No by default. If the config is set, the behavior change is described below: Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that work for Hive tables also work for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE \| SERDEPROPERTIES]` and `LOAD DATA`. char/varchar behavior has been taken care of by https://github.com/apache/spark/pull/30412, and there is no behavior difference between data source and Hive tables. One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough. Another potential issue is that users may use Spark to create the table and then use Hive to add partitions with a different serde. 
This is not allowed for Spark native tables. ### How was this patch tested? Re-enable the tests Closes #30554 from cloud-fan/create-table. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-03 15:24:44 +00:00
luluorta	512fb32b38	[SPARK-26218][SQL][FOLLOW UP] Fix the corner case of codegen when casting float to Integer ### What changes were proposed in this pull request? This is a followup of [#27151](https://github.com/apache/spark/pull/27151). It fixes the same issue for the codegen path. ### Why are the changes needed? Result corrupt. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added Unit test. Closes #30585 from luluorta/SPARK-26218. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-03 14:58:56 +00:00
Gengliang Wang	ff13f574e6	[SPARK-20044][SQL] Add new function DATE_FROM_UNIX_DATE and UNIX_DATE ### What changes were proposed in this pull request? Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between Date type and Numeric types. ### Why are the changes needed? 1. Explicit conversion between Date type and Numeric types is disallowed in ANSI mode. We need to provide new functions for users to complete the conversion. 2. We have introduced new functions from Bigquery for conversion between Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS , UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense to add functions for conversion between Date type and Numeric types as well. ### Does this PR introduce _any_ user-facing change? Yes, two new datetime functions are added. ### How was this patch tested? Unit tests Closes #30588 from gengliangwang/dateToNumber. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-03 14:04:08 +00:00
Yuanjian Li	878cc0e6e9	[SPARK-32896][SS][FOLLOW-UP] Rename the API to `toTable` ### What changes were proposed in this pull request? As discussed in https://github.com/apache/spark/pull/30521#discussion_r531463427, rename the API to `toTable`. ### Why are the changes needed? Rename the API for further extension and accuracy. ### Does this PR introduce _any_ user-facing change? Yes, it's an API change but the new API is not released yet. ### How was this patch tested? Existing UT. Closes #30571 from xuanyuanking/SPARK-32896-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-12-02 17:36:25 -08:00
uncleGen	4f96670358	[SPARK-31953][SS] Add Spark Structured Streaming History Server Support ### What changes were proposed in this pull request? Add Spark Structured Streaming History Server Support. ### Why are the changes needed? Add a streaming query history server plugin. ![image](https://user-images.githubusercontent.com/7402327/84248291-d26cfe80-ab3b-11ea-86d2-98205fa2bcc4.png) ![image](https://user-images.githubusercontent.com/7402327/84248347-e44ea180-ab3b-11ea-81de-eefe207656f2.png) ![image](https://user-images.githubusercontent.com/7402327/84248396-f0d2fa00-ab3b-11ea-9b0d-e410115471b0.png) - Follow-ups - Query duration should not update in history UI. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Update UT. Closes #28781 from uncleGen/SPARK-31953. Lead-authored-by: uncleGen <hustyugm@gmail.com> Co-authored-by: Genmao Yu <hustyugm@gmail.com> Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-12-02 17:11:51 -08:00
Gengliang Wang	b76c6b759c	[SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS ### What changes were proposed in this pull request? As https://github.com/apache/spark/pull/28534 adds functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamp, this PR is to add functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to numbers. ### Why are the changes needed? 1. Symmetry of the conversion functions 2. Casting timestamp type to numeric types is disallowed in ANSI mode, we should provide functions for users to complete the conversion. ### Does this PR introduce _any_ user-facing change? 3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type. ### How was this patch tested? Unit tests. Closes #30566 from gengliangwang/timestampLong. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-02 12:44:39 -08:00
yi.wu	a082f4600b	[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin ### What changes were proposed in this pull request? Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. But this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id). As a result, it breaks the rule `DetectAmbiguousSelfJoin`. In this PR, we propose to drop the usage of `withPlan` and use the `logicalPlan` directly so its `dataset_id` doesn't change. Besides, this PR also removes related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` tries to construct its own metadata, because the `Alias` is no longer a reference column after converting to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed. ### Why are the changes needed? The query below returns a wrong result when it should throw an ambiguous self join exception instead: ```scala val emp1 = Seq[TestData]( TestData(1, "sales"), TestData(2, "personnel"), TestData(3, "develop"), TestData(4, "IT")).toDS() val emp2 = Seq[TestData]( TestData(1, "sales"), TestData(2, "personnel"), TestData(3, "develop")).toDS() val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("")) emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer") .select(emp1.col(""), emp3.col("key").as("e2")).show() // wrong result +---+---------+---+ \|key\| value\| e2\| +---+---------+---+ \| 1\| sales\| 1\| \| 2\|personnel\| 2\| \| 3\| develop\| 3\| \| 4\| IT\| 4\| +---+---------+---+ ``` This PR fixes the wrong behaviour. ### Does this PR introduce _any_ user-facing change? Yes, users hit the exception instead of the wrong result after this PR. ### How was this patch tested? 
Added a new unit test. Closes #30488 from Ngone51/fix-self-join. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-02 17:51:22 +00:00
xuewei.linxuewei	58583f7c3f	[SPARK-33619][SQL] Fix GetMapValueUtil code generation error ### What changes were proposed in this pull request? Fix a codegen bug that was introduced by SPARK-33460 ``` GetMapValueUtil s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");""" SHOULD BE s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");""" ``` SPARK-33460 failed to detect this bug via UT because `checkExceptionInExpression ` did not work as expected: unlike `checkEvaluation`, it did not evaluate the expression in BOTH `CODEGEN_ONLY` and `NO_CODEGEN` modes. This PR fixes that test bug as well. ### Why are the changes needed? Bug Fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add UT and Existing UT. Closes #30560 from leanken/leanken-SPARK-33619. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-02 16:10:45 +00:00
HyukjinKwon	df8d3f1bf7	[SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify the documentation more ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/30504. It proposes: - Rename `NoSideEffect` to `NoThrow`, and use `Expression.deterministic` together where it is used. - Clarify, in the docs in the expressions, that it means they don't throw exceptions ### Why are the changes needed? `NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic. It's best to be explicit so `NoThrow` was proposed - I looked if there's a similar name to represent this concept and borrowed the name of [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow). For determinism, we already have a way to note it under `Expression.deterministic`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually ran the existing unittests written. Closes #30570 from HyukjinKwon/SPARK-33544. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-02 16:03:08 +00:00
Dongjoon Hyun	290aa02179	[SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work ### What changes were proposed in this pull request? This reverts commit SPARK-33212 (`cb3fa6c936`) mostly with three exceptions: 1. `SparkSubmitUtils` was updated recently by SPARK-33580 2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency. 3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471. ### Why are the changes needed? According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following. 1. Spark distribution with `-Phadoop-cloud` ```scala $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY 20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[], app id = local-1606806088715). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272) Type in expressions to have them evaluated. Type :help for more information. 
scala> spark.read.parquet("s3a://dongjoon/users.parquet").show 20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties +------+--------------+----------------+ \| name\|favorite_color\|favorite_numbers\| +------+--------------+----------------+ \|Alyssa\| null\| [3, 9, 15, 20]\| \| Ben\| red\| []\| +------+--------------+----------------+ scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet") 20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1] java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V ``` 2. Spark distribution without `-Phadoop-cloud`* ```scala $ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0 ... java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI. Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-02 18:23:48 +09:00
Cheng Su	a4788ee8c6	[MINOR][SS] Rename auxiliary protected methods in StreamingJoinSuite ### What changes were proposed in this pull request? Per request from https://github.com/apache/spark/pull/30395#issuecomment-735028698, here we remove `Windowed` from methods names `setupWindowedJoinWithRangeCondition` and `setupWindowedSelfJoin` as they don't join on time window. ### Why are the changes needed? There's no such official name for `windowed join`, so this is to help avoid confusion for future developers. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #30563 from c21/stream-minor. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-02 15:28:16 +09:00
Cheng Su	51ebcd95a5	[SPARK-32863][SS] Full outer stream-stream join ### What changes were proposed in this pull request? This PR is to add full outer stream-stream join, and the implementation of full outer join is: * For left side input row, check if there's a match on right side state store. * if there's a match, output the joined row, o.w. output nothing. Put the row in left side state store. * For right side input row, check if there's a match on left side state store. * if there's a match, output the joined row, o.w. output nothing. Put the row in right side state store. * State store eviction: evict rows from left/right side state store below watermark, and output rows never matched before (a combination of left outer and right outer join). ### Why are the changes needed? Enable more use cases for spark stream-stream join. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`. Closes #30395 from c21/stream-foj. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-02 10:17:00 +09:00
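The full outer stream-stream join steps listed in SPARK-32863 above can be sketched roughly as follows. This is only an illustrative in-memory model, not Spark's implementation: `Row`, `Stored`, `FullOuterJoinSketch`, `onLeft`, `onRight` and `evict` are hypothetical names, the "state stores" are plain buffers, and joining on an integer key with an event-time watermark is an assumption made for the example.

```scala
import scala.collection.mutable

object FullOuterJoinSketch {
  // Hypothetical row shape: join key, payload, and event time used for eviction.
  final case class Row(key: Int, value: String, eventTime: Long)
  // A buffered row plus a flag recording whether it has ever been matched.
  final case class Stored(row: Row, var matched: Boolean)

  // Joined output; None marks the missing side of a never-matched row.
  type Out = (Option[Row], Option[Row])

  final class Join {
    private val leftState = mutable.ArrayBuffer.empty[Stored]
    private val rightState = mutable.ArrayBuffer.empty[Stored]

    // Look for matches in the other side's state, emit joined rows (or nothing),
    // then put the input row into its own side's state.
    private def process(
        row: Row,
        own: mutable.ArrayBuffer[Stored],
        other: mutable.ArrayBuffer[Stored],
        isLeft: Boolean): Seq[Out] = {
      val matches = other.filter(_.row.key == row.key)
      matches.foreach(_.matched = true)
      own += Stored(row, matched = matches.nonEmpty)
      matches.map { m =>
        if (isLeft) (Option(row), Option(m.row)) else (Option(m.row), Option(row))
      }.toSeq
    }

    def onLeft(row: Row): Seq[Out] = process(row, leftState, rightState, isLeft = true)
    def onRight(row: Row): Seq[Out] = process(row, rightState, leftState, isLeft = false)

    // Evict rows below the watermark from both sides and output those never matched.
    def evict(watermark: Long): Seq[Out] = {
      def drain(state: mutable.ArrayBuffer[Stored], isLeft: Boolean): Seq[Out] = {
        val (expired, kept) = state.partition(_.row.eventTime < watermark)
        state.clear()
        state ++= kept
        expired.filterNot(_.matched).map { s =>
          if (isLeft) (Option(s.row), Option.empty[Row])
          else (Option.empty[Row], Option(s.row))
        }.toSeq
      }
      drain(leftState, isLeft = true) ++ drain(rightState, isLeft = false)
    }
  }
}
```

Tracking a per-row `matched` flag is what lets eviction act as the combination of left outer and right outer output that the commit message describes.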
Thomas Graves	f71f34572d	[SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size of its children ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is applied to a `CreateArray` that does not use Literals (Literals are already handled). When this happens you know that the values aren't null and that the array has a size. The empty array is already handled. The not null check is already optimized out because `CreateArray` and `CreateMap` are not nullable, which leaves the size > 0 check. To handle that, this PR makes the size > 0 check get optimized in ConstantFolding to be the size of the children in the array or map. That makes it a literal, which is then ultimately optimized out. ### Why are the changes needed? remove unneeded filter ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Unit tests added and manually tested various cases Closes #30504 from tgravescs/SPARK-33544. Lead-authored-by: Thomas Graves <tgraves@nvidia.com> Co-authored-by: Thomas Graves <tgraves@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-02 09:50:02 +09:00
Anton Okolnychyi	c24f2b2d6a	[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer ### What changes were proposed in this pull request? This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources. ### Why are the changes needed? Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The change is trivial and SPARK-23889 will depend on it. Closes #30558 from aokolnychyi/spark-33612. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-01 09:27:46 -08:00
Anton Okolnychyi	478fb7f528	[SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates ### What changes were proposed in this pull request? This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`. ### Why are the changes needed? Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The PR adds 3 new test cases. Closes #30555 from aokolnychyi/spark-33608. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-01 14:11:01 +00:00
Prakhar Jain	cf4ad212b1	[SPARK-33503][SQL] Refactor SortOrder class to allow multiple childrens ### What changes were proposed in this pull request? This is a followup of #30302 . As part of this PR, sameOrderExpressions set is made part of children of SortOrder node - so that they don't need any special handling as done in #30302 . ### Why are the changes needed? sameOrderExpressions should get same treatment as child. So making them part of children helps in transforming them easily. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-01 21:13:27 +09:00
gengjiaan	9273d4250d	[SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue ### What changes were proposed in this pull request? Spark already supports the `LIKE ANY` syntax, but it throws `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for LIKE ANY to fix this issue. Why can the stack overflow happen in the current approach? The current approach uses reduceLeft to connect each `Like(e, p)`, which makes the call depth of the thread too large and causes `StackOverflowError` problems. Why can the fix in this PR avoid the error? This PR supports a built-in function for `LIKE ANY` and so avoids this issue. ### Why are the changes needed? 1. Fix the `StackOverflowError` issue. 2. Support built-in function `like_any`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #30465 from beliefer/SPARK-33045-like_any-bak. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-01 11:48:30 +00:00
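To illustrate why chaining `Like(e, p)` nodes with reduceLeft (SPARK-33045 above) produces a very deep tree while a single built-in node stays flat, here is a small sketch. The `Expr`, `Like`, `Or` and `LikeAny` classes below are hypothetical stand-ins written for this example and are not Catalyst's classes; combining the `Like` nodes with `Or` is an assumption about how the chain was built.

```scala
object LikeAnyDepthSketch {
  // Hypothetical mini expression tree, not Spark's Catalyst expressions.
  sealed trait Expr
  final case class Like(input: String, pattern: String) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr
  final case class LikeAny(input: String, patterns: Seq[String]) extends Expr

  // Depth is computed iteratively so that measuring a deep tree
  // does not itself depend on the thread's stack size.
  def depth(root: Expr): Int = {
    var maxDepth = 0
    var stack: List[(Expr, Int)] = List((root, 1))
    while (stack.nonEmpty) {
      val (e, d) = stack.head
      stack = stack.tail
      maxDepth = math.max(maxDepth, d)
      e match {
        case Or(l, r) => stack = (l, d + 1) :: (r, d + 1) :: stack
        case _ => // leaf nodes have no children to visit
      }
    }
    maxDepth
  }

  val patterns: Seq[String] = (1 to 1000).map(i => s"%p$i%")

  // reduceLeft nests every new predicate on top of the previous result,
  // so the tree depth grows linearly with the number of patterns.
  val chained: Expr =
    patterns.map(p => Like("col", p): Expr).reduceLeft[Expr]((l, r) => Or(l, r))

  // One node holding all patterns keeps the tree depth constant.
  val flat: Expr = LikeAny("col", patterns)
}
```

Any recursive traversal of `chained` needs one stack frame per nesting level, which is where a large pattern list can run out of stack; the flat node avoids that regardless of the list size.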
Huaxin Gao	d38883c1d8	[SPARK-32405][SQL][FOLLOWUP] Throw Exception if provider is specified in JDBCTableCatalog create table ### What changes were proposed in this pull request? Throw Exception if JDBC Table Catalog has provider in create table. ### Why are the changes needed? JDBC Table Catalog doesn't support provider and we should throw Exception. Previously the CREATE TABLE syntax forced people to specify a provider, so we had to add a `USING _`. Now that problem has been fixed and we will throw Exception for provider. ### Does this PR introduce _any_ user-facing change? Yes. We throw Exception if a provider is specified in CREATE TABLE for JDBC Table catalog. ### How was this patch tested? Existing tests (remove `USING _`) Closes #30544 from huaxingao/followup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-01 11:38:42 +00:00
Gabor Somogyi	e5bb2937f6	[SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API ### What changes were proposed in this pull request? Deprecated `KafkaConsumer.poll(long)` API calls may cause infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information what migration consideration must be done. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details. The PR contains the following changes: * Added `AdminClient` based offset fetching * GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId) * GroupId override feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId) * Additional unit tests * Code comment changes * Minor bugfixes here and there * Removed Kafka auto topic creation feature but only in `AdminClient` based approach (please see doc for rationale). In short, it's super hidden, not sure anybody ever used in production + error prone. * Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration` ### Why are the changes needed? Driver may hang forever. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing + additional unit tests. Cluster test with simple Kafka topic to another topic query. Documentation: ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #29729 from gaborgsomogyi/SPARK-32032. 
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-01 20:34:00 +09:00
zky.zhoukeyong	1034815519	[SPARK-33572][SQL] Datetime building should fail if the year, month, ..., second combination is invalid ### What changes were proposed in this pull request? Datetime building should fail if the year, month, ..., second combination is invalid, when ANSI mode is enabled. This patch should update MakeDate, MakeTimestamp and MakeInterval. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT and Existing UT. Closes #30516 from waitinfuture/SPARK-33498. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: waitinfuture <waitinfuture@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-01 11:07:16 +00:00
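As a rough illustration of the ANSI-mode behaviour described in SPARK-33572 above, the sketch below builds a date and either fails or yields no value on an invalid combination. `MakeDateSketch`, `makeDate` and the `ansiEnabled` parameter are hypothetical names, and returning `None` in non-ANSI mode is an assumption standing in for a SQL NULL; only `java.time.LocalDate.of` is a real API here.

```scala
import java.time.{DateTimeException, LocalDate}

object MakeDateSketch {
  // Build a date from its fields. On an invalid combination:
  //   - ANSI mode: propagate the failure to the caller
  //   - otherwise: return None (assumed stand-in for NULL)
  def makeDate(year: Int, month: Int, day: Int, ansiEnabled: Boolean): Option[LocalDate] =
    try {
      Some(LocalDate.of(year, month, day))
    } catch {
      case e: DateTimeException =>
        if (ansiEnabled) throw e else None
    }
}
```

For example, `MakeDateSketch.makeDate(2020, 2, 30, ansiEnabled = false)` yields `None`, while the same call with `ansiEnabled = true` throws a `DateTimeException`.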
Jungtaek Lim (HeartSaVioR)	52e5cc46bc	[SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files ### What changes were proposed in this pull request? This patch proposes to provide a new option to specify time-to-live (TTL) for output file entries in FileStreamSink. TTL is defined via current timestamp - the last modified time for the file. This patch will filter out outdated output files in metadata while compacting batches (other batches don't have functionality to clean entries), which helps metadata to not grow linearly, as well as filtered out files will be "eventually" no longer seen in reader queries which leverage File(Stream)Source. ### Why are the changes needed? The metadata log greatly helps to easily achieve exactly-once but given the output path is open to arbitrary readers, there's no way to compact the metadata log, which ends up growing the metadata file as query runs for long time, especially for compacted batch. Lots of end users have been reporting the issue: see comments in [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295) and [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995), and [SPARK-30462](https://issues.apache.org/jira/browse/SPARK-30462). (There're some reports from end users which include their workarounds: SPARK-24295) ### Does this PR introduce any user-facing change? No, as the configuration is new and by default it is not applied. ### How was this patch tested? New UT. Closes #28363 from HeartSaVioR/SPARK-27188-v2. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-01 14:42:48 +09:00
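The retention rule from SPARK-27188 above (TTL defined via current timestamp minus the file's last modified time, applied while compacting) can be sketched like this. `FileEntry`, `retain` and the millisecond units are hypothetical choices for the example, and keeping an entry whose age equals the TTL exactly is an assumption; the real sink log entry and option name are not reproduced here.

```scala
object SinkRetentionSketch {
  // Hypothetical stand-in for one output file entry in the sink metadata log.
  final case class FileEntry(path: String, modificationTimeMs: Long)

  // Keep only entries whose age (now - last modified time) is within the TTL.
  // Intended to model the filtering step applied when a compact batch is written.
  def retain(entries: Seq[FileEntry], nowMs: Long, ttlMs: Long): Seq[FileEntry] =
    entries.filter(e => nowMs - e.modificationTimeMs <= ttlMs)
}
```

Because the filter only runs during compaction in this model, outdated entries disappear from the log eventually rather than immediately, which matches the "eventually no longer seen" wording of the commit message.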
Jungtaek Lim (HeartSaVioR)	2af2da5a4b	[SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch ### What changes were proposed in this pull request? This patch addresses the case where compact metadata file is read twice in FileStreamSource during restarting query. When restarting the query, there is a case which the query starts from compaction batch, and the batch has source metadata file to read. One case is that the previous query succeeded to read from inputs, but not finalized the batch for various reasons. The patch finds the latest compaction batch when restoring from metadata log, and put entries for the batch into the file entry cache which would avoid reading compact batch file twice. FileStreamSourceLog doesn't know about offset / commit metadata in checkpoint so doesn't know which exactly batch to start from, but in practice, only couple of latest batches are candidates to be started from when restarting query. This patch leverages the fact to skip calculation if possible. ### Why are the changes needed? Spark incurs unnecessary cost on reading the compact metadata file twice on some case, which may not be ignorable when the query has been processed huge number of files so far. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT. Closes #27649 from HeartSaVioR/SPARK-30900. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-01 13:11:14 +09:00
Kousuke Saruta	c50fcac00e	[SPARK-33607][SS][WEBUI] Input Rate timeline/histogram aren't rendered if built with Scala 2.13 ### What changes were proposed in this pull request? This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13. ![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png) ![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png) The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13. The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`. Actually, the comparison logic for `Double` type was changed in Scala 2.13. https://github.com/scala/bug/issues/12107 https://github.com/scala/scala/pull/6410 So this issue happens as of Scala 2.13. The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164). This `NaN` seems to be an initial value of `inputTimeSec` so I think `Double.PositiveInfinity` is suitable rather than `NaN` and this change can resolve this issue. ### Why are the changes needed? To make sure we can use the histogram/timeline with Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? First, I built with the following commands. 
``` $ /dev/change-scala-version.sh 2.13 $ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package ``` Then, ran the following query (this is brought from #30427 ). ``` import org.apache.spark.sql.streaming.Trigger val query = spark .readStream .format("rate") .option("rowsPerSecond", 1000) .option("rampUpTime", "10s") .load() .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod") .selectExpr("tsMod", "mod(value, 100) as mod", "value") .withWatermark("tsMod", "10 seconds") .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .writeStream .format("console") .trigger(Trigger.ProcessingTime("5 seconds")) .outputMode("append") .start() ``` Finally, I confirmed that the timeline and histogram are rendered. ![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png) ``` Closes #30546 from sarutak/ss-nan. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-01 11:45:32 +09:00
Max Gekk	030b3139da	[SPARK-33569][SPARK-33452][SQL][FOLLOWUP] Fix a build error in `ShowPartitionsExec` ### What changes were proposed in this pull request? Use `listPartitionIdentifiers ` instead of `listPartitionByNames` in `ShowPartitionsExec`. The `listPartitionByNames` was renamed by https://github.com/apache/spark/pull/30514. ### Why are the changes needed? To fix build error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests for the `SHOW PARTITIONS` command: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite" ``` Closes #30553 from MaxGekk/fix-build-show-partitions-exec. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 16:40:36 +00:00
Max Gekk	6fd148fea8	[SPARK-33569][SQL] Remove getting partitions by an identifier prefix ### What changes were proposed in this pull request? 1. Remove the method `listPartitionIdentifiers()` from the `SupportsPartitionManagement` interface. The method lists partitions by ident prefix. 2. Rename `listPartitionByNames()` to `listPartitionIdentifiers()`. 3. Re-implement the default method `partitionExists()` using new method. ### Why are the changes needed? Getting partitions by ident prefix only is not used, and it can be removed to improve code maintenance. Also this makes the `SupportsPartitionManagement` interface cleaner. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly org.apache.spark.sql.connector.catalog.*" ``` Closes #30514 from MaxGekk/remove-listPartitionIdentifiers. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 14:05:49 +00:00
Max Gekk	0a612b6a40	[SPARK-33452][SQL] Support v2 SHOW PARTITIONS ### What changes were proposed in this pull request? 1. Remove V2 logical node `ShowPartitionsStatement `, and replace it by V2 `ShowPartitions`. 2. Implement V2 execution node `ShowPartitionsExec` similar to V1 `ShowPartitionsCommand`. ### Why are the changes needed? To have feature parity with Datasource V1. ### Does this PR introduce _any_ user-facing change? Yes. Before the change, `SHOW PARTITIONS` fails in V2 table catalogs with the exception: ``` org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is only supported with v1 tables. at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$parseV1Table(ResolveSessionCatalog.scala:628) at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:466) ``` ### How was this patch tested? By running the following test suites: 1. Modified `ShowPartitionsParserSuite` where `ShowPartitionsStatement` is replaced by V2 `ShowPartitions`. 2. `v2.ShowPartitionsSuite` Closes #30398 from MaxGekk/show-partitions-exec-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 13:45:53 +00:00
Wenchen Fan	5cfbdddefe	[SPARK-33480][SQL] Support char/varchar type ### What changes were proposed in this pull request? This PR adds the char/varchar type which is kind of a variant of string type: 1. Char type is fixed-length string. When comparing char type values, we need to pad the shorter one to the longer length. 2. Varchar type is string with a length limitation. To implement the char/varchar semantic, this PR: 1. Do string length check when writing to char/varchar type columns. 2. Do string padding when reading char type columns. We don't do it at the writing side to save storage space. 3. Do string padding when comparing char type column with string literal or another char type column. (string literal is fixed length so should be treated as char type as well) To simplify the implementation, this PR doesn't propagate char/varchar type info through functions/operators(e.g. `substring`). That said, a column can only be char/varchar type if it's a table column, not a derived column like `SELECT substring(col)`. To be safe, this PR doesn't add char/varchar type to the query engine(expression input check, internal row framework, codegen framework, etc.). We will replace char/varchar type by string type with metadata (`Attribute.metadata` or `StructField.metadata`) that includes the original type string before it goes into the query engine. That said, the existing code will not see char/varchar type but only string type. char/varchar type may come from several places: 1. v1 table from hive catalog. 2. v2 table from v2 catalog. 3. user-specified schema in `spark.read.schema` and `spark.readStream.schema` 4. `Column.cast` 5. schema string in places like `from_json`, pandas UDF, etc. These places use SQL parser which replaces char/varchar with string already, even before this PR. This PR covers all the above cases, implements the length check and padding feature by looking at string type with special metadata. ### Why are the changes needed? 
char and varchar are standard SQL types. varchar is widely used in other databases instead of string type. ### Does this PR introduce _any_ user-facing change? For hive tables: now the table insertion fails if the value exceeds char/varchar length. Previously we truncate the value silently. For other tables: 1. now char type is allowed. 2. now we have length check when inserting to varchar columns. Previously we write the value as it is. ### How was this patch tested? new tests Closes #30412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 09:23:05 +00:00
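The char/varchar semantics listed in SPARK-33480 above (length check on write, padding on read, padding on comparison) can be sketched with plain strings. `CharVarcharSketch` and its methods are hypothetical helpers for illustration only; they count UTF-16 chars via `String.length`, which is a simplifying assumption, and they do not reflect how Spark carries the original type in column metadata.

```scala
object CharVarcharSketch {
  // Write side: reject values longer than the declared char/varchar length.
  def writeCheck(value: String, limit: Int): String = {
    require(value.length <= limit, s"input length ${value.length} exceeds limit $limit")
    value
  }

  // Read side for char(n): pad with trailing spaces up to the fixed length.
  // Values are stored unpadded to save storage space.
  def readChar(stored: String, n: Int): String = stored.padTo(n, ' ')

  // Comparison of char values: pad the shorter one to the longer length first.
  def charEquals(a: String, b: String): Boolean = {
    val n = math.max(a.length, b.length)
    a.padTo(n, ' ') == b.padTo(n, ' ')
  }
}
```

Padding only on read and on comparison means the stored value for a `char(5)` column holding `ab` stays two characters long, yet it still compares equal to the literal `'ab   '`.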
gengjiaan	b665d58819	[SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream databases ### What changes were proposed in this pull request? Currently, Spark allows calls to `count` with no arguments, even though `count` is not a parameterless aggregate function. For example, the following query actually works: `SELECT count() FROM tenk1;` On the other hand, mainstream databases will throw an error. Oracle `> ORA-00909: invalid number of arguments` PgSQL `ERROR: count(*) must be used to call a parameterless aggregate function` MySQL `> 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')` ### Why are the changes needed? Fix a bug so that Spark is consistent with mainstream databases. Here is an example query output with and without this fix. `SELECT count() FROM testData;` The output before this fix: `0` The output after this fix: ``` org.apache.spark.sql.AnalysisException cannot resolve 'count()' due to data type mismatch: count requires at least one argument.; line 1 pos 7 ``` ### Does this PR introduce _any_ user-facing change? Yes. Calling `count` without a parameter now throws an error. ### How was this patch tested? Jenkins test. Closes #30541 from beliefer/SPARK-28646. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-30 17:04:38 +09:00
xuewei.linxuewei	225c2e2815	[SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using checkCastWithParseError ### What changes were proposed in this pull request? Dup code removed in SPARK-33498 as follow-up. ### Why are the changes needed? Nit. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30540 from leanken/leanken-SPARK-33498. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-30 15:36:26 +09:00
Terry Kim	0fd9f57dd4	[SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables ### What changes were proposed in this pull request? This PR proposes to support `CACHE/UNCACHE TABLE` commands for v2 tables. In addition, this PR proposes to migrate `CACHE/UNCACHE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To support `CACHE/UNCACHE TABLE` commands for v2 tables. Note that `CACHE/UNCACHE TABLE` for v1 tables/views goes through `SparkSession.table` to resolve the identifier, which resolves temp views first, so there is no change in the behavior by moving to the new framework. ### Does this PR introduce _any_ user-facing change? Yes. Now the user can run `CACHE/UNCACHE TABLE` commands on v2 tables. ### How was this patch tested? Added/updated existing tests. Closes #30403 from imback82/cache_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 05:37:10 +00:00
Kent Yao	2da72593c1	[SPARK-32976][SQL] Support column list in INSERT statement ### What changes were proposed in this pull request? #### JIRA expectations ``` INSERT currently does not support named column lists. INSERT INTO <table> (col1, col2,…) VALUES( 'val1', 'val2', … ) Note, we assume the column list contains all the column names. Issue an exception if the list is not complete. The column order could be different from the column order defined in the table definition. ``` #### implementation In this PR, we add a column list as an optional part to the `INSERT OVERWRITE/INTO` statements: ``` /** * {{{ * INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? [identifierList] ... * INSERT INTO [TABLE] tableIdentifier [partitionSpec] [identifierList] ... * }}} */ ``` The column list represents all expected columns, in an explicit order, that you want to insert into the target table. In particular, the current implementation assumes the column list contains all the column names and fails when the list is incomplete. In Analyzer, we add a code path to resolve the column list in the `ResolveOutputRelation` rule before it is transformed to a v1 or v2 command. It will fail here if the list has any field that does not belong to the target table. Then, for a v2 command, e.g. `AppendData`, we use the resolved column list and output of the target table to resolve the output of the source query `ResolveOutputRelation` rule. If the list has duplicated columns, we fail. If the list is not empty but the list size does not match the target table, we fail. If no other exceptions occur, we use the column list to map the output of the source query to the output of the target table. The column list will be set to Nil and it will not hit the rule again after it is resolved. For a v1 command, all of this happens in the `PreprocessTableInsertion` rule ### Why are the changes needed? new feature support ### Does this PR introduce _any_ user-facing change? 
yes, insert into/overwrite table support specify column list ### How was this patch tested? new tests Closes #29893 from yaooqinn/SPARK-32976. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 05:23:23 +00:00
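The column-list checks described in SPARK-32976 above (unknown columns, duplicates, size mismatch, then reordering to the table's column order) can be sketched in plain Scala. This is only an illustration, not Spark's analyzer code: the object name `InsertColumnList`, the method `reorder`, and the error strings are hypothetical, and name matching is assumed to be case-sensitive for simplicity.

```scala
// Hypothetical standalone sketch; not the actual ResolveOutputRelation logic.
object InsertColumnList {
  /** Maps values given in `columnList` order onto the table's own column order. */
  def reorder(
      tableColumns: Seq[String],
      columnList: Seq[String],
      values: Seq[Any]): Either[String, Seq[Any]] = {
    val unknown = columnList.filterNot(tableColumns.contains)
    if (unknown.nonEmpty) {
      // A listed column does not belong to the target table.
      Left(s"Unknown column(s) in the column list: ${unknown.mkString(", ")}")
    } else if (columnList.distinct.size != columnList.size) {
      Left("Duplicate column(s) in the column list")
    } else if (columnList.size != tableColumns.size) {
      // The commit message assumes the list names every table column.
      Left("The column list must name every column of the target table")
    } else if (values.size != columnList.size) {
      Left("The number of values does not match the column list")
    } else {
      val byName = columnList.zip(values).toMap
      Right(tableColumns.map(byName))
    }
  }
}
```

For a table `(a, b)`, the list `(b, a)` with values `(1, 2)` is reordered so that `a` receives `2` and `b` receives `1`.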
Chao Sun	feda7299e3	[SPARK-33567][SQL] DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache ### What changes were proposed in this pull request? This replaces Spark session and `DataSourceV2Relation` in V2 write plans by replacing them with a callback `afterWrite`. ### Why are the changes needed? Per discussion in #30429, it's better to not pass Spark session and `DataSourceV2Relation` through Spark plans. Instead we can use a callback which makes the interface cleaner. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30491 from sunchao/SPARK-33492-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-30 04:50:50 +00:00
Yuming Wang	a5e13acd19	[SPARK-33582][SQL] Hive Metastore support filter by not-equals ### What changes were proposed in this pull request? This PR makes partition predicate pushdown into the Hive metastore support the not-equals operator. Hive related changes: `b8bd4594be/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java (L2194-L2207)` https://issues.apache.org/jira/browse/HIVE-2702 ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30534 from wangyum/SPARK-33582. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-30 11:24:15 +09:00
Yuming Wang	f93d4395b2	[SPARK-33589][SQL] Close opened session if the initialization fails ### What changes were proposed in this pull request? This pr add try catch when opening session. ### Why are the changes needed? Close opened session if the initialization fails. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Before this pr: ``` [rootspark-3267648 spark]# bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. Connecting to jdbc:hive2://localhost:10000/db_not_exist log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Database 'db_not_exist' not found; (state=08S01,code=0) Beeline version 2.3.7 by Apache Hive beeline> ``` ![image](https://user-images.githubusercontent.com/5399861/100560975-73ba5d80-32f2-11eb-8f92-b2509e7a121f.png) After this pr: ``` [rootspark-3267648 spark]# bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 
Connecting to jdbc:hive2://localhost:10000/db_not_exist Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Failed to open new session: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db_not_exist' not found; (state=08S01,code=0) Beeline version 2.3.7 by Apache Hive beeline> ``` ![image](https://user-images.githubusercontent.com/5399861/100560917-479edc80-32f2-11eb-986f-7a997f1163fc.png) Closes #30536 from wangyum/SPARK-33589. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-30 11:21:02 +09:00
Max Gekk	a088a801ed	[SPARK-33585][SQL][DOCS] Fix the comment for `SQLContext.tables()` and mention the `database` column ### What changes were proposed in this pull request? Change the comments for `SQLContext.tables()` to "The returned DataFrame has three columns, database, tableName and isTemporary". ### Why are the changes needed? Currently, the comment mentions only 2 columns but `tables()` returns 3 columns actually: ```scala scala> spark.range(10).createOrReplaceTempView("view1") scala> val tables = spark.sqlContext.tables() tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field] scala> tables.printSchema root \|-- database: string (nullable = false) \|-- tableName: string (nullable = false) \|-- isTemporary: boolean (nullable = false) scala> tables.show +--------+---------+-----------+ \|database\|tableName\|isTemporary\| +--------+---------+-----------+ \| default\| t1\| false\| \| default\| t2\| false\| \| default\| ymd\| false\| \| \| view1\| true\| +--------+---------+-----------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle` Closes #30526 from MaxGekk/sqlcontext-tables-doc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-29 12:18:07 -08:00
Max Gekk	0054fc937f	[SPARK-33588][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW TABLE EXTENDED` ### What changes were proposed in this pull request? Perform partition spec normalization in `ShowTablesCommand` according to the table schema before getting partitions from the catalog. The normalization via `PartitioningUtils.normalizePartitionSpec()` adjusts the column names in partition specification, w.r.t. the real partition column names and case sensitivity. ### Why are the changes needed? Even when `spark.sql.caseSensitive` is `false` which is the default value, v1 `SHOW TABLE EXTENDED` is case sensitive: ```sql spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > partitioned by (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`'; ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the `SHOW TABLE EXTENDED` command respects the SQL config. 
And for example above, it returns correct result: ```sql spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1); default tbl1 false Partition Values: [year=2015, month=1] Location: file:/Users/maximgekk/spark-warehouse/tbl1/year=2015/month=1 Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Storage Properties: [serialization.format=1, path=file:/Users/maximgekk/spark-warehouse/tbl1] Partition Parameters: {transient_lastDdlTime=1606595118, totalSize=623, numFiles=1} Created Time: Sat Nov 28 23:25:18 MSK 2020 Last Access: UNKNOWN Partition Statistics: 623 bytes ``` ### How was this patch tested? By running the modified test suite `v1/ShowTablesSuite` Closes #30529 from MaxGekk/show-table-case-sensitive-spec. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-29 12:10:16 -08:00
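The normalization step described in SPARK-33588 above can be illustrated with a small standalone sketch. It is not the code of `PartitioningUtils.normalizePartitionSpec()`; the object name, signature, and error text are assumptions made for the example. The idea shown is only that each user-supplied key is matched against the real partition column names with a resolver that honors the case-sensitivity setting.

```scala
// Hypothetical sketch of case-aware partition spec normalization.
object PartitionSpecNormalizer {
  def normalize(
      spec: Map[String, String],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Either[String, Map[String, String]] = {
    // The resolver decides whether two column names refer to the same column.
    val resolver: (String, String) => Boolean =
      if (caseSensitive) (a, b) => a == b
      else (a, b) => a.equalsIgnoreCase(b)

    // Replace each user-supplied key with the real partition column name.
    val resolved = spec.toSeq.map { case (key, value) =>
      partitionColumns.find(resolver(_, key)).map(_ -> value).toRight(key)
    }
    val unresolved = resolved.collect { case Left(key) => key }
    if (unresolved.nonEmpty) {
      Left(s"Partition spec is invalid: unknown column(s) ${unresolved.mkString(", ")}")
    } else {
      Right(resolved.collect { case Right(kv) => kv }.toMap)
    }
  }
}
```

With `caseSensitive = false`, the spec `(YEAR = 2015, Month = 1)` resolves to `(year = 2015, month = 1)`; with `caseSensitive = true` it is rejected, which matches the behavior the commit describes for the two settings.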
Yuming Wang	ba178f852f	[SPARK-33581][SQL][TEST] Refactor HivePartitionFilteringSuite ### What changes were proposed in this pull request? This PR refactors HivePartitionFilteringSuite. ### Why are the changes needed? To make it easier to maintain. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30525 from wangyum/SPARK-33581. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-29 09:36:55 +08:00
Max Gekk	bfe9380ba2	[MINOR][SQL] Remove `getTables()` from `r.SQLUtils` ### What changes were proposed in this pull request? Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes https://github.com/apache/spark/pull/17483 but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223 ### Why are the changes needed? To improve code maintenance, and remove the dead code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By R tests. Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-28 16:58:40 -08:00
luluorta	35ded12fc6	[SPARK-33141][SQL] Capture SQL configs when creating permanent views ### What changes were proposed in this pull request? This PR makes CreateViewCommand/AlterViewAsCommand capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. Users can set `spark.sql.legacy.useCurrentConfigsForView` to `true` to restore the previous behavior. ### Why are the changes needed? This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138) that proposes to unify temp view and permanent view behaviors. This PR makes permanent views mimic the temp view behavior that "fixes" view semantics by directly storing the resolved LogicalPlan. For example, if a user uses Spark 2.4 to create a view that contains null values from division-by-zero expressions, she may not want other users' queries that reference her view to throw exceptions when running on Spark 3.x with ANSI mode on. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? added UT + existing UTs (improved) Closes #30289 from luluorta/SPARK-33141. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-27 13:32:25 +00:00
xuewei.linxuewei	b9f2f78de5	[SPARK-33498][SQL] Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid ### What changes were proposed in this pull request? Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid, when ANSI mode is enabled. This patch updates GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and existing UT. Closes #30442 from leanken/leanken-SPARK-33498. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-27 13:24:11 +00:00
gengjiaan	e43255051c	[SPARK-28645][SQL] ParseException is thrown when the window is redefined ### What changes were proposed in this pull request? Currently, Spark allows a window to be redefined. For instance: `select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w AS (ORDER BY unique1);` The window `w` is defined twice. In PgSQL, on the other hand, an error is thrown: `ERROR: window "w" is already defined` ### Why are the changes needed? The current implementation gives the later window definition higher priority. This was not Spark's intention, and it is not described in any Spark documentation. This PR fixes the bug. ### Does this PR introduce _any_ user-facing change? Yes. There is an example query output with/without this fix. ``` SELECT employee_name, salary, first_value(employee_name) OVER w highest_salary, nth_value(employee_name, 2) OVER w second_highest_salary FROM basic_pays WINDOW w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING), w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 FOLLOWING) ORDER BY salary DESC ``` The output before this fix: ``` Larry Bott 11798 Larry Bott Gerard Bondur Gerard Bondur 11472 Larry Bott Gerard Bondur Pamela Castillo 11303 Larry Bott Gerard Bondur Barry Jones 10586 Larry Bott Gerard Bondur George Vanauf 10563 Larry Bott Gerard Bondur Loui Bondur 10449 Larry Bott Gerard Bondur Mary Patterson 9998 Larry Bott Gerard Bondur Steve Patterson 9441 Larry Bott Gerard Bondur Julie Firrelli 9181 Larry Bott Gerard Bondur Jeff Firrelli 8992 Larry Bott Gerard Bondur William Patterson 8870 Larry Bott Gerard Bondur Diane Murphy 8435 Larry Bott Gerard Bondur Leslie Jennings 8113 Larry Bott Gerard Bondur Gerard Hernandez 6949 Larry Bott Gerard Bondur Foon Yue Tseng 6660 Larry Bott Gerard Bondur Anthony Bow 6627 Larry Bott Gerard Bondur Leslie Thompson 5186 Larry Bott Gerard Bondur ``` The output after this fix: ``` struct<> -- !query output 
org.apache.spark.sql.catalyst.parser.ParseException The definition of window 'w' is repetitive(line 8, pos 0) ``` ### How was this patch tested? Jenkins test. Closes #30512 from beliefer/SPARK-28645. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-27 10:27:08 +00:00
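The check introduced by SPARK-28645 above amounts to rejecting a `WINDOW` clause that names the same window twice. A minimal sketch of that detection, independent of Spark's parser, might look as follows. The object and method names are hypothetical, and comparing names with exact string equality is an assumption; the real parser may treat identifier case differently.

```scala
import scala.collection.mutable

// Hypothetical sketch; Spark performs this check while building the AST.
object WindowDefinitions {
  /** Returns the first window name that is defined more than once, if any. */
  def firstDuplicate(windowNames: Seq[String]): Option[String] = {
    val seen = mutable.Set.empty[String]
    // `add` returns false when the name was already present.
    windowNames.find(name => !seen.add(name))
  }
}
```

A caller would raise a parse error when the result is defined, instead of silently letting the later definition win.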
Terry Kim	2c41d9d8fa	[SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView ### What changes were proposed in this pull request? This PR proposes to improve the exception messages while `UnresolvedTableOrView` is handled based on this suggestion: https://github.com/apache/spark/pull/30321#discussion_r521127001. Currently, when an identifier is resolved to a temp view when a table/permanent view is expected, the following exception message is displayed (e.g., for `SHOW CREATE TABLE`): ``` t is a temp view not table or permanent view. ``` After this PR, the message will be: ``` t is a temp view. 'SHOW CREATE TABLE' expects a table or permanent view. ``` Also, if an identifier is not resolved, the following exception message is currently used: ``` Table or view not found: t ``` After this PR, the message will be: ``` Table or permanent view not found for 'SHOW CREATE TABLE': t ``` or ``` Table or view not found for 'ANALYZE TABLE ... FOR COLUMNS ...': t ``` ### Why are the changes needed? To improve the exception message. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed as described above. ### How was this patch tested? Updated existing tests. Closes #30475 from imback82/unresolved_table_or_view. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-27 10:16:56 +00:00
Terry Kim	8792280a73	[SPARK-33575][SQL] Fix misleading exception for "ANALYZE TABLE ... FOR COLUMNS" on temporary views ### What changes were proposed in this pull request? This PR proposes to fix the exception message for `ANALYZE TABLE ... FOR COLUMNS` on temporary views. The current behavior throws `NoSuchTableException` even if the temporary view exists: ``` sql("CREATE TEMP VIEW t AS SELECT 1 AS id") sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS id") org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 't' not found in database 'db'; at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:76) at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54) ``` After this PR, more reasonable exception is thrown: ``` org.apache.spark.sql.AnalysisException: Temporary view `testView` is not cached for analyzing columns.; [info] at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:74) [info] at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54) ``` ### Why are the changes needed? To fix a misleading exception. ### Does this PR introduce _any_ user-facing change? Yes, the exception thrown is changed as shown above. ### How was this patch tested? Updated existing test. Closes #30519 from imback82/analyze_table_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-27 07:08:24 +00:00
yangjie01	433ae9064f	[SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV ### What changes were proposed in this pull request? There are some differences between Spark CSV, opencsv and commons-csv. The typical case is described in SPARK-33566: when a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ. The reason for the difference is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` to build `CsvParser`, and it is not configurable. On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default. So this PR makes the `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv. ### Why are the changes needed? Make the `unescapedQuoteHandling` option configurable when reading CSV to make parsing more flexible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Add a new case similar to that described in SPARK-33566 Closes #30518 from LuciferYang/SPARK-33566. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 15:47:39 +09:00
Maryann Xue	dfa3978d91	[SPARK-33551][SQL] Do not use custom shuffle reader for repartition ### What changes were proposed in this pull request? This PR fixes an AQE issue where local shuffle reader, partition coalescing, or skew join optimization can be mistakenly applied to a shuffle introduced by repartition or a regular shuffle that logically replaces a repartition shuffle. The proposed solution checks for the presence of any repartition shuffle and filters out not applicable optimization rules for the final stage in an AQE plan. ### Why are the changes needed? Without the change, the output of a repartition query may not be correct. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #30494 from maryannxue/csr-repartition. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-11-25 19:32:22 -08:00
Liang-Chi Hsieh	fb7b870214	[SPARK-33523][SQL][TEST][FOLLOWUP] Fix benchmark case name in SubExprEliminationBenchmark ### What changes were proposed in this pull request? Fix the wrong benchmark case name. ### Why are the changes needed? The last commit to refactor the benchmark code missed a change of case name. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30505 from viirya/SPARK-33523-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 15:22:47 -08:00
Yuming Wang	1de3fc4282	[SPARK-33525][SQL] Update hive-service-rpc to 3.1.2 ### What changes were proposed in this pull request? We support Hive metastore versions 0.12.0 through 3.1.2, but hive-jdbc only from 0.12.0 through 2.3.7. Spark throws `TProtocolException` if we use hive-jdbc 3.x: ``` [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable. Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) ``` ``` org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client? at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) ``` This PR upgrades hive-service-rpc to 3.1.2 to fix this issue. ### Why are the changes needed? To support hive-jdbc 3.x. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
Manual test: ``` [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; +---------+ \| Result \| +---------+ +---------+ No rows selected (1.051 seconds) 0: jdbc:hive2://localhost:10000/default> insert into t1 values(1); +---------+ \| Result \| +---------+ +---------+ No rows selected (2.08 seconds) 0: jdbc:hive2://localhost:10000/default> select * from t1; +-----+ \| id \| +-----+ \| 1 \| +-----+ 1 row selected (0.605 seconds) ``` Closes #30478 from wangyum/SPARK-33525. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 12:37:59 -08:00
Dongjoon Hyun	7cf6a6f996	[SPARK-31257][SPARK-33561][SQL][FOLLOWUP] Fix Scala 2.13 compilation ### What changes were proposed in this pull request? This PR is a follow-up to fix Scala 2.13 compilation. ### Why are the changes needed? To support Scala 2.13 in Apache Spark 3.1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action Scala 2.13 compilation job. Closes #30502 from dongjoon-hyun/SPARK-31257. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 09:57:46 -08:00
Liang-Chi Hsieh	9643eab53e	[SPARK-33540][SQL] Subexpression elimination for interpreted predicate ### What changes were proposed in this pull request? This patch proposes to support subexpression elimination for interpreted predicate. ### Why are the changes needed? Similar to interpreted projection, there are use cases when codegen predicate is not able to work, e.g. too complex schema, non-codegen expression, etc. When there are frequently occurring expressions (subexpressions) among predicate expression, the performance is quite bad as we need to re-compute same expressions. We should be able to support subexpression elimination for interpreted predicate like interpreted projection. ### Does this PR introduce _any_ user-facing change? No, this doesn't change user behavior. ### How was this patch tested? Unit test and benchmark. Closes #30497 from viirya/SPARK-33540. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-25 08:55:39 -08:00
Gengliang Wang	d691d85701	[SPARK-33496][SQL] Improve error message of ANSI explicit cast ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/30260, there are some type conversions disallowed under ANSI mode. We should tell users what they can do if they have to use the disallowed casting. ### Why are the changes needed? Make it more user-friendly. ### Does this PR introduce _any_ user-facing change? Yes, the error message is improved on casting failure when ANSI mode is enabled ### How was this patch tested? Unit tests. Closes #30440 from gengliangwang/improveAnsiCastErrorMSG. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-11-25 23:15:52 +08:00
Ryan Blue	6f68ccf532	[SPARK-31257][SPARK-33561][SQL] Unify create table syntax ### What changes were proposed in this pull request? * Unify the create table syntax in the parser by merging Hive and DataSource clauses * Add `SerdeInfo` and `external` boolean to statement plans and update AstBuilder to produce them * Add conversion from create statement plan to v1 create plans in ResolveSessionCatalog * Support new statement clauses in ResolveCatalogs conversion to v2 create plans * Remove SparkSqlParser rules for Hive syntax * Add "option." namespace to distinguish SERDEPROPERTIES and OPTIONS in table properties ### Why are the changes needed? * Current behavior is confusing. * A way to pass the Hive create options to DSv2 is needed for a Hive source. ### Does this PR introduce any user-facing change? Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses. ### How was this patch tested? Existing tests validate there are no behavior changes. Update unit tests for using a statement plan for Hive create syntax: * Move create tests from spark-sql DDLParserSuite into PlanResolutionSuite * Add parser tests to spark-catalyst DDLParserSuite Closes #28026 from rdblue/unify-create-table. Lead-authored-by: Ryan Blue <blue@apache.org> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 15:09:02 +00:00
duripeng	7c59aeeef4	[SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode ### What changes were proposed in this pull request? When using dynamic partition overwrite, each task has its working dir under a staging dir like `stagingDir/.spark-staging-{jobId}`, and each task commits to `outputPath/.spark-staging-{jobId}/{partitionId}/part-{taskId}-{jobId}{ext}`. When speculation is enabled, multiple task attempts can be set up for one task; they have the same task id and would commit to the same file concurrently. When a host goes down or a node is preempted, the partly-committed files aren't cleaned up, and a FileAlreadyExistsException is raised in this situation, resulting in job failure. I don't try to change the task commit process for dynamic partition overwrite, such as adding the attempt id to the task working dir for each attempt and committing to the final output dir via a new outputCommitCoordinator, for these reasons: 1. `FileOutputCommitter` already has a commit coordinator for each task attempt, so we can leverage it rather than build a new one. 2. Even if we implemented a coordinator that solves task attempt commit conflicts, in a severe case such as application master failover, tasks with the same attempt id and the same task id would commit to the same files, so the `FileAlreadyExistsException` risk would still exist. In this PR, I leverage FileOutputCommitter to solve the problem: 1. when initializing a write job description, set `outputPath/.spark-staging-{jobId}` as the output dir 2. each task attempt writes output to `outputPath/.spark-staging-{jobId}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/{partitionId}/part-{taskId}-{jobId}{ext}` 3. leveraging the `FileOutputCommitter` coordinator, the write job first commits output to `outputPath/.spark-staging-{jobId}/{partitionId}` 4. for dynamic partition overwrite, the write job finally moves `outputPath/.spark-staging-{jobId}/{partitionId}` to `outputPath/{partitionId}` ### Why are the changes needed? 
Without this pr, dynamic partition overwrite would fail ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? added UT. Closes #29000 from WinkerDu/master-fix-dynamic-partition-multi-commit. Authored-by: duripeng <duripeng@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 12:50:21 +00:00
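The directory layout listed in steps 1 to 4 of SPARK-27194 above can be written out as plain path-building functions. This sketch only mirrors the path templates quoted in the commit message; the object name, method names, and parameter types are hypothetical, and nothing here touches a file system or Hadoop's `FileOutputCommitter`.

```scala
// Hypothetical helper that spells out the path templates from the commit message.
object StagingLayout {
  /** Step 1: the job's output dir is the staging dir under the real output path. */
  def stagingDir(outputPath: String, jobId: String): String =
    s"$outputPath/.spark-staging-$jobId"

  /** Step 2: each task attempt writes under its own _temporary subtree. */
  def taskAttemptFile(
      outputPath: String,
      jobId: String,
      appAttemptId: Int,
      taskAttemptId: String,
      partitionId: String,
      taskId: String,
      ext: String): String = {
    val base = stagingDir(outputPath, jobId)
    s"$base/_temporary/$appAttemptId/_temporary/$taskAttemptId/" +
      s"$partitionId/part-${taskId}-${jobId}${ext}"
  }

  /** Step 3: job commit places partition output directly under the staging dir. */
  def committedPartitionDir(outputPath: String, jobId: String, partitionId: String): String =
    s"${stagingDir(outputPath, jobId)}/$partitionId"

  /** Step 4: dynamic partition overwrite finally moves the partition here. */
  def finalPartitionDir(outputPath: String, partitionId: String): String =
    s"$outputPath/$partitionId"
}
```

Because the task attempt id is part of the path in step 2, two speculative attempts of the same task no longer write to the same file.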
Max Gekk	2c5cc36e3f	[SPARK-33509][SQL] List partition by names from a V2 table which supports partition management ### What changes were proposed in this pull request? 1. Add a new method `listPartitionByNames` to the `SupportsPartitionManagement` interface. It allows listing partitions by partition names and their values. 2. Implement the new method in `InMemoryPartitionTable`, which is used in DSv2 tests. ### Why are the changes needed? Currently, the `SupportsPartitionManagement` interface exposes only `listPartitionIdentifiers`, which allows listing partitions by partition values. It requires specifying values for all partition schema fields in the prefix. This restriction does not allow listing partitions by only some of the partition names. For example, the table `tableA` is partitioned by two columns, `year` and `month` ``` CREATE TABLE tableA (price int, year int, month int) USING _ partitioned by (year, month) ``` and has the following partitions: ``` PARTITION(year = 2015, month = 1) PARTITION(year = 2015, month = 2) PARTITION(year = 2016, month = 2) PARTITION(year = 2016, month = 3) ``` If we want to list all partitions with `month = 2`, we have to specify `year` for listPartitionIdentifiers(), which is not always possible as we don't know all `year` values in advance. The new method listPartitionByNames() allows specifying partition values only for `month`, and returns two partitions: ``` PARTITION(year = 2015, month = 2) PARTITION(year = 2016, month = 2) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite `SupportsPartitionManagementSuite`. Closes #30452 from MaxGekk/column-names-listPartitionIdentifiers. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 12:41:53 +00:00
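The `tableA` example in SPARK-33509 above can be reproduced with a small in-memory sketch. It does not use the `SupportsPartitionManagement` interface or `InternalRow`; the object name, the `Partition` alias, and the `listByNames` signature are hypothetical stand-ins chosen to show the idea of filtering on a subset of partition columns.

```scala
// Hypothetical in-memory illustration of listing partitions by a subset of names.
object PartitionListing {
  type Partition = Map[String, Int]

  /** Keeps the partitions whose values match for every requested column name. */
  def listByNames(
      partitions: Seq[Partition],
      names: Seq[String],
      values: Seq[Int]): Seq[Partition] = {
    require(names.length == values.length, "names and values must have the same length")
    val wanted = names.zip(values)
    partitions.filter { partition =>
      wanted.forall { case (name, value) => partition.get(name).contains(value) }
    }
  }

  // The four partitions of `tableA` from the commit message.
  val tableA: Seq[Partition] = Seq(
    Map("year" -> 2015, "month" -> 1),
    Map("year" -> 2015, "month" -> 2),
    Map("year" -> 2016, "month" -> 2),
    Map("year" -> 2016, "month" -> 3))
}
```

Calling `PartitionListing.listByNames(PartitionListing.tableA, Seq("month"), Seq(2))` keeps the `year = 2015, month = 2` and `year = 2016, month = 2` partitions without the caller having to know any `year` value.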
Gengliang Wang	19f3b89d62	[SPARK-33549][SQL] Remove configuration spark.sql.legacy.allowCastNumericToTimestamp ### What changes were proposed in this pull request? Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp ### Why are the changes needed? In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true. After https://github.com/apache/spark/pull/30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need to a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior. As the configuration is not in any released yet, we should remove the configuration to make things simpler. ### Does this PR introduce _any_ user-facing change? No, since the configuration is not released yet. ### How was this patch tested? Existing test cases Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 08:59:31 +00:00
Yuming Wang	781e19c4d1	[SPARK-33477][SQL] Hive Metastore support filter by date type ### What changes were proposed in this pull request? Hive Metastore supports strings and integral types in filters. It could also support dates. Please see [HIVE-5679](`5106bf1c86`) for more details. This PR adds that support. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30408 from wangyum/SPARK-33477. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-25 16:38:55 +09:00
Kousuke Saruta	c3ce9701b4	[SPARK-33533][SQL] Fix the regression bug that ConnectionProviders don't consider case-sensitivity for properties ### What changes were proposed in this pull request? This PR fixes an issue that `BasicConnectionProvider` doesn't consider case-sensitivity for properties. For example, the property `oracle.jdbc.mapDateToTimestamp` should be considered case-sensitivity but it is not considered. ### Why are the changes needed? This is a bug introduced by #29024 . Caused by this issue, `OracleIntegrationSuite` doesn't pass. ``` [info] - SPARK-16625: General data types to be mapped to Oracle * FAILED * (32 seconds, 129 milliseconds) [info] types.apply(9).equals(org.apache.spark.sql.types.DateType) was false (OracleIntegrationSuite.scala:238) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.jdbc.OracleIntegrationSuite.$anonfun$new$4(OracleIntegrationSuite.scala:238) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:392) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61) [info] at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With this change, I confirmed that `OracleIntegrationSuite` passes with the following command. ``` $ git clone https://github.com/oracle/docker-images.git $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles $ ./buildDockerImage.sh -v 18.4.0 -x $ ORACLE_DOCKER_IMAGE_NAME=oracle/database:18.4.0-xe build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver "testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite" ``` Closes #30485 from sarutak/fix-oracle-integration-suite. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-24 20:18:45 -08:00
Jungtaek Lim (HeartSaVioR)	edab094dda	[SPARK-33224][SS][WEBUI] Add watermark gap information into SS UI page ### What changes were proposed in this pull request? This PR proposes to add the watermark gap information in SS UI page. Please refer below screenshots to see what we'd like to show in UI. ![Screen Shot 2020-11-19 at 6 56 38 PM](https://user-images.githubusercontent.com/1317309/99669306-3532d080-2ab2-11eb-9a93-03d2c6a54948.png) Please note that this PR doesn't plot the watermark value - knowing the gap between actual wall clock and watermark looks more useful than the absolute value. ### Why are the changes needed? Watermark is the one of major metrics the end users need to track for stateful queries. Watermark defines "when" the output will be emitted for append mode, hence knowing how much gap between wall clock and watermark (input data) is very helpful to make expectation of the output. ### Does this PR introduce _any_ user-facing change? Yes, SS UI query page will contain the watermark gap information. ### How was this patch tested? Basic UT added. Manually tested with two queries: > simple case You'll see consistent watermark gap with (15 seconds + a) = 10 seconds are from delay in watermark definition, 5 seconds are trigger interval. 
``` import org.apache.spark.sql.streaming.Trigger spark.conf.set("spark.sql.shuffle.partitions", "10") val query = spark .readStream .format("rate") .option("rowsPerSecond", 1000) .option("rampUpTime", "10s") .load() .selectExpr("timestamp", "mod(value, 100) as mod", "value") .withWatermark("timestamp", "10 seconds") .groupBy(window($"timestamp", "1 minute", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .writeStream .format("console") .trigger(Trigger.ProcessingTime("5 seconds")) .outputMode("append") .start() query.awaitTermination() ``` ![Screen Shot 2020-11-19 at 7 00 21 PM](https://user-images.githubusercontent.com/1317309/99669049-dbcaa180-2ab1-11eb-8789-10b35857dda0.png) > complicated case This randomizes the timestamp, hence producing a random watermark gap. This won't be smaller than 15 seconds as I described earlier. ``` import org.apache.spark.sql.streaming.Trigger spark.conf.set("spark.sql.shuffle.partitions", "10") val query = spark .readStream .format("rate") .option("rowsPerSecond", 1000) .option("rampUpTime", "10s") .load() .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod") .selectExpr("tsMod", "mod(value, 100) as mod", "value") .withWatermark("tsMod", "10 seconds") .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .writeStream .format("console") .trigger(Trigger.ProcessingTime("5 seconds")) .outputMode("append") .start() query.awaitTermination() ``` ![Screen Shot 2020-11-19 at 6 56 47 PM](https://user-images.githubusercontent.com/1317309/99669029-d5d4c080-2ab1-11eb-9c63-d05b3e1ab391.png) Closes #30427 from HeartSaVioR/SPARK-33224. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-25 13:12:20 +09:00
Terry Kim	b7f034d8dc	[SPARK-33543][SQL] Migrate SHOW COLUMNS command to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `SHOW COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `SHOW COLUMNS` is not yet supported for v2 tables. ### Why are the changes needed? To use `UnresolvedTableOrView` for table/view resolution. Note that `ShowColumnsCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30490 from imback82/show_columns. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 03:04:04 +00:00
Wenchen Fan	d1b4f06179	[SPARK-33494][SQL][AQE] Do not use local shuffle reader for repartition ### What changes were proposed in this pull request? This PR updates `ShuffleExchangeExec` to carry more information about how much we can change the partitioning. For `repartition(col)`, we should preserve the user-specified partitioning and don't apply the AQE local shuffle reader. ### Why are the changes needed? Similar to `repartition(number, col)`, we should respect the user-specified partitioning. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? a new test Closes #30432 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-25 02:02:32 +00:00
Gabor Somogyi	95b6dabc33	[SPARK-33287][SS][UI] Expose state custom metrics information on SS UI ### What changes were proposed in this pull request? Structured Streaming UI is not containing state custom metrics information. In this PR I've added it. ### Why are the changes needed? Missing state custom metrics information. ### Does this PR introduce _any_ user-facing change? Additional UI elements appear. ### How was this patch tested? Existing unit tests + manual test. ``` #Compile Spark echo "spark.sql.streaming.ui.enabledCustomMetricList stateOnCurrentVersionSizeBytes" >> conf/spark-defaults.conf sbin/start-master.sh sbin/start-worker.sh spark://gsomogyi-MBP16:7077 ./bin/spark-submit --master spark://gsomogyi-MBP16:7077 --deploy-mode client --class com.spark.Main ../spark-test/target/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar ``` <img width="1119" alt="Screenshot 2020-11-18 at 12 45 36" src="https://user-images.githubusercontent.com/18561820/99527506-2f979680-299d-11eb-9187-4ae7fbd2596a.png"> Closes #30336 from gaborgsomogyi/SPARK-33287. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-25 07:38:45 +09:00
Terry Kim	fdd6c73b3c	[SPARK-33514][SQL] Migrate TRUNCATE TABLE command to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `TRUNCATE TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `TRUNCATE TABLE` works only with v1 tables, and is not supported for v2 tables. ### Why are the changes needed? The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior: ```scala sql("CREATE TEMPORARY VIEW t AS SELECT 1") sql("CREATE DATABASE db") sql("CREATE TABLE db.t using csv AS SELECT 1") sql("USE db") sql("TRUNCATE TABLE t") // Succeeds ``` With this PR, `TRUNCATE TABLE` above fails with the following: ``` org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$42(Analyzer.scala:866) ``` , which is expected since the temporary view is resolved first and `TRUNCATE TABLE` doesn't support a temporary view. ### Does this PR introduce _any_ user-facing change? After this PR, `TRUNCATE TABLE` is resolved to a temp view `t` instead of table `db.t` in the above scenario. ### How was this patch tested? Updated existing tests. Closes #30457 from imback82/truncate_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-24 11:06:39 +00:00
Max Gekk	a6555ee596	[SPARK-33521][SQL] Universal type conversion in resolving V2 partition specs ### What changes were proposed in this pull request? In the PR, I propose to change the resolver of partition specs used in V2 `ALTER TABLE .. ADD/DROP PARTITION` (at the moment), and re-use `CAST` to convert partition values to the desired types according to the partition schema. ### Why are the changes needed? Currently, the resolver of V2 partition specs supports just a few types: `23e9920b39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (L72)`, and fails on other types like date/timestamp. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running `AlterTablePartitionV2SQLSuite` Closes #30474 from MaxGekk/dsv2-partition-value-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-24 08:04:21 +00:00
Liang-Chi Hsieh	f35e28fea5	[SPARK-33523][SQL][TEST] Add predicate related benchmark to SubExprEliminationBenchmark ### What changes were proposed in this pull request? This patch adds predicate related benchmark to `SubExprEliminationBenchmark`. ### Why are the changes needed? We should have a benchmark for subexpression elimination of predicate. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Run benchmark locally. Closes #30476 from viirya/SPARK-33523. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-24 13:30:06 +09:00
Dongjoon Hyun	8380e00419	[SPARK-33524][SQL][TESTS] Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` ### What changes were proposed in this pull request? This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`. ### Why are the changes needed? SPARK-32168 made `InMemoryTable` handle `BucketTransform` as a hash of `Tuple`, which depends on the Scala version. - https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159 Scala 2.12.10 ```scala $ bin/scala Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272). Type in expressions for evaluation. Or try :help. scala> (1, 1).hashCode res0: Int = -2074071657 ``` Scala 2.13.3 ```scala Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272). Type in expressions for evaluation. Or try :help. scala> (1, 1).hashCode val res0: Int = -1669302457 ``` ### Does this PR introduce _any_ user-facing change? Yes. This is a correctness issue. ### How was this patch tested? Pass the UT with both Scala 2.12/2.13. Closes #30477 from dongjoon-hyun/SPARK-33524. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-23 19:35:58 -08:00
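The version dependence described above can be avoided by computing the bucket from explicit arithmetic instead of `Tuple.hashCode`. The following is a minimal sketch under that assumption; `StableBucketSketch` and `bucketOf` are hypothetical names for illustration and are not necessarily what the PR does in `InMemoryTable`.

```scala
object StableBucketSketch {
  // Hypothetical helper: fold the values with fixed arithmetic so the result
  // does not depend on how a given Scala version hashes tuples.
  def bucketOf(numBuckets: Int, values: Seq[Long]): Int = {
    val h = values.foldLeft(17L)((acc, v) => acc * 31L + v)
    // floorMod keeps the bucket id in [0, numBuckets) even when h is negative.
    java.lang.Math.floorMod(h, numBuckets.toLong).toInt
  }

  def main(args: Array[String]): Unit = {
    println(bucketOf(4, Seq(1L, 1L)))
  }
}
```

Because the hash is spelled out, the same input should land in the same bucket under Scala 2.12 and 2.13.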
Dongjoon Hyun	3ce4ab545b	[SPARK-33513][BUILD] Upgrade to Scala 2.13.4 to improve exhaustivity ### What changes were proposed in this pull request? This PR aims to do the following. 1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1 2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.) 3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job. ### Why are the changes needed? Scala 2.13.4 is a maintenance release for the 2.13 line and improves JDK 15 support. - https://github.com/scala/scala/releases/tag/v2.13.4 Also, it improves the exhaustivity check. - https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors) - https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components) ### Does this PR introduce _any_ user-facing change? Yep. Although it's a maintenance version change, it's a Scala version change. ### How was this patch tested? Pass the CIs and do the manual testing. - Scala 2.12 CI jobs(GitHub Action/Jenkins UT/Jenkins K8s IT) to check the validity of code change. - Scala 2.13 Compilation job to check the compilation Closes #30455 from dongjoon-hyun/SCALA_3.13. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-23 16:28:43 -08:00
gengjiaan	f83fcb1254	[SPARK-33278][SQL][FOLLOWUP] Improve OptimizeWindowFunctions to avoid transfer first to nth_value ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/30178 introduced `OptimizeWindowFunctions`, which converts `first` to `nth_value`. If the window frame is `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, `nth_value` has better performance than `first`. But `OptimizeWindowFunctions` needs to exclude other window frames. ### Why are the changes needed? Improve `OptimizeWindowFunctions` to avoid converting `first` to `nth_value` if the specified window frame isn't `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #30419 from beliefer/SPARK-33278_followup. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-23 14:54:44 +00:00
Max Gekk	23e9920b39	[SPARK-33511][SQL] Respect case sensitivity while resolving V2 partition specs ### What changes were proposed in this pull request? 1. Pre-process partition specs in `ResolvePartitionSpec`, and convert partition names according to the partition schema and the SQL config `spark.sql.caseSensitive`. In the PR, I propose to invoke `normalizePartitionSpec` for that. The function is used in DSv1 commands, so, the behavior will be similar to DSv1. 2. Move `normalizePartitionSpec()` from `sql/core/.../datasources/PartitioningUtils` to `sql/catalyst/.../util/PartitioningUtils` to use it in Catalyst's rule `ResolvePartitionSpec` ### Why are the changes needed? DSv1 commands like `ALTER TABLE .. ADD PARTITION` and `ALTER TABLE .. DROP PARTITION` respect the SQL config `spark.sql.caseSensitive` while resolving partition specs. For example: ```sql spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id); spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1); spark-sql> SHOW PARTITIONS tbl1; id=1 ``` The same command fails on V2 Table catalog with error: ``` AnalysisException: Partition key ID not exists ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, partition spec resolution works as for DSv1 (without the exception showed above). ### How was this patch tested? By running `AlterTablePartitionV2SQLSuite`. Closes #30454 from MaxGekk/partition-spec-case-sensitivity. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-23 09:00:41 +00:00
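The case-sensitivity handling described above can be sketched as a key normalization step. This is a simplified stand-in for illustration only: `PartitionSpecSketch` and `normalizeSpec` are hypothetical names, the `caseSensitive` flag stands in for the SQL config `spark.sql.caseSensitive`, and the real `normalizePartitionSpec` in Spark may differ in signature and error handling.

```scala
object PartitionSpecSketch {
  // Map each user-supplied partition key onto the name declared in the
  // partition schema, comparing case-insensitively unless caseSensitive is set.
  def normalizeSpec(
      spec: Map[String, String],
      partCols: Seq[String],
      caseSensitive: Boolean): Either[String, Map[String, String]] = {
    val resolver: (String, String) => Boolean =
      if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)
    val resolved = spec.toSeq.map { case (key, value) =>
      partCols.find(resolver(_, key))
        .map(_ -> value)
        .toRight(s"Partition key $key not exists")
    }
    // Report the first unresolved key, otherwise return the normalized spec.
    val errors = resolved.collect { case Left(err) => err }
    if (errors.nonEmpty) Left(errors.head)
    else Right(resolved.collect { case Right(kv) => kv }.toMap)
  }

  def main(args: Array[String]): Unit = {
    println(normalizeSpec(Map("ID" -> "1"), Seq("id"), caseSensitive = false))
    println(normalizeSpec(Map("ID" -> "1"), Seq("id"), caseSensitive = true))
  }
}
```

With `caseSensitive = false`, the key `ID` resolves to the schema column `id`; with `caseSensitive = true`, the lookup fails, which mirrors the V2 error quoted in the commit message.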
Terry Kim	60f3a730e4	[SPARK-33515][SQL] Improve exception messages while handling UnresolvedTable ### What changes were proposed in this pull request? This PR proposes to improve the exception messages while `UnresolvedTable` is handled based on this suggestion: https://github.com/apache/spark/pull/30321#discussion_r521127001. Currently, when an identifier is resolved to a view when a table is expected, the following exception message is displayed (e.g., for `COMMENT ON TABLE`): ``` v is a temp view not table. ``` After this PR, the message will be: ``` v is a temp view. 'COMMENT ON TABLE' expects a table. ``` Also, if an identifier is not resolved, the following exception message is currently used: ``` Table not found: t ``` After this PR, the message will be: ``` Table not found for 'COMMENT ON TABLE': t ``` ### Why are the changes needed? To improve the exception message. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed as described above. ### How was this patch tested? Updated existing tests. Closes #30461 from imback82/unresolved_table_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-23 08:54:00 +00:00
Xiao Li	c891e025b8	Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash" ### What changes were proposed in this pull request? This reverts commit `065f17386d`, which is not part of any released version. That is, this is an unreleased feature ### Why are the changes needed? I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex. I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance? In sum, I think we should not merge the PR https://github.com/apache/spark/pull/29552 without the design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1 ### Does this PR introduce _any_ user-facing change? Reverted the original commit ### How was this patch tested? The existing tests. Closes #30463 from gatorsmile/revertSpark-32481. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-23 17:43:58 +09:00
Liang-Chi Hsieh	aa78c05edc	[SPARK-33427][SQL][FOLLOWUP] Put key and value into IdentityHashMap sequantially ### What changes were proposed in this pull request? This follow-up fixes an issue when inserting key/value pairs into `IdentityHashMap` in `SubExprEvaluationRuntime`. ### Why are the changes needed? The last commits to #30341 follow a review comment to use `IdentityHashMap`. Because we leverage `IdentityHashMap` to compare keys by reference, we should not convert expression pairs to a Scala map before inserting. A Scala map compares keys by equality, so we would lose keys with different references. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run benchmark to verify. Closes #30459 from viirya/SPARK-33427-map. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-23 10:42:28 +09:00
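The difference between equality-keyed and reference-keyed maps described above can be reproduced in isolation. This is a minimal sketch: `Expr` is a hypothetical stand-in for a Catalyst expression, and `countKeys` is an illustrative helper, not code from `SubExprEvaluationRuntime`.

```scala
import java.util.IdentityHashMap

object IdentityMapSketch {
  // Hypothetical stand-in for an expression; case classes compare by value.
  final case class Expr(name: String)

  // Returns (size of an equality-keyed Scala Map, size of an IdentityHashMap)
  // built from the same key/value pairs.
  def countKeys(pairs: Seq[(Expr, String)]): (Int, Int) = {
    val byEquality = pairs.toMap // equal keys collapse into one entry
    val byReference = new IdentityHashMap[Expr, String]()
    pairs.foreach { case (k, v) => byReference.put(k, v) } // inserted one by one
    (byEquality.size, byReference.size)
  }

  def main(args: Array[String]): Unit = {
    val a = Expr("x + 1")
    val b = Expr("x + 1") // equal to a, but a different object
    val (eq, ref) = countKeys(Seq(a -> "first", b -> "second"))
    println(s"equality-keyed: $eq, identity-keyed: $ref")
  }
}
```

Going through `toMap` first would drop one of the two equal keys before the `IdentityHashMap` ever saw it, which is why the pairs are inserted sequentially.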
ulysses	6d625ccd5b	[SPARK-33469][SQL] Add current_timezone function ### What changes were proposed in this pull request? Add a `CurrentTimeZone` function and replace the value at `Optimizer` side. ### Why are the changes needed? Let user get current timezone easily. Then user can call ``` SELECT current_timezone() ``` Presto: https://prestodb.io/docs/current/functions/datetime.html SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/functions/current-timezone-transact-sql?view=sql-server-ver15 ### Does this PR introduce _any_ user-facing change? Yes, a new function. ### How was this patch tested? Add test. Closes #30400 from ulysses-you/SPARK-33469. Lead-authored-by: ulysses <youxiduo@weidian.com> Co-authored-by: ulysses-you <youxiduo@weidian.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-22 15:36:44 -08:00
CC Highman	d338af3101	[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source ### What changes were proposed in this pull request? Two new options, _modifiedBefore_ and _modifiedAfter_, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. _PartitioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter`. In order to filter file results, a new PathFilter class was derived for this purpose. General house-keeping around classes extending PathFilter was performed for neatness. It became apparent support was needed to handle multiple potential path filters. Logic was introduced for this purpose and the associated tests written. ### Why are the changes needed? When loading files from a data source, there can often be thousands of files within a respective file path. In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point. This would mean out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a ton of time automatically and reduces significant complexity managing this in code. ### Does this PR introduce _any_ user-facing change? This PR introduces an option that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and usage of the new data source option. 
Example Usages _Load all CSV files modified after date:_ `spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()` _Load all CSV files modified before date:_ `spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()` _Load all CSV files modified between two dates:_ `spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load() ` ### How was this patch tested? A handful of unit tests were added to support the positive, negative, and edge case code paths. It's also live in a handful of our Databricks dev environments. (quoted from cchighman) Closes #30411 from HeartSaVioR/SPARK-31962. Lead-authored-by: CC Highman <christopher.highman@microsoft.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-23 08:30:41 +09:00
angerszhu	d7f4b2ad50	[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+ ### What changes were proposed in this pull request? We skip the test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous version does not support JAVA_9 or later. We now add it back since we have a version that supports JAVA_9 or later. ### Why are the changes needed? To recover test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Check CI logs. Closes #30451 from AngersZhuuuu/SPARK-28704. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-22 10:29:15 -08:00
Gustavo Martin Morcuende	517b810dfa	[SPARK-33463][SQL] Keep Job Id during incremental collect in Spark Thrift Server ### What changes were proposed in this pull request? When `spark.sql.thriftServer.incrementalCollect` is enabled, Job Ids get lost and tracing queries in Spark Thrift Server ends up being too complicated. ### Why are the changes needed? Because it will make tracing Spark Thrift Server queries easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The current tests are enough. No need for more tests. Closes #30390 from gumartinm/master. Authored-by: Gustavo Martin Morcuende <gu.martinm@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-21 08:39:16 -08:00
Dongjoon Hyun	cf7490112a	Revert "[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+" This reverts commit `47326ac1c6`.	2020-11-20 19:01:58 -08:00
Max Gekk	530c0a8e28	[SPARK-33505][SQL][TESTS] Fix adding new partitions by INSERT INTO `InMemoryPartitionTable` ### What changes were proposed in this pull request? 1. Add a hook method to `addPartitionKey()` of `InMemoryTable` which is called per every row. 2. Override `addPartitionKey()` in `InMemoryPartitionTable`, and add partition key every time when new row is inserted to the table. ### Why are the changes needed? To be able to write unified tests for datasources V1 and V2. Currently, INSERT INTO a V1 table creates partitions but the same doesn't work for the custom catalog `InMemoryPartitionTableCatalog` used in DSv2 tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite `DataSourceV2SQLSuite`. Closes #30449 from MaxGekk/insert-into-InMemoryPartitionTable. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 18:41:25 -08:00
Jungtaek Lim (HeartSaVioR)	67c6ed9068	[SPARK-33223][SS][FOLLOWUP] Clarify the meaning of "number of rows dropped by watermark" in SS UI page ### What changes were proposed in this pull request? This PR fixes the representation to clarify the meaning of "number of rows dropped by watermark" in SS UI page. ### Why are the changes needed? `Aggregated Number Of State Rows Dropped By Watermark` says that the dropped rows are from the state, whereas they're not. We say "evicted from the state" for the case, which is "normal" to emit outputs and reduce memory usage of the state. The metric actually represents the number of "input" rows dropped by watermark, and the meaning of "input" is relative to the "stateful operator". That's a bit confusing as we normally think "input" as "input from source" whereas it's not. ### Does this PR introduce _any_ user-facing change? Yes, UI element & tooltip change. ### How was this patch tested? Only text change in UI, so we know how thing will be changed intuitively. Closes #30439 from HeartSaVioR/SPARK-33223-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-21 10:27:00 +09:00
anchovYu	de0f50abf4	[SPARK-32670][SQL] Group exception messages in Catalyst Analyzer in one file ### What changes were proposed in this pull request? Group all messages of `AnalysisException`s created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file. * Create a new object: `org.apache.spark.sql.CatalystErrors` with many exception-creating functions. * When the `Analyzer` wants to create and throw a new `AnalysisException`, call functions of `CatalystErrors` ### Why are the changes needed? This is the sample PR that groups exception messages together in several files. It will largely help with standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. ### Naming of exception functions All function names end with `Error`. * For specific errors like `groupingIDMismatch` and `groupingColInvalid`, directly use them as name, just like `groupingIDMismatchError` and `groupingColInvalidError`. * For generic errors like `dataTypeMismatch`, * if confident with the context, prefix and condition can be added, like `pivotValDataTypeMismatchError` * if not sure about the context, add a `For` suffix of the specific component that this exception is related to, like `dataTypeMismatchForDeserializerError` Closes #29497 from anchovYu/32670. Lead-authored-by: anchovYu <aureole@sjtu.edu.cn> Co-authored-by: anchovYu <xyyu15@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-21 08:33:39 +09:00
Chao Sun	2479778934	[SPARK-33492][SQL] DSv2: Append/Overwrite/ReplaceTable should invalidate cache ### What changes were proposed in this pull request? This adds changes in the following places: - logic to also refresh caches referencing the target table in v2 `AppendDataExec`, `OverwriteByExpressionExec`, `OverwritePartitionsDynamicExec`, as well as their v1 fallbacks `AppendDataExecV1` and `OverwriteByExpressionExecV1`. - logic to invalidate caches referencing the target table in v2 `ReplaceTableAsSelectExec` and its atomic version `AtomicReplaceTableAsSelectExec`. These are only supported in v2 at the moment though. In addition to the above, in order to test the v1 write fallback behavior, I extended `InMemoryTableWithV1Fallback` to also support batch reads. ### Why are the changes needed? Currently in DataSource v2 we don't refresh or invalidate caches referencing the target table when the table content is changed by operations such as append, overwrite, or replace table. This is different from DataSource v1, and could potentially cause data correctness issues if the stale caches are queried later. ### Does this PR introduce _any_ user-facing change? Yes. Now when a data source v2 table is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. ### How was this patch tested? Added unit tests for the new code path. Closes #30429 from sunchao/SPARK-33492. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 14:59:56 -08:00
angerszhu	47326ac1c6	[SPARK-28704][SQL][TEST] Add back Skipped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+ ### What changes were proposed in this pull request? We skipped the test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous versions did not support JAVA_9 or later. We now add it back since we have a version that supports JAVA_9 or later. ### Why are the changes needed? To recover test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Check CI logs. Closes #30428 from AngersZhuuuu/SPARK-28704. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 08:40:14 -08:00
ulysses	3384bda453	[SPARK-33468][SQL] ParseUrl in ANSI mode should fail if input string is not a valid url ### What changes were proposed in this pull request? With `ParseUrl`, instead of returning null, we throw an exception if the input string is not a valid URL. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce _any_ user-facing change? Yes, the user will get an exception if `set spark.sql.ansi.enabled=true`. ### How was this patch tested? Add test. Closes #30399 from ulysses-you/SPARK-33468. Lead-authored-by: ulysses <youxiduo@weidian.com> Co-authored-by: ulysses-you <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-20 13:23:08 +00:00
Max Gekk	870d409533	[SPARK-32512][SQL][TESTS][FOLLOWUP] Remove duplicate tests for ALTER TABLE .. PARTITIONS from DataSourceV2SQLSuite ### What changes were proposed in this pull request? Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by https://github.com/apache/spark/pull/29339. ### Why are the changes needed? - To reduce tests execution time - To improve test maintenance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified tests: ``` $ build/sbt "test:testOnly DataSourceV2SQLSuite" $ build/sbt "test:testOnly AlterTablePartitionV2SQLSuite" ``` Closes #30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-20 12:53:45 +00:00
Gabor Somogyi	883a213a8f	[MINOR] Structured Streaming statistics page indent fix ### What changes were proposed in this pull request? Structured Streaming statistics page code contains an indentation issue. This PR fixes it. ### Why are the changes needed? Indent fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #30434 from gaborgsomogyi/STAT-INDENT-FIX. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 13:36:45 -08:00
Chao Sun	6da8ade5f4	[SPARK-33045][SQL][FOLLOWUP] Fix build failure with Scala 2.13 ### What changes were proposed in this pull request? Explicitly convert `scala.collection.mutable.Buffer` to `Seq`. In Scala 2.13 `Seq` is an alias of `scala.collection.immutable.Seq` instead of `scala.collection.Seq`. ### Why are the changes needed? Without the change build with Scala 2.13 fails with the following: ``` [error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1417:41: type mismatch; [error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String] [error] required: Seq[org.apache.spark.unsafe.types.UTF8String] [error] case null => LikeAll(e, patterns) [error] ^ [error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1418:41: type mismatch; [error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String] [error] required: Seq[org.apache.spark.unsafe.types.UTF8String] [error] case _ => NotLikeAll(e, patterns) [error] ^ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30431 from sunchao/SPARK-33045-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 12:42:33 -08:00
gengjiaan	3695e997d5	[SPARK-33045][SQL] Support built-in function like_all and fix StackOverflowError issue ### What changes were proposed in this pull request? Spark already supports the `LIKE ALL` syntax, but it throws `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for LIKE ALL to fix this issue. Why can the stack overflow happen in the current approach? The current approach uses reduceLeft to connect each `Like(e, p)`, which makes the call depth of the thread too large and causes `StackOverflowError`. Why can the fix in this PR avoid the error? This PR supports a built-in function for `LIKE ALL` and so avoids this issue. ### Why are the changes needed? 1. Fix the `StackOverflowError` issue. 2. Support built-in function `like_all`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #29999 from beliefer/SPARK-33045-like_all. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 16:56:21 +00:00
ulysses	21b13506cd	[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row ### What changes were proposed in this pull request? Rename `CombineLimits` to `EliminateLimits` and add a check for whether the `Limit` child's max row <= limit. ### Why are the changes needed? In ad-hoc scenarios, we always add a limit to the query if the user has not set a specific limit value, but not every limit is necessary. A general negative example is ``` select count(*) from t limit 100000; ``` It will be great if we can eliminate limit at Spark side. Also, we ran a benchmark for this case ``` runBenchmark("Sort and Limit") { val N = 100000 val benchmark = new Benchmark("benchmark sort and limit", N) benchmark.addCase("TakeOrderedAndProject", 3) { _ => spark.range(N).toDF("c").repartition(200).sort("c").take(200000) } benchmark.addCase("Sort And Limit", 3) { _ => withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> "-1") { spark.range(N).toDF("c").repartition(200).sort("c").take(200000) } } benchmark.addCase("Sort", 3) { _ => spark.range(N).toDF("c").repartition(200).sort("c").collect() } benchmark.run() } ``` and the result is ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.15.6 Intel(R) Core(TM) i5-5257U CPU 2.70GHz benchmark sort and limit: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ TakeOrderedAndProject 1833 2259 382 0.1 18327.1 1.0X Sort And Limit 1417 1658 285 0.1 14167.5 1.3X Sort 1324 1484 225 0.1 13238.3 1.4X ``` It shows that it makes sense to replace `TakeOrderedAndProjectExec` with `Sort + Project`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30368 from ulysses-you/SPARK-33442. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 13:31:10 +00:00
allisonwang-db	a03c540cf7	[SPARK-33472][SQL] Adjust RemoveRedundantSorts rule order ### What changes were proposed in this pull request? This PR switched the order for the rule `RemoveRedundantSorts` and `EnsureRequirements` so that `EnsureRequirements` will be invoked before `RemoveRedundantSorts` to avoid IllegalArgumentException when instantiating PartitioningCollection. ### Why are the changes needed? `RemoveRedundantSorts` rule uses SparkPlan's `outputPartitioning` to check whether a sort node is redundant. Currently, it is added before `EnsureRequirements`. Since `PartitioningCollection` requires left and right partitioning to have the same number of partitions, which is not necessarily true before applying `EnsureRequirements`, the rule can fail with the following exception: ``` IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #30373 from allisonwang-db/sort-follow-up. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 13:29:01 +00:00
allisonwang-db	ef2638c3e3	[SPARK-33183][SQL][FOLLOW-UP] Update rule RemoveRedundantSorts config version ### What changes were proposed in this pull request? This PR is a follow-up to #30093 that updates the version of the config `spark.sql.execution.removeRedundantSorts` to 2.4.8. ### Why are the changes needed? To update the rule version, since it has been backported to 2.4 (#30194). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30420 from allisonwang-db/spark-33183-follow-up. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 00:12:22 -08:00
Dongjoon Hyun	d5e7bd0cc4	[SPARK-33483][INFRA][TESTS] Fix rat exclusion patterns and add a LICENSE ### What changes were proposed in this pull request? This PR fixes the RAT exclusion rule which originated from SPARK-1144 (Apache Spark 1.0) ### Why are the changes needed? This prevents situations like https://github.com/apache/spark/pull/30415. Currently, it misses the `catalog` directory due to the `.log` rule. ``` $ dev/check-license Could not find Apache license headers in the following files: !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI with the new rule. Closes #30418 from dongjoon-hyun/SPARK-RAT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 23:59:11 -08:00
Prakhar Jain	0b0fb70b09	[SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations ### What changes were proposed in this pull request? This pull request tries to normalize the SortOrder properly to prevent unnecessary sort operators. Currently the sameOrderExpressions are not normalized as part of AliasAwareOutputOrdering. Example: consider this join of three tables: """ \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id """. The plan for this looks like: (8) Project [t2id#1059L, id#1004L AS t3id#1060L] +- (8) SortMergeJoin [t2id#1059L], [id#1004L], Inner :- (5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0 <----------------------------- : +- (5) Project [id#1000L AS t2id#1059L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST ], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST ], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432] : +- (3) Range (0, 20, step=1, splits=2) +- (7) Sort [id#1004L ASC NULLS FIRST ], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443] +- *(6) Range (0, 30, step=1, splits=2) In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin. ### Why are the changes needed? To remove unneeded Sort operators. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #30302 from prakharjain09/SPARK-33400-sortorder. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 06:25:37 +00:00
Yuming Wang	014e1fbb3a	[SPARK-27421][SQL] Fix filter for int column and value class java.lang.String when pruning partition column ### What changes were proposed in this pull request? This pr fix filter for int column and value class java.lang.String when pruning partition column. How to reproduce this issue: ```scala spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET") spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test") spark.sql("SELECT * FROM test_view WHERE id = '0'").explain ``` ``` 20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test 20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String 20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0'] java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743) ``` ### Why are the changes needed? Fix bug. 
### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30380 from wangyum/SPARK-27421. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-19 14:01:42 +08:00
yangjie01	e3058ba17c	[SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports ### What changes were proposed in this pull request? This PR adds a new Scala compile arg to `pom.xml` to defend against new unused imports: - `-Ywarn-unused-import` for Scala 2.12 - `-Wconf:cat=unused-imports:e` for Scala 2.13 The other file changes remove all unused imports in Spark code ### Why are the changes needed? Clean up code and add a guarantee to defend against new unused imports ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30351 from LuciferYang/remove-imports-core-module. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-19 14:20:39 +09:00
Ryan Blue	66a76378cf	[SPARK-31255][SQL][FOLLOWUP] Add missing license headers ### What changes were proposed in this pull request? Add missing license headers for new files added in #28027. ### Why are the changes needed? To fix licenses. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a purely non-functional change. Closes #30415 from rdblue/license-headers. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 19:18:28 -08:00
Liang-Chi Hsieh	e518008ca9	[SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections ### What changes were proposed in this pull request? Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`. ### Why are the changes needed? Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30406 from viirya/SPARK-33473. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 18:58:06 -08:00
Liang-Chi Hsieh	97d2cee4af	[SPARK-33427][SQL][FOLLOWUP] Prevent test flakiness in SubExprEvaluationRuntimeSuite ### What changes were proposed in this pull request? This followup is to prevent possible test flakiness of `SubExprEvaluationRuntimeSuite`. ### Why are the changes needed? Because HashMap doesn't guarantee the order, in `proxyExpressions` the proxy expression id is not deterministic. So in `SubExprEvaluationRuntimeSuite` we should not test against it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30414 from viirya/SPARK-33427-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-11-18 18:35:11 -08:00
Gengliang Wang	9a4c79073b	[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode ### What changes were proposed in this pull request? In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types. ![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png) Comparing the ANSI CAST syntax rules with the current default behavior of Spark: ![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png) To make Spark's ANSI mode more ANSI SQL compatible, I propose to disallow the following casting in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` The following castings are considered invalid in the ANSI SQL standard, but they are quite straightforward. Let's allow them for now ``` Numeric <=> Boolean String <=> Binary ``` ### Why are the changes needed? Better ANSI SQL compliance ### Does this PR introduce _any_ user-facing change? Yes, the following castings will not be allowed in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` ### How was this patch tested? Unit test The ANSI Compliance doc preview: ![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png) Closes #30260 from gengliangwang/ansiCanCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-19 09:23:36 +09:00
Ryan Blue	1df69f7e32	[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 ### What changes were proposed in this pull request? This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns. The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used. ### Why are the changes needed? This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan. ### Does this PR introduce any user-facing change? Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns. ### How was this patch tested? Will include new unit tests. Closes #28027 from rdblue/add-dsv2-metadata-columns. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-11-18 14:07:51 -08:00
Chao Sun	27cd945c15	[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils ### What changes were proposed in this pull request? This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`: 1. Removes the extra `filterFun` from the listing API and combines it with the `filter`. 2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable. 3. Hides the `isRootLevel` flag from the top-level API. ### Why are the changes needed? Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests (e.g., `FileIndexSuite`) Closes #29959 from sunchao/hadoop-fs-utils-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-11-18 12:39:00 -08:00
Gengliang Wang	a180e02842	[SPARK-32852][SQL][DOC][FOLLOWUP] Revise the documentation of spark.sql.hive.metastore.jars ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/29881. It revises the documentation of the configuration `spark.sql.hive.metastore.jars`. ### Why are the changes needed? Fix grammatical error in the doc. Also, make it more clear that the configuration is effective only when `spark.sql.hive.metastore.jars` is set as `path` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc changes. Closes #30407 from gengliangwang/reviseJarPathDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-11-18 22:09:40 +08:00
Bryan Cutler	8e2a0bdce7	[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow ### What changes were proposed in this pull request? This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0. ### Why are the changes needed? MapType was previously unsupported with Arrow. ### Does this PR introduce _any_ user-facing change? Users can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs. ### How was this patch tested? Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs. Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh	7f3d99a8a5	[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc ### What changes were proposed in this pull request? This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They now allow a foldable string column instead of only a string literal. ### Why are the changes needed? The function docs of `schema_of_csv` and `schema_of_json` were not updated to match previous PRs. ### Does this PR introduce _any_ user-facing change? Yes, update user-facing doc. ### How was this patch tested? Unit test. Closes #30396 from viirya/minor-json-csv. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 11:32:27 +09:00
Liang-Chi Hsieh	928348408e	[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation ### What changes were proposed in this pull request? This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen was not able to work, for example complex schema. ### Why are the changes needed? Currently we only do subexpression elimination for codegen. For some reasons, we may need to run interpreted expression evaluation. For example, codegen fails to compile and fallbacks to interpreted mode, or complex input/output schema of expressions. It is commonly seen for complex schema from expressions that is possibly caused by the query optimizer too, e.g. SPARK-32945. We should also support subexpression elimination for interpreted evaluation. That could reduce performance difference when Spark fallbacks from codegen to interpreted expression evaluation, and improve Spark usability. #### Benchmark Update `SubExprEliminationBenchmark`: Before: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 24707 25688 903 0.0 247068775.9 1.0X ``` After: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 2360 2435 87 0.0 23604320.7 11.2X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Benchmark manually. 
Closes #30341 from viirya/SPARK-33427. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 14:29:37 +00:00
Yuming Wang	09bb9bedcd	[SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate has many values ### What changes were proposed in this pull request? We [rewrite](`5197c5d2e7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (L722-L724)`) `In`/`InSet` predicates to `or` expressions when pruning Hive partitions. That will cause a Hive metastore stack overflow if there are a lot of values. This PR rewrites the `InSet` predicate to `GreaterThanOrEqual` min value and `LessThanOrEqual` max value when pruning Hive partitions to avoid the Hive metastore stack overflow. From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000. ### Why are the changes needed? Avoid Hive metastore stack overflow when the `InSet` predicate has many values. DPP especially may generate many values. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #30325 from wangyum/SPARK-33416. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 13:47:01 +00:00
HyukjinKwon	e2c7bfce40	[SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default) ### What changes were proposed in this pull request? This PR proposes to simplify the exception messages from Python UDFS. Currently, the exception message from Python UDFs is as below: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File 
"<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Actually, almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% cases. This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` The trackback will be shown from the point when any non-PySpark file is seen in the traceback. ### Why are the changes needed? Without this configuration. such internal tracebacks are exposed to users directly especially for shall or notebook users in PySpark. 99% cases people don't care about the internal Python worker, (de)serialization and related tracebacks. It just makes the exception more difficult to read. For example, one statement of `x/0` above shows a very long traceback and most of them are unnecessary. This configuration enables the ability to show simplified tracebacks which users will likely be most interested in. ### Does this PR introduce _any_ user-facing change? By default, no. It adds one configuration that simplifies the exception message. See the example above. ### How was this patch tested? 
Manually tested: ```bash $ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true ``` ```python from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` and unittests were also added. Closes #30309 from HyukjinKwon/SPARK-33407. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-17 14:15:31 +09:00
Cheng Su	5af5aa146e	[SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite ### What changes were proposed in this pull request? This PR is a followup from https://github.com/apache/spark/pull/30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type. ### Why are the changes needed? Help reduce duplicated code and make it easier for developers to read and add code in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring). Closes #30347 from c21/stream-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-17 11:18:42 +09:00
Prakhar Jain	f5e3302840	[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes ### What changes were proposed in this pull request? This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases. Example: consider this join of three tables: \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id The plan for this looks like: (9) Project [t2id#1034L, id#1004L AS t3id#1035L] +- (9) SortMergeJoin [t1id#1033L], [id#1004L], Inner :- (6) Sort [t1id#1033L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343] <------------------------------ : +- (5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335] : +- (3) Range (0, 20, step=1, splits=2) +- (8) Sort [id#1004L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349] +- *(7) Range (0, 30, step=1, splits=2) In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project. ### Why are the changes needed? To remove unneeded exchanges. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT added. 
On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange. Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-17 10:35:43 +09:00
xuewei.linxuewei	b5eca18af0	[SPARK-33460][SQL] Accessing map values should fail if key is not found ### What changes were proposed in this pull request? Instead of returning NULL, map-access functions such as `element_at` and `GetMapValue` throw a runtime `NoSuchElementException` when the key is not found and ANSI mode is on. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and existing UT. Closes #30386 from leanken/leanken-SPARK-33460. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:14:31 +00:00
Max Gekk	6883f29465	[SPARK-33453][SQL][TESTS] Unify v1 and v2 SHOW PARTITIONS tests ### What changes were proposed in this pull request? 1. Move `SHOW PARTITIONS` parsing tests to `ShowPartitionsParserSuite` 2. Place Hive tests for `SHOW PARTITIONS` from `HiveCommandSuite` to the base test suite `v1.ShowPartitionsSuiteBase`. This will allow to run the tests w/ and w/o Hive. The changes follow the approach of https://github.com/apache/spark/pull/30287. ### Why are the changes needed? - The unification will allow to run common `SHOW PARTITIONS` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running: - new test suites `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"` - and old one `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCommandSuite"` Closes #30377 from MaxGekk/unify-dsv1_v2-show-partitions-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:11:42 +00:00
luluorta	dfa6fb46f4	[SPARK-33389][SQL] Make internal classes of SparkSession always using active SQLConf ### What changes were proposed in this pull request? This PR makes the internal classes of SparkSession always use the active SQLConf. We should remove all `conf: SQLConf`s from the ctor-parameters of these classes (`Analyzer`, `SparkPlanner`, `SessionCatalog`, `CatalogManager`, `SparkSqlParser`, etc.) and use `SQLConf.get` instead. ### Why are the changes needed? Code refinement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30299 from luluorta/SPARK-33389. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 15:27:18 +00:00
xuewei.linxuewei	aa508fcc03	[SPARK-33140][SQL][FOLLOW-UP] Revert code that not use passed-in SparkSession to get SQLConf ### What changes were proposed in this pull request? Revert the [SPARK-33140] changes in code that does not use the passed-in SparkSession to get SQLConf. [SPARK-33140] unified the places that took a passed-in SQLConf instance, or used a SparkSession to get SQLConf, to use SQLConf.get. In the code reverted by this patch, the passed-in SparkSession was used not to get SQLConf but to access its catalog, so it is better to keep using it for consistency. ### Why are the changes needed? Potential regression bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30364 from leanken/leanken-SPARK-33140. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 11:57:50 +00:00
Max Gekk	71a29b2eca	[MINOR][SQL][DOCS] Fix a reference to `spark.sql.sources.useV1SourceList` ### What changes were proposed in this pull request? Replace `spark.sql.sources.write.useV1SourceList` by `spark.sql.sources.useV1SourceList` in the comment for `CatalogManager.v2SessionCatalog()`. ### Why are the changes needed? To have correct comments. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #30385 from MaxGekk/fix-comment-useV1SourceList. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:57:20 +09:00
Liang-Chi Hsieh	10b011f837	[SPARK-33456][SQL][TEST][FOLLOWUP] Fix SUBEXPRESSION_ELIMINATION_ENABLED config name ### What changes were proposed in this pull request? To fix wrong config name in `subexp-elimination.sql`. ### Why are the changes needed? `CONFIG_DIM` should use config name's key. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30384 from viirya/SPARK-33456-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:53:31 +09:00
Yuming Wang	cdcbdaeb0d	[SPARK-33458][SQL] Hive partition pruning support Contains, StartsWith and EndsWith predicate ### What changes were proposed in this pull request? This PR adds support for Hive partition pruning on `Contains`, `StartsWith` and `EndsWith` predicates. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30383 from wangyum/SPARK-33458. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:18:13 +00:00
Max Gekk	4e5d2e0695	[SPARK-33394][SQL][TESTS] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()` ### What changes were proposed in this pull request? Throw `NoSuchNamespaceException` in `listTables()` of the custom test catalog `InMemoryTableCatalog` if the passed namespace doesn't exist. ### Why are the changes needed? 1. To align behavior of V2 `InMemoryTableCatalog` to V1 session catalog. 2. To distinguish two situations: 1. A namespace does exist but does not contain any tables. In that case, `listTables()` returns empty result. 2. A namespace does not exist. `listTables()` throws `NoSuchNamespaceException` in this case. ### Does this PR introduce _any_ user-facing change? Yes. For example, `SHOW TABLES` returns empty result before the changes. ### How was this patch tested? By running V1/V2 ShowTablesSuites. Closes #30358 from MaxGekk/show-tables-in-not-existing-namespace. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:08:21 +00:00
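The distinction drawn in SPARK-33394 between an empty namespace and a missing one can be illustrated with a small model. This is a minimal sketch in plain Python, not the Scala test catalog itself: the class `InMemoryCatalog`, its methods, and the Python exception class are hypothetical stand-ins chosen for illustration.

```python
class NoSuchNamespaceException(Exception):
    """Hypothetical stand-in for Spark's exception of the same name."""


class InMemoryCatalog:
    """Toy catalog: maps a namespace (tuple of name parts) to its table names."""

    def __init__(self):
        self._tables = {}

    def create_namespace(self, namespace):
        # Register the namespace without adding any tables to it.
        self._tables.setdefault(tuple(namespace), [])

    def create_table(self, namespace, name):
        self.create_namespace(namespace)
        self._tables[tuple(namespace)].append(name)

    def list_tables(self, namespace):
        key = tuple(namespace)
        if key not in self._tables:
            # Missing namespace: fail loudly instead of returning [].
            raise NoSuchNamespaceException(".".join(key))
        # Existing namespace: an empty list is a legitimate result.
        return list(self._tables[key])
```

With this shape, a caller can tell "no tables yet" apart from "no such namespace" without a separate existence check.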
Liang-Chi Hsieh	d4cf1483fd	[SPARK-33456][SQL][TEST] Add end-to-end test for subexpression elimination ### What changes were proposed in this pull request? This patch proposes to add end-to-end test for subexpression elimination. ### Why are the changes needed? We have subexpression elimination feature for expression evaluation but we don't have end-to-end tests for the feature. We should have one to make sure we don't break it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit tests. Closes #30381 from viirya/SPARK-33456. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 15:47:35 +09:00
artiship	1ae6d64b5f	[SPARK-33358][SQL] Return code when command process failed Exit the Spark SQL CLI processing loop if one of the commands (sub SQL statements) fails. This is a regression in Apache Spark 3.0.0. ``` $ cat 1.sql select * from nonexistent_table; select 2; ``` Apache Spark 2.4.7 ``` spark-2.4.7-bin-hadoop2.7:$ bin/spark-sql -f 1.sql 20/11/15 16:14:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Error in query: Table or view not found: nonexistent_table; line 1 pos 14 ``` Apache Spark 3.0.1 ``` $ bin/spark-sql -f 1.sql Error in query: Table or view not found: nonexistent_table; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [nonexistent_table] 2 Time taken: 2.786 seconds, Fetched 1 row(s) ``` Apache Hive 1.2.2 ``` apache-hive-1.2.2-bin:$ bin/hive -f 1.sql Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'nonexistent_table' ``` Yes. This is a fix for a regression. Pass the UT. Closes #30263 from artiship/SPARK-33358. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-15 16:57:12 -08:00
Liang-Chi Hsieh	eea846b895	[SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination ### What changes were proposed in this pull request? This patch adds a benchmark `SubExprEliminationBenchmark` for benchmarking subexpression elimination feature. ### Why are the changes needed? We need a benchmark for subexpression elimination feature for change such as #30341. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30379 from viirya/SPARK-33455. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 19:02:36 -08:00
luluorta	156704ba0d	[SPARK-33432][SQL] SQL parser should use active SQLConf ### What changes were proposed in this pull request? This PR makes the SQL parser use the active SQLConf instead of the one passed in as a ctor-parameter. ### Why are the changes needed? In ANSI mode, schema string parsing should fail if the schema uses an ANSI reserved keyword as an attribute name: ```scala spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > > == SQL == > time Timestamp > ^^^ But this query may accidentally succeed in certain cases because the DataType parser sticks to the configs of the first created session in the current thread: ```scala DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > +--------------------------------+ > \|from_json({"time":"26/10/2015"})\| > +--------------------------------+ > \| {2015-10-26 00:00...\| > +--------------------------------+ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New and updated UTs. Closes #30357 from luluorta/SPARK-33432. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 13:37:12 -08:00
artiship	34a9a77ab5	[SPARK-33396][SQL] Spark SQL CLI prints appliction id when process file ### What changes were proposed in this pull request? Modify SparkSQLCLIDriver.scala to call the cli.printMasterAndAppId method earlier, before processing the file. ### Why are the changes needed? SPARK-25043 already added printing of the application id, but the case of processing a file does not seem to have been covered. This small change makes spark-sql also print the application id when processing a file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? env ``` spark version: 3.0.1 os: centos 7 ``` /tmp/tmp.sql ```sql select 1; ``` submit command: ```sh export HADOOP_USER_NAME=my-hadoop-user bin/spark-sql \ --master yarn \ --deploy-mode client \ --queue my.queue.name \ --conf spark.driver.host=$(hostname -i) \ --conf spark.app.name=spark-test \ --name "spark-test" \ -f /tmp/tmp.sql ``` execution log: ```sh 20/11/09 23:18:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.rpc.server.address.use.ip does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.submit.timeout.interval does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.enforce.bucketing does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.run.timeout.seconds does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.support.sql11.reserved.keywords does not exist 20/11/09 23:18:40 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
20/11/09 23:18:41 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 20/11/09 23:18:42 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 20/11/09 23:18:52 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! Spark master: yarn, Application Id: application_1567136266901_27355775 1 1 Time taken: 4.974 seconds, Fetched 1 row(s) ``` Closes #30301 from artiship/SPARK-33396. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-14 20:54:17 +08:00
Liang-Chi Hsieh	0046222a75	[SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite ### What changes were proposed in this pull request? This is a simple followup to prevent test flakiness in SubexpressionEliminationSuite. If `getAllEquivalentExprs` returns more than one sequence, their order is not guaranteed because they come from a HashMap, so we should use `contains` instead of assuming the order of results. ### Why are the changes needed? Prevent test flakiness in SubexpressionEliminationSuite. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30371 from viirya/SPARK-33337-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-13 15:10:02 -08:00
xuewei.linxuewei	234711a328	Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession" ### What changes were proposed in this pull request? In [SPARK-33139] we marked `setActiveSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion we concluded that the unify view feature should work even without that PR; there is a risk only if users really abuse these two APIs. So that PR needs to be reverted. [SPARK-33139] has two commits, including a follow-up. This reverts them both. ### Why are the changes needed? Revert. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30367 from leanken/leanken-revert-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 13:35:45 +00:00
Kent Yao	cdd8e51742	[SPARK-33419][SQL] Unexpected behavior when using SET commands before a query in SparkSession.sql ### What changes were proposed in this pull request? SparkSession.sql converts a string value to a DataFrame, and the string value should be one single SQL statement ending up w/ or w/o one or more semicolons. e.g. ```sql scala> spark.sql(" select 2").show +---+ \| 2\| +---+ \| 2\| +---+ scala> spark.sql(" select 2;").show +---+ \| 2\| +---+ \| 2\| +---+ scala> spark.sql(" select 2;;;;").show +---+ \| 2\| +---+ \| 2\| +---+ ``` If we put 2 or more statements in, it fails in the parser as expected, e.g. ```sql scala> spark.sql(" select 2; select 1;").show org.apache.spark.sql.catalyst.parser.ParseException: extraneous input 'select' expecting {<EOF>, ';'}(line 1, pos 11) == SQL == select 2; select 1; -----------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) ... 47 elided ``` As a very generic user scenario, users may want to change some settings before they execute the queries. They may pass a string value like `set spark.sql.abc=2; select 1;` into this API, which creates a confusing gap between the actual effect and the user's expectations. 
The user may want the query to be executed with spark.sql.abc=2, but Spark actually treats the whole part of `2; select 1;` as the value of the property 'spark.sql.abc', e.g. ``` scala> spark.sql("set spark.sql.abc=2; select 1;").show +-------------+------------+ \| key\| value\| +-------------+------------+ \|spark.sql.abc\|2; select 1;\| +-------------+------------+ ``` What's more, the SET symbol could digest everything behind it, which makes it unstable from version to version, e.g. #### 3.1 ```sql scala> spark.sql("set;").show org.apache.spark.sql.catalyst.parser.ParseException: Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0) == SQL == set; ^^^ at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113) at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72) at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161) at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18) at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) ... 47 elided scala> spark.sql("set a;").show org.apache.spark.sql.catalyst.parser.ParseException: Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0) == SQL == set a; ^^^ at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113) at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72) at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161) at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18) at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) ... 47 elided ``` #### 2.4 ```sql scala> spark.sql("set;").show +---+-----------+ \|key\| value\| +---+-----------+ \| ;\|<undefined>\| +---+-----------+ scala> spark.sql("set a;").show +---+-----------+ \|key\| value\| +---+-----------+ \| a;\|<undefined>\| +---+-----------+ ``` In this PR, 1. make `set spark.sql.abc=2; select 1;` in `SparkSession.sql` fail directly, user should call `.sql` for each statement separately. 2. make the semicolon as the separator of statements, and if users want to use it as part of the property value, shall use quotes too. ### Why are the changes needed? 1. disambiguation for `SparkSession.sql` 2. make semicolon work same both w/ `SET` and other statements ### Does this PR introduce _any_ user-facing change? yes, the semicolon works as a separator of statements now, it will be trimmed if it is at the end of the statement and fail the statement if it is in the middle. you need to use quotes if you want it to be part of the property value ### How was this patch tested? new tests Closes #30332 from yaooqinn/SPARK-33419. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 06:58:16 +00:00
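The semicolon rule described in SPARK-33419 (trailing semicolons are trimmed, a semicolon in the middle fails the statement unless it is quoted) can be modelled roughly as follows. This is a minimal sketch in plain Python under simplifying assumptions: Spark implements the rule in its ANTLR grammar, whereas the function `single_statement` here is a hypothetical character scan that ignores escapes and comments.

```python
def single_statement(sql):
    """Return sql without trailing semicolons; reject an unquoted interior ';'."""
    text = sql.strip()
    # Trailing semicolons (possibly several) are simply trimmed.
    while text.endswith(";"):
        text = text[:-1].rstrip()

    quote = None  # the quote character we are currently inside, if any
    for ch in text:
        if quote is not None:
            if ch == quote:
                quote = None
        elif ch in "'\"`":
            quote = ch
        elif ch == ";":
            # An unquoted semicolon would start a second statement.
            raise ValueError("only one statement is allowed per call: " + sql)
    return text
```

Under this model, `set spark.sql.abc=2; select 1;` is rejected rather than silently storing `2; select 1;` as the property value, so the caller has to issue the two statements separately.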
ulysses	82a21d2a3e	[SPARK-33433][SQL] Change Aggregate max rows to 1 if grouping is empty ### What changes were proposed in this pull request? Change `Aggregate` max rows to 1 if grouping is empty. ### Why are the changes needed? If `Aggregate` grouping is empty, the result is always one row, so `LimitPushDown` does not need to push the limit down in a case such as ``` select count(*) from t1 union select count(*) from t2 limit 1 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30356 from ulysses-you/SPARK-33433. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-13 15:57:07 +09:00
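The reasoning in SPARK-33433 can be sketched as a row-count bound that an optimizer rule consults before pushing a limit down. This is a minimal model in plain Python, not Spark's Scala code: the `Aggregate` dataclass, `max_rows`, and `needs_limit_pushdown` are hypothetical names, and the non-empty-grouping case is assumed to inherit the child's bound.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Aggregate:
    grouping: Sequence[str]          # grouping expressions; empty for a global aggregate
    child_max_rows: Optional[int]    # upper bound on the child's rows, None if unknown

    def max_rows(self) -> Optional[int]:
        # A global aggregate (no grouping) always produces exactly one row.
        if not self.grouping:
            return 1
        # Otherwise there can be at most as many groups as input rows.
        return self.child_max_rows


def needs_limit_pushdown(plan: Aggregate, limit: int) -> bool:
    """Pushing a limit down only helps when the plan may exceed the limit."""
    bound = plan.max_rows()
    return bound is None or bound > limit
```

For the `count(*) ... limit 1` example in the commit message, each side of the union is a global aggregate, so its bound is 1 and no limit needs to be pushed below it.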
Max Gekk	539c2deb89	[SPARK-33426][SQL][TESTS] Unify Hive SHOW TABLES tests ### What changes were proposed in this pull request? 1. Create the separate test suite `org.apache.spark.sql.hive.execution.command.ShowTablesSuite`. 2. Re-use V1 SHOW TABLES tests added by https://github.com/apache/spark/pull/30287 in the Hive test suites. 3. Add new test case for the pattern `'table_name_1\|table_name_2'` in the common test suite. ### Why are the changes needed? To test V1 + common SHOW TABLES tests in Hive. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running v1/v2 and Hive v1 `ShowTablesSuite`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite" ``` Closes #30340 from MaxGekk/show-tables-hive-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 05:15:13 +00:00
Liang-Chi Hsieh	2c64b731ae	[SPARK-33259][SS] Disable streaming query with possible correctness issue by default ### What changes were proposed in this pull request? This patch proposes to disable streaming queries with a possible correctness issue in chained stateful operators. The behavior is controlled by a SQL config, so users who understand the risk and still want to run the query can disable the check. ### Why are the changes needed? The possible correctness issue in chained stateful operators in a streaming query is not obvious to users. From the user's perspective, it will be considered a Spark bug. In the worse case, users are not aware of the correctness issue and use wrong results. A better approach is to disable such queries by default and let users choose to run them if they understand the risk, instead of implicitly running the query and leaving users to find the correctness issue by themselves and report this known issue to the Spark community. ### Does this PR introduce _any_ user-facing change? Yes. Streaming queries with a possible correctness issue will be blocked from running unless users explicitly disable the SQL config. ### How was this patch tested? Unit test. Closes #30210 from viirya/SPARK-33259. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-12 15:31:57 -08:00
Chao Sun	cf3b6551ce	[SPARK-33435][SQL] DSv2: REFRESH TABLE should invalidate caches referencing the table ### What changes were proposed in this pull request? This changes `RefreshTableExec` in DSv2 to also invalidate caches with references to the target table to be refreshed. The change itself is similar to what's done in #30211. Note, though, that since we currently don't support caching a DSv2 table directly, this doesn't add recache logic as in the DSv1 impl. I marked it as a TODO for now. ### Why are the changes needed? Currently the behavior in DSv1 and DSv2 is inconsistent w.r.t. refreshing a table: in DSv1 we invalidate both the metadata cache and all table caches that are related to the table, but in DSv2 we only do the former. This addresses the issue and makes the behavior consistent. ### Does this PR introduce _any_ user-facing change? Yes, refreshing a v2 table now also invalidates all the related caches. ### How was this patch tested? Added a new UT. Closes #30359 from sunchao/SPARK-33435. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-12 15:22:56 -08:00
Linhong Liu	1baf0d5c9b	[SPARK-33140][SQL][FOLLOW-UP] change val to def in object rule ### What changes were proposed in this pull request? In #30097, many rules changed from case class to object, but if the rule is stateful, there will be a problem. For example, if an object rule uses a `val` to refer to a config, it will be unchanged after initialization even if other spark session uses a different config value. ### Why are the changes needed? Avoid potential bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #30354 from linhongliu-db/SPARK-33140-followup-2. Lead-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com> Co-authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-13 01:10:28 +09:00
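The `val` versus `def` problem described in this follow-up to SPARK-33140 has a close analogue in other languages: a value captured once when a singleton is initialized goes stale, while a value re-read on each access follows the active config. The sketch below shows that analogue in plain Python; `ActiveConf`, `StaleRule`, `FreshRule`, and the config key are hypothetical names, not Spark APIs.

```python
KEY = "spark.sql.example.threshold"  # hypothetical config key


class ActiveConf:
    """Toy stand-in for a globally accessible, mutable active config."""
    _values = {KEY: 10}

    @classmethod
    def get(cls, key):
        return cls._values[key]

    @classmethod
    def set(cls, key, value):
        cls._values[key] = value


class StaleRule:
    # Evaluated once, when the class body runs (like a `val` in a Scala object):
    # later config changes are never seen.
    threshold = ActiveConf.get(KEY)


class FreshRule:
    # Re-evaluated on every access (like a `def`): always reflects the active config.
    @property
    def threshold(self):
        return ActiveConf.get(KEY)
```

After `ActiveConf.set(KEY, 20)`, `StaleRule.threshold` still reports the value captured at initialization, while `FreshRule().threshold` reports the new one.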
gengjiaan	2f07c56810	[SPARK-33278][SQL] Improve the performance for FIRST_VALUE ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/29800 provides a performance improvement for `NTH_VALUE`. `FIRST_VALUE` also could use the `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame`. ### Why are the changes needed? Improve the performance for `FIRST_VALUE`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #30178 from beliefer/SPARK-33278. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-12 14:59:22 +00:00
ulysses	a3d2954662	[SPARK-33421][SQL] Support Greatest and Least in Expression Canonicalize ### What changes were proposed in this pull request? Add `Greatest` and `Least` check in `Canonicalize`. ### Why are the changes needed? The children of both `Greatest` and `Least` are order Irrelevant. Let's say we have `greatest(1, 2)` and `greatest(2, 1)`. We can get the same canonicalized expression in this case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30330 from ulysses-you/SPARK-33421. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-12 20:26:33 +09:00
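Because the children of `Greatest` and `Least` are order-irrelevant, canonicalization can put them into one fixed order so that `greatest(1, 2)` and `greatest(2, 1)` compare equal. The following is a minimal sketch in plain Python, assuming expressions are modelled as nested tuples `(op, child, ...)`; it orders children by `repr` as a deterministic stand-in and is not Spark's `Canonicalize` implementation.

```python
ORDER_IRRELEVANT = {"greatest", "least"}


def canonicalize(expr):
    """Return a canonical form of expr, a literal or a tuple (op, *children)."""
    if not isinstance(expr, tuple):
        return expr  # literals are already canonical
    op, *children = expr
    children = [canonicalize(child) for child in children]
    if op in ORDER_IRRELEVANT:
        # Any deterministic total order works; repr is used here for simplicity.
        children.sort(key=repr)
    return (op, *children)


def semantic_equals(left, right):
    """Two expressions are equivalent if their canonical forms are equal."""
    return canonicalize(left) == canonicalize(right)
```

Operators whose argument order matters are left untouched, so `("subtract", 1, 2)` and `("subtract", 2, 1)` remain distinct.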
xuewei.linxuewei	6d31daeb6a	[SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound ### What changes were proposed in this pull request? Instead of returning NULL, the `element_at`, `elt` and `GetArrayItem` functions throw a runtime ArrayIndexOutOfBoundsException on an out-of-bound index when ansiMode is enabled. ### Why are the changes needed? For ansiMode. ### Does this PR introduce any user-facing change? When `spark.sql.ansi.enabled` = true, Spark will throw `ArrayIndexOutOfBoundsException` when an array element is accessed with an out-of-range index. ### How was this patch tested? Added UT and existing UT. Closes #30297 from leanken/leanken-SPARK-33386. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-12 08:50:32 +00:00
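The behavioral switch in SPARK-33386 can be pictured with a small model of `element_at` on arrays. This is a sketch in plain Python under stated assumptions: the function below is a hypothetical stand-in, Python's `IndexError` plays the role of `ArrayIndexOutOfBoundsException`, and indexing is taken to be 1-based with negative indices counting from the end.

```python
def element_at(values, index, ansi_enabled=False):
    """Return the element at a 1-based index; negative indices count from the end."""
    if index == 0:
        raise ValueError("SQL array indices start at 1")
    # Translate the 1-based (or negative) SQL index to a 0-based position.
    position = index - 1 if index > 0 else len(values) + index
    if 0 <= position < len(values):
        return values[position]
    if ansi_enabled:
        # ANSI mode: an out-of-bound access is a runtime error.
        raise IndexError(f"invalid index {index} for array of {len(values)} elements")
    # Non-ANSI mode: an out-of-bound access yields NULL (None here).
    return None
```

The in-bound path is identical in both modes; only the out-of-bound result differs.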
Yuanjian Li	9f983a68f1	[SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods ### Why are the changes needed? Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test and Mima test. Closes #30344 from xuanyuanking/SPARK-30294-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-12 12:22:25 +09:00
Max Gekk	7e867298fe	[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for `date_trunc` ### What changes were proposed in this pull request? Updated results of `DateTimeBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| ### Why are the changes needed? The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By regenerating benchmark results: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark" ``` Closes #30338 from MaxGekk/fix-trunc_date-benchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-11 08:50:43 -08:00
stczwd	1eb236b936	[SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2 ### What changes were proposed in this pull request? This patch is trying to add `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API, defined in #28617. ### Does this PR introduce _any_ user-facing change? Yes. User can use `alter table add partition` or `alter table drop partition` to create/drop partition in V2Table. ### How was this patch tested? Run suites and fix old tests. Closes #29339 from stczwd/SPARK-32512-new. Lead-authored-by: stczwd <qcsd2011@163.com> Co-authored-by: Jacky Lee <qcsd2011@163.com> Co-authored-by: Jackey Lee <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-11 09:30:42 +00:00
Wenchen Fan	8760032f4f	[SPARK-33412][SQL] OverwriteByExpression should resolve its delete condition based on the table relation not the input query ### What changes were proposed in this pull request? Make a special case in `ResolveReferences`, which resolves `OverwriteByExpression`'s condition expression based on the table relation instead of the input query. ### Why are the changes needed? The condition expression is passed to the table implementation at the end, so we should resolve it using the table schema. Previously this worked because we have a hack in `ResolveReferences` to delay the resolution if `outputResolved == false`. However, this hack doesn't work for tables accepting any schema like https://github.com/delta-io/delta/pull/521 . We may wrongly resolve the delete condition using the input query's output columns, which don't match the table column names. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests and an updated test in v2 write. Closes #30318 from cloud-fan/v2-write. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-11 16:13:21 +09:00
Takeshi Yamamuro	4b367976a8	[SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TPCDSQueryBenchmark ### What changes were proposed in this pull request? This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use an option `--query-filter` for selecting TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But, the current master has a weird behaviour about the option. For example, if we pass `--query-filter q6` so as to run the TPCDS q6 only, `TPCDSQueryBenchmark` runs `q6` and `q6-v2.7` because the `filterQueries` method does not respect the name suffix. So, there is no way now to run the TPCDS q6 only. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked. Closes #30324 from maropu/FilterBugInTPCDSQueryBenchmark. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-11 15:24:05 +09:00
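The filter bug described in the commit above can be illustrated with a minimal sketch. It assumes the fix compares each query name together with its suffix against the `--query-filter` values; the object name, the `nameSuffix` parameter and the `-v2.7` suffix handling are illustrative assumptions, not the code from the patch.

```scala
object QueryFilterSketch {
  // Hypothetical helper: keep a query only if its suffixed name is listed
  // in the filter. An empty filter means "run everything".
  def filterQueries(
      origQueries: Seq[String],
      queryFilter: Set[String],
      nameSuffix: String = ""): Seq[String] = {
    if (queryFilter.isEmpty) origQueries
    else origQueries.filter(name => queryFilter.contains(s"$name$nameSuffix"))
  }

  def main(args: Array[String]): Unit = {
    val queries = Seq("q6", "q8", "q13")
    // Plain TPCDS list: `q6` is selected.
    println(filterQueries(queries, Set("q6")))
    // v2.7 list with the same filter: nothing is selected, because the
    // suffixed name `q6-v2.7` is not in the filter.
    println(filterQueries(queries, Set("q6"), nameSuffix = "-v2.7"))
  }
}
```

Comparing the suffixed name means `--query-filter q6` and `--query-filter q6-v2.7` would select different queries in this sketch, which is the behaviour the commit message asks for.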
Terry Kim	6d5d030957	[SPARK-33414][SQL] Migrate SHOW CREATE TABLE command to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `SHOW CREATE TABLE` works only with a v1 table and a permanent view, and not supported for v2 tables. ### Why are the changes needed? The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior: ```scala sql("CREATE TEMPORARY VIEW t AS SELECT 1") sql("CREATE DATABASE db") sql("CREATE TABLE t (key INT, value STRING) USING hive") sql("USE db") sql("SHOW CREATE TABLE t AS SERDE") // Succeeds ``` With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following: ``` org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883) at scala.Option.map(Option.scala:230) ``` , which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view. Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail. ### Does this PR introduce _any_ user-facing change? 
After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario. Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`. ### How was this patch tested? Updated existing tests. Closes #30321 from imback82/show_create_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-11 05:54:27 +00:00
Max Gekk	1e2eeda20e	[SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests ### What changes were proposed in this pull request? In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource specific tests to the `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`. ### Why are the changes needed? - The unification will allow to run common `SHOW TABLES` tests for both DSv1 and DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: - `org.apache.spark.sql.execution.command.v1.ShowTablesSuite` - `org.apache.spark.sql.execution.command.v2.ShowTablesSuite` - `ShowTablesParserSuite` Closes #30287 from MaxGekk/unify-dsv1_v2-tests. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-11 05:26:46 +00:00
ulysses	5197c5d2e7	[SPARK-33390][SQL] Make Literal support char array ### What changes were proposed in this pull request? Make `Literal` support `char[]`. ### Why are the changes needed? `Literal()` is the usual way to create a foldable value, and `char[]` is a common data type, so it is convenient to be able to create a string literal directly from a `char[]`. ### Does this PR introduce _any_ user-facing change? Yes, users can call `Literal()` with a `char[]`. ### How was this patch tested? Added a test. Closes #30295 from ulysses-you/SPARK-33390. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-11 11:39:11 +09:00
Utkarsh	46346943bb	[SPARK-33404][SQL] Fix incorrect results in `date_trunc` expression ### What changes were proposed in this pull request? The following query produces incorrect results: ``` SELECT date_trunc('minute', '1769-10-17 17:10:02') ``` Spark currently incorrectly returns ``` 1769-10-17 17:10:02 ``` against the expected return value of ``` 1769-10-17 17:10:00 ``` Steps to repro Run the following commands in spark-shell: ``` spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show() ``` This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`. This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`. This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added new tests to `DateTimeUtilsSuite` which previously failed and pass now. Closes #30303 from utkarsh39/trunc-timestamp-fix. Authored-by: Utkarsh <utkarsh.agarwal@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-11 09:28:59 +09:00
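Why a zone offset with second granularity breaks minute truncation can be reproduced with plain `java.time`, without Spark. The sketch below is not the `DateTimeUtils.truncTimestamp` code; the object and value names are illustrative. It relies on the JDK time zone data giving `America/Los_Angeles` a local mean time offset of -07:52:58 for dates before standard time was introduced.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.temporal.ChronoUnit

object DateTruncOffsetSketch {
  val zone: ZoneId = ZoneId.of("America/Los_Angeles")
  val local: LocalDateTime = LocalDateTime.of(1769, 10, 17, 17, 10, 2)
  private val zoned = local.atZone(zone)

  // Offset in seconds for that date; not a whole number of minutes.
  val offsetSeconds: Int = zoned.getOffset.getTotalSeconds

  // Zone-aware truncation: truncate the local wall-clock time.
  val zoneAware: LocalDateTime =
    zoned.truncatedTo(ChronoUnit.MINUTES).toLocalDateTime

  // Zone-unaware truncation: truncate the UTC instant, then render it in the zone.
  val zoneUnaware: LocalDateTime =
    zoned.toInstant.truncatedTo(ChronoUnit.MINUTES).atZone(zone).toLocalDateTime

  def main(args: Array[String]): Unit = {
    println(s"offset seconds: $offsetSeconds")
    println(s"zone-aware:     $zoneAware")
    println(s"zone-unaware:   $zoneUnaware")
  }
}
```

In UTC the timestamp already falls on a minute boundary, so truncating the instant changes nothing and the local seconds survive, which mirrors the incorrect `17:10:02` result reported in the commit.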
Liang-Chi Hsieh	6fa80ed1dd	[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions ### What changes were proposed in this pull request? Currently we skip subexpression elimination in branches of conditional expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions. ### Why are the changes needed? We may miss subexpression elimination chances in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from query optimizer. For example, project collapsing could embed expressions between two `Project`s and produces conditional expression like: ``` CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END ``` If `jsonToStruct(json)` is time-expensive expression, we don't eliminate the duplication and waste time on running it repeatedly now. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30245 from viirya/SPARK-33337. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-11-10 16:17:00 -08:00
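The idea of eliminating only subexpressions that are common to all branches can be sketched with a toy expression tree. The `Expr`, `Attr`, `Lit` and `Call` types and both helper functions are hypothetical stand-ins, not Catalyst classes; the sketch only shows the set-intersection step.

```scala
object CommonSubexprSketch {
  // Toy expression tree (assumption: stands in for Catalyst expressions).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: String) extends Expr
  final case class Call(fn: String, args: Seq[Expr]) extends Expr

  // All non-leaf subexpressions of `e`, including `e` itself.
  def subexprs(e: Expr): Set[Expr] = e match {
    case c @ Call(_, args) =>
      val nested: Set[Expr] = args.flatMap(subexprs).toSet
      nested + c
    case _ => Set.empty[Expr] // leaves are cheap, nothing to eliminate
  }

  // A subexpression is a candidate only if every branch contains it.
  def commonAcrossBranches(branches: Seq[Expr]): Set[Expr] =
    if (branches.isEmpty) Set.empty[Expr]
    else branches.map(subexprs).reduce(_ intersect _)

  val parsed: Expr = Call("jsonToStruct", Seq(Attr("json")))
  val fieldA: Expr = Call("getField", Seq(parsed, Lit("a")))

  def main(args: Array[String]): Unit = {
    val branches = Seq(
      Call("=", Seq(fieldA, Lit("1"))),
      Call("=", Seq(fieldA, Lit("2"))))
    // Both jsonToStruct(json) and the field access appear in every branch.
    println(commonAcrossBranches(branches))
  }
}
```

Taking the intersection rather than the union keeps the sketch conservative: an expression that appears in only one branch is never hoisted, so it is not evaluated when that branch is skipped.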
Chao Sun	3165ca742a	[SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader ### What changes were proposed in this pull request? This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader` in Hive module. ### Why are the changes needed? Currently, when initializing `IsolatedClientLoader`, users can set the `sharesHadoopClasses` flag to decide whether the `HiveClient` created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies. There are two reasons to remove this: 1. this feature is currently used in two cases: 1) unit tests, 2) when the Hadoop version defined in Maven can not be found when `spark.sql.hive.metastore.jars` is equal to "maven", which could be very rare. 2. when `sharesHadoopClasses` is false, Spark doesn't really only use Hadoop classes from Hive jars: we also download `hadoop-client` jar and put all the sub-module jars (e.g., `hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the Hadoop version used by `hadoop-client` is the same version used by Spark itself. As result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the `IsolatedClientLoader` is still using Hadoop 2.x or even lower. ### Does this PR introduce _any_ user-facing change? This affects Spark users in one scenario: when `spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in pom file cannot be downloaded, currently the behavior is to switch to _not_ share Hadoop classes, but with the PR it will share Hadoop classes with Spark. ### How was this patch tested? Existing UTs. Closes #30284 from sunchao/SPARK-33376. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 15:41:04 +00:00
angerszhu	34f5e7ce77	[SPARK-33302][SQL] Push down filters through Expand ### What changes were proposed in this pull request? Push down filter through expand. For case below: ``` create table t1(pid int, uid int, sid int, dt date, suid int) using parquet; create table t2(pid int, vs int, uid int, csid int) using parquet; SELECT years, appversion, SUM(uusers) AS users FROM (SELECT Date_trunc('year', dt) AS years, CASE WHEN h.pid = 3 THEN 'iOS' WHEN h.pid = 4 THEN 'Android' ELSE 'Other' END AS viewport, h.vs AS appversion, Count(DISTINCT u.uid) AS uusers ,Count(DISTINCT u.suid) AS srcusers FROM t1 u join t2 h ON h.uid = u.uid GROUP BY 1, 2, 3) AS a WHERE viewport = 'iOS' GROUP BY 1, 2 ``` Plan. before this pr: ``` == Physical Plan == (5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)]) +- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251] +- (4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)]) +- (4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246] +- (3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)]) +- (3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' 
WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241] +- (2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[]) +- (2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS) +- (2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44] +- (2) Project [uid#7, dt#9, suid#10, pid#11, vs#12] +- (2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight :- (2) Project [uid#7, dt#9, suid#10] : +- (2) Filter isnotnull(uid#7) : +- (2) ColumnarToRow : +- FileScan parquet default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date,suid:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233] +- (1) Project [pid#11, vs#12, uid#13] +- (1) Filter isnotnull(uid#13) +- (1) ColumnarToRow +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [isnotnull(uid#13)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: 
struct<pid:int,vs:int,uid:int> ``` Plan. after. this pr. : ``` == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[years#0, appversion#2], functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L]) +- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71] +- HashAggregate(keys=[years#0, appversion#2], functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2, sum#22L]) +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[count(distinct uid#7)], output=[years#0, appversion#2, uusers#3L]) +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true, [id=#67] +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[partial_count(distinct uid#7)], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, count#27L]) +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7]) +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7, 5), true, [id=#63] +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), 
Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7]) +- Project [uid#7, dt#9, pid#11, vs#12] +- BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight, false :- Filter isnotnull(uid#7) : +- FileScan parquet default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, false] as bigint)),false), [id=#58] +- Filter ((CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND isnotnull(uid#13)) +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS), isnotnull..., Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int> ``` ### Why are the changes needed? Improve performance, filter more data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #30278 from AngersZhuuuu/SPARK-33302. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 14:40:24 +00:00
Chao Sun	4934da56bc	[SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache ### What changes were proposed in this pull request? This changes `DropTableExec` to also invalidate caches referencing the table to be dropped, in a cascading manner. ### Why are the changes needed? In DSv1, the `DROP TABLE` command also invalidates caches, as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). In DSv2, however, the same command only drops the table and does not handle the caches. This could lead to correctness issues. ### Does this PR introduce _any_ user-facing change? Yes. The DSv2 `DROP TABLE` command now also invalidates the cache. ### How was this patch tested? Added a new UT. Closes #30211 from sunchao/SPARK-33305. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 14:37:42 +00:00
xuewei.linxuewei	e3a768dd79	[SPARK-33391][SQL] element_at with CreateArray not respect one based index ### What changes were proposed in this pull request? `element_at` with `CreateArray` does not respect the one-based index. Repro steps: ``` var df = spark.sql("select element_at(array(3, 2, 1), 0)") df.printSchema() df = spark.sql("select element_at(array(3, 2, 1), 1)") df.printSchema() df = spark.sql("select element_at(array(3, 2, 1), 2)") df.printSchema() df = spark.sql("select element_at(array(3, 2, 1), 3)") df.printSchema() root – element_at(array(3, 2, 1), 0): integer (nullable = false) root – element_at(array(3, 2, 1), 1): integer (nullable = false) root – element_at(array(3, 2, 1), 2): integer (nullable = false) root – element_at(array(3, 2, 1), 3): integer (nullable = true) correct answer should be 0 true which is outOfBounds return default true. 1 false 2 false 3 false ``` Expression evaluation respects the one-based index, but the nullability check computes with a zero-based index using `computeNullabilityFromArray`. ### Why are the changes needed? Correctness issue. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and existing UT. Closes #30296 from leanken/leanken-SPARK-33391. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 07:23:47 +00:00
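The off-by-one in the nullability check can be reproduced with a small sketch. Both functions below are hypothetical and are not Spark's `computeNullabilityFromArray`. The sketch assumes that an out-of-bounds index yields a nullable result and that negative indices count from the end of the array.

```scala
object ElementAtNullabilitySketch {
  // Buggy variant: treats the SQL index as if it were zero-based.
  def nullableZeroBased(elemNullable: Seq[Boolean], index: Int): Boolean =
    if (index >= 0 && index < elemNullable.length) elemNullable(index) else true

  // Fixed variant: converts the one-based index to a position first.
  def nullableOneBased(elemNullable: Seq[Boolean], index: Int): Boolean = {
    val n = elemNullable.length
    val position =
      if (index > 0 && index <= n) Some(index - 1)       // 1..n counts from the start
      else if (index < 0 && -index <= n) Some(n + index) // -1..-n counts from the end
      else None                                          // 0 or out of bounds
    position.map(elemNullable).getOrElse(true)
  }

  def main(args: Array[String]): Unit = {
    val arr = Seq(false, false, false) // array(3, 2, 1): no nullable elements
    for (i <- 0 to 3) {
      println(s"index $i: zero-based check=${nullableZeroBased(arr, i)}, " +
        s"one-based check=${nullableOneBased(arr, i)}")
    }
  }
}
```

For indices 0 through 3 the zero-based check reproduces the reported schemas (only index 3 nullable), while the one-based check gives the expected answer from the commit message (only index 0 nullable).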
Yuanjian Li	ad02ceda29	[SPARK-33244][SQL] Unify the code paths for spark.table and spark.read.table ### What changes were proposed in this pull request? - Call `spark.read.table` in `spark.table`. - Add comments for `spark.table` to emphasize that it also supports reading streaming temp views. ### Why are the changes needed? The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken in SPARK-32592 since we need to respect options in the `spark.read.table` API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT. Closes #30148 from xuanyuanking/SPARK-33244. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 05:46:45 +00:00
Terry Kim	90f6f39e42	[SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `LOAD DATA` is not supported for v2 tables. ### Why are the changes needed? The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior: ```scala sql("CREATE TEMPORARY VIEW t AS SELECT 1") sql("CREATE DATABASE db") sql("CREATE TABLE t (key INT, value STRING) USING hive") sql("USE db") sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds ``` With this change, `LOAD DATA` above fails with the following: ``` org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865) at scala.Option.foreach(Option.scala:407) ``` , which is expected since temporary view is resolved first and `LOAD DATA` doesn't support a temporary view. ### Does this PR introduce _any_ user-facing change? After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario. ### How was this patch tested? Updated existing tests. Closes #30270 from imback82/load_data_cmd. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 05:28:06 +00:00
Gengliang Wang	a1f84d8714	[SPARK-33369][SQL] DSV2: Skip schema inference in write if table provider supports external metadata ### What changes were proposed in this pull request? When TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in `DataframeWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference. ### Why are the changes needed? For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataframeWriter.save()`/`DataStreamWriter.start()`. The inference of table schema/partitioning can be expensive. However, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on `DataframeWriter.save()`/`DataStreamWriter.start()`. We can resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`. ### Does this PR introduce _any_ user-facing change? Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true. ### How was this patch tested? Unit test Closes #30273 from gengliangwang/supportsExternalMetadata. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-10 04:43:32 +00:00
Gabor Somogyi	4ac8133866	[SPARK-33223][SS][UI] Structured Streaming Web UI state information ### What changes were proposed in this pull request? The Structured Streaming UI does not contain state information. This PR adds it. ### Why are the changes needed? Missing state information. ### Does this PR introduce _any_ user-facing change? Additional UI elements appear. ### How was this patch tested? Existing unit tests + manual test. <img width="1044" alt="Screenshot 2020-10-30 at 15 14 21" src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png"> Closes #30151 from gaborgsomogyi/SPARK-33223. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-10 11:22:35 +09:00
Peter Toth	84dc374611	[SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls ### What changes were proposed in this pull request? This PR modifies the `ExtractPythonUDFs` rule to deduplicate deterministic PythonUDF calls. Before this PR the dataframe: `df.withColumn("c", batchedPythonUDF(col("a"))).withColumn("d", col("c"))` has the plan: ``` (1) Project [value#1 AS a#4, pythonUDF1#15 AS c#7, pythonUDF1#15 AS d#10] +- BatchEvalPython [dummyUDF(value#1), dummyUDF(value#1)], [pythonUDF0#14, pythonUDF1#15] +- LocalTableScan [value#1] ``` After this PR the deterministic PythonUDF calls are deduplicated: ``` (1) Project [value#1 AS a#4, pythonUDF0#14 AS c#7, pythonUDF0#14 AS d#10] +- BatchEvalPython [dummyUDF(value#1)], [pythonUDF0#14] +- LocalTableScan [value#1] ``` ### Why are the changes needed? To fix a performance issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New and existing UTs. Closes #30203 from peter-toth/SPARK-33303-deduplicate-deterministic-udf-calls. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-09 19:27:36 +09:00
Linhong Liu	4e1c89400d	[SPARK-33140][SQL][FOLLOW-UP] Use sparkSession in AQE context when applying rules ### What changes were proposed in this pull request? After #30097, all rules are using `SparkSession.active` to get `SQLConf` and `SparkSession`. But in AQE, when applying the rules for the initial plan, we should use the spark session in AQE context. ### Why are the changes needed? Fix potential problem caused by using the wrong spark session ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing ut Closes #30294 from linhongliu-db/SPARK-33140-followup. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-09 09:44:58 +00:00
Yuming Wang	7a5647a93a	[SPARK-33385][SQL] Support bucket pruning for IsNaN ### What changes were proposed in this pull request? This PR adds support for bucket pruning on the `IsNaN` predicate. ### Why are the changes needed? To improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30291 from wangyum/SPARK-33385. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-09 09:20:31 +00:00
Yuming Wang	69799c514f	[SPARK-33372][SQL] Fix InSet bucket pruning ### What changes were proposed in this pull request? This PR fixes `InSet` bucket pruning, because its values should not be `Literal`s: `cbd3fdea62/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L253-L255)` ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and manual test: ```scala spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t") spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show ``` Before this PR \| After this PR -- \| -- ![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) \| ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png) Closes #30279 from wangyum/SPARK-33372. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-09 08:32:51 +00:00
Wenchen Fan	98730b7ee2	[SPARK-33087][SQL] DataFrameWriterV2 should delegate table resolution to the analyzer ### What changes were proposed in this pull request? This PR makes `DataFrameWriterV2` to create query plans with `UnresolvedRelation` and leave the table resolution work to the analyzer. ### Why are the changes needed? Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes. ### Does this PR introduce _any_ user-facing change? Yes. 1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT. 2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed. 3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed. 4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command) ### How was this patch tested? new tests Closes #29970 from cloud-fan/refactor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-09 08:08:00 +00:00
Huaxin Gao	bfb257f078	[SPARK-32405][SQL] Apply table options while creating tables in JDBC Table Catalog ### What changes were proposed in this pull request? Currently in JDBCTableCatalog, we ignore the table options when creating a table. ``` // TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog if (!properties.isEmpty) { logWarning("Cannot create JDBC table with properties, these properties will be " + "ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]")) } ``` ### Why are the changes needed? We need to apply the table options when creating a table. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. Closes #30154 from huaxingao/table_options. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-09 07:02:14 +00:00
Liang-Chi Hsieh	c269b53f07	[SPARK-33384][SS] Delete temporary file when cancelling writing to final path even underlying stream throwing error ### What changes were proposed in this pull request? In `RenameBasedFSDataOutputStream.cancel`, we do two things in a single try/catch block: close the underlying stream and delete the temporary file. Closing the `OutputStream` can throw an `IOException`, in which case the temporary file is never deleted. This patch deletes the temporary file even if closing the underlying stream throws an error. ### Why are the changes needed? To avoid leaving temporary files behind when canceling a write in `RenameBasedFSDataOutputStream`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30290 from viirya/SPARK-33384. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-08 18:44:26 -08:00
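The cleanup pattern described in SPARK-33384 can be sketched in plain Scala as follows. This is a minimal illustration, not Spark's `RenameBasedFSDataOutputStream`: the object `CancelSketch`, its `cancel` signature, and the use of `java.nio.file` instead of a Hadoop file system are all assumptions made for the example.

```scala
import java.io.{IOException, OutputStream}
import java.nio.file.{Files, Path}

// Hypothetical helper for illustration only; not Spark's actual class.
object CancelSketch {
  /** Closes `out` and removes `tempFile`; the delete runs even if close() throws. */
  def cancel(out: OutputStream, tempFile: Path): Unit = {
    try {
      out.close()
    } catch {
      case e: IOException =>
        // A real implementation would log this; the sketch just reports it.
        System.err.println(s"Error closing stream: ${e.getMessage}")
    } finally {
      // Runs whether or not close() failed, so no temporary file is left behind.
      Files.deleteIfExists(tempFile)
    }
  }
}
```

Putting the delete in `finally` rather than after `close()` in the same `try` body is what keeps a failing close from skipping the cleanup.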
yangjie01	02fd52cfbc	[SPARK-33352][CORE][SQL][SS][MLLIB][AVRO][K8S] Fix procedure-like declaration compilation warnings in Scala 2.13 ### What changes were proposed in this pull request? There are two similar compilation warnings about procedure-like declaration in Scala 2.13: ``` [WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition ``` and ``` [WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type ``` this pr is the first part to resolve SPARK-33352： - For constructors method definition add `=` to convert to function syntax - For without `return type` methods definition add `: Unit =` to convert to function syntax ### Why are the changes needed? Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-08 12:51:48 -06:00
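The two rewrites listed in SPARK-33352 look roughly like this. The class `Counter` is hypothetical and exists only to show the syntax; the "before" forms appear as comments because they trigger the deprecation warnings quoted in the commit message.

```scala
// Hypothetical class used only to illustrate the syntax change.
class Counter(start: Int) {
  private var n: Int = start

  // Before: def this() { this(0) }      (procedure syntax, deprecated for constructors)
  def this() = this(0)

  // Before: def increment() { n += 1 }  (procedure syntax, deprecated)
  def increment(): Unit = { n += 1 }

  def value: Int = n
}
```

Both forms compile under Scala 2.12 as well, which matches the compatibility claim in the commit message.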
Hannah Amundson	1090b1b00a	[SPARK-32860][DOCS][SQL] Updating documentation about map support in Encoders ### What changes were proposed in this pull request? Javadocs updated for the encoder to include maps as a collection type ### Why are the changes needed? The javadocs were not updated with fix SPARK-16706 ### Does this PR introduce _any_ user-facing change? Yes, the javadocs are updated ### How was this patch tested? sbt was run to ensure it meets scalastyle Closes #30274 from hannahkamundson/SPARK-32860. Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com> Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-08 20:29:24 +09:00
Stuart White	09fa7ecae1	[SPARK-33291][SQL] Improve DataFrame.show for nulls in arrays and structs ### What changes were proposed in this pull request? The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit. Here's sample code to illustrate the behavior I'd like to clean up: ```scala val rows = Seq[String](null) .toDF("value") .withColumn("struct1", struct('value as "value1")) .withColumn("struct2", struct('value as "value1", 'value as "value2")) .withColumn("array1", array('value)) .withColumn("array2", array('value, 'value)) // Show the DataFrame using the "first" codepath. rows.show(truncate=false) +-----+-------+-------------+------+--------+ \|value\|struct1\|struct2 \|array1\|array2 \| +-----+-------+-------------+------+--------+ \|null \|{ null}\|{ null, null}\|[] \|[, null]\| +-----+-------+-------------+------+--------+ // Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path: rows.write.parquet("rows") spark.read.parquet("rows").show(truncate=false) +-----+-------+-------------+-------+-------------+ \|value\|struct1\|struct2 \|array1 \|array2 \| +-----+-------+-------------+-------+-------------+ \|null \|{ null}\|{ null, null}\|[ null]\|[ null, null]\| +-----+-------+-------------+-------+-------------+ ``` Notice: 1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}"). I think it's preferable to print it without the leading space (e.g. "\{null\}"). This is consistent with how non-null values are printed inside a struct. 2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space. I think both code paths should be consistent and print it without a leading space (e.g. "[null]"). 
The desired result of this PR is to produce the following output via both code paths: ``` +-----+-------+------------+------+------------+ \|value\|struct1\|struct2 \|array1\|array2 \| +-----+-------+------------+------+------------+ \|null \|{null} \|{null, null}\|[null]\|[null, null]\| +-----+-------+------------+------+------------+ ``` This contribution is my original work and I license the work to the project under the project’s open source license. ### Why are the changes needed? To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs. ### Does this PR introduce _any_ user-facing change? Yes. This PR changes what is printed out by DataFrame.show(). ### How was this patch tested? I added new test cases in CastSuite.scala to cover the cases addressed by this PR. Closes #30189 from stwhit/show_nulls. Authored-by: Stuart White <stuart.white1@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-11-06 13:12:35 -08:00
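The formatting rule requested in SPARK-33291 (nulls rendered as `null` with no leading space, in both arrays and structs) can be sketched without Spark. The object `ShowNulls` and its `render` function are assumptions for illustration; Scala sequences stand in for arrays and tuples stand in for structs, and this is not the `Cast` code the PR changes.

```scala
// Hypothetical renderer for illustration; not Spark's Cast implementation.
object ShowNulls {
  def render(v: Any): String = v match {
    case null       => "null"
    // Sequences stand in for SQL arrays: [a, b, c]
    case s: Seq[_]  => s.map(render).mkString("[", ", ", "]")
    // Tuples (any Product) stand in for SQL structs: {a, b, c}
    case p: Product => p.productIterator.map(render).mkString("{", ", ", "}")
    case other      => other.toString
  }
}
```

Because every element goes through the same `render` call and the separator is supplied only by `mkString`, a null in first position is treated exactly like a null anywhere else.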
Terry Kim	68c032c246	[SPARK-33364][SQL] Introduce the "purge" option in TableCatalog.dropTable for v2 catalog ### What changes were proposed in this pull request? This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed. Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110 ### Why are the changes needed? Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs. ### Does this PR introduce _any_ user-facing change? This PR introduces a new API in `TableCatalog`. ### How was this patch tested? Added a test. Closes #30267 from imback82/purge_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-05 22:00:45 -08:00
Prashant Sharma	733a468726	[SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect) ### What changes were proposed in this pull request? Override the default SQL strings for: ALTER TABLE RENAME COLUMN ALTER TABLE UPDATE COLUMN NULLABILITY in the following MsSQLServer JDBC dialect according to official documentation. Write MsSqlServer integration tests for JDBC. ### Why are the changes needed? To add the support for alter table when interacting with MSSql Server. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? added tests Closes #30038 from ScrapCodes/mssql-dialect. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-06 05:46:38 +00:00
Wenchen Fan	d16311051d	[SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments ### What changes were proposed in this pull request? 1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions. 2. Refine and add more comments. 3. Remove `isRelative` as it's useless. ### Why are the changes needed? code refinement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #30261 from cloud-fan/window. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-06 05:20:25 +00:00
Dongjoon Hyun	90f35c663e	[MINOR][SQL] Fix incorrect JIRA ID comments in Analyzer ### What changes were proposed in this pull request? This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by SPARK-31670 (https://github.com/apache/spark/pull/28490) ```scala - // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations + // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations ``` ### Why are the changes needed? Fix the wrong information. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a comment change. Manually review. Closes #30269 from dongjoon-hyun/SPARK-31670-MINOR. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-06 12:46:26 +09:00
Wenchen Fan	cd4e3d3b0c	[SPARK-33360][SQL] Simplify DS v2 write resolution ### What changes were proposed in this pull request? Removing duplicated code in `ResolveOutputRelation`, by adding `V2WriteCommand.withNewQuery` ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #30264 from cloud-fan/ds-minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-05 15:44:04 -08:00
Wenchen Fan	26ea417b14	[SPARK-33362][SQL] skipSchemaResolution should still require query to be resolved ### What changes were proposed in this pull request? Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved. ### Why are the changes needed? To prevent potential bugs where resolution of the input query is skipped. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new test Closes #30265 from cloud-fan/ds-minor-2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-05 09:23:41 -08:00
Jungtaek Lim (HeartSaVioR)	21413b7dd4	[SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore ### What changes were proposed in this pull request? There's a concept of 'read-only' and 'read+write' state store in Spark which is defined "implicitly". Spark doesn't prevent writes to a 'read-only' state store; Spark just assumes a read-only stateful operator will not modify the state store. Given it's not defined explicitly, the state store instance has to be implemented as 'read+write' even if it's being used as 'read-only', which sometimes causes confusion. For example, abort() in HDFSBackedStateStore - `d38f816748/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (L143-L155)` The comment sounds as if the statement works differently between 'read-only' and 'read+write', but that's not true, as both state stores have their state initialized as UPDATING (no difference). So a 'read-only' state store also creates the temporary file, initializes output streams to write to the temporary file, closes the output streams, and finally deletes the temporary file. These unnecessary operations are performed per batch/partition. This patch explicitly defines a 'read-only' StateStore, and enables a state store provider to create a 'read-only' StateStore instance if requested. Relevant code paths are modified, and a 'read-only' StateStore implementation for HDFSBackedStateStore is introduced. The new implementation gets rid of the unnecessary operations explained above. From a backward-compatibility point of view, the only thing changed on the public API side is `StateStoreProvider`. The trait `StateStoreProvider` has to be changed to allow requesting a 'read-only' StateStore; this patch adds a default implementation which leverages the 'read+write' StateStore but wraps it in a 'write-protected' StateStore instance, so that custom providers don't need to change their code to reflect the change. 
But if the providers can optimize for read-only workloads, they'll be happy to make the change. Please note that this patch makes ReadOnlyStateStore extend StateStore and be referred to as StateStore, as StateStore is used in so many places and it's not easy to support both traits if we differentiate them. So unfortunately the write methods are still exposed for read-only state; they just throw UnsupportedOperationException. ### Why are the changes needed? The new API opens the chance to optimize a read-only state store instance compared with a read+write state store instance. HDFSBackedStateStoreProvider is modified to provide a read-only version of the state store which doesn't deal with the temporary file or the state machine. ### Does this PR introduce any user-facing change? Clearly "no" for most end users, and also "no" for custom state store providers, as it doesn't touch the trait `StateStore` and provides a default implementation for the added method in the trait `StateStoreProvider`. ### How was this patch tested? Modified UT. Existing UTs ensure the change doesn't break anything. Closes #26935 from HeartSaVioR/SPARK-30294. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-05 18:21:17 +09:00
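The 'write-protected' wrapper idea from SPARK-30294 can be sketched with a toy key-value interface. Everything named here (`SimpleStore`, `InMemoryStore`, `WriteProtectedStore`) is hypothetical and much smaller than Spark's `StateStore` trait; the sketch only shows a read+write store being exposed through a wrapper whose write methods throw `UnsupportedOperationException`.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a state store interface.
trait SimpleStore {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
}

// A plain read+write store backed by an in-memory map.
class InMemoryStore extends SimpleStore {
  private val data = mutable.Map.empty[String, String]
  def get(key: String): Option[String] = data.get(key)
  def put(key: String, value: String): Unit = data.update(key, value)
}

// Wraps any store: reads are delegated, writes are rejected.
class WriteProtectedStore(underlying: SimpleStore) extends SimpleStore {
  def get(key: String): Option[String] = underlying.get(key)
  def put(key: String, value: String): Unit =
    throw new UnsupportedOperationException("Cannot put on a read-only store")
}
```

A provider that has no cheaper read-only implementation can hand out `new WriteProtectedStore(store)` as its default, which is the role the commit message gives to the default method on `StateStoreProvider`.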
HyukjinKwon	d530ed0ea8	Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends" This reverts commit `b8a440f098`.	2020-11-05 16:15:17 +09:00
Dongjoon Hyun	42c0b175ce	[SPARK-33338][SQL] GROUP BY using literal map should not fail ### What changes were proposed in this pull request? This PR aims to fix `semanticEquals` works correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`. ### Why are the changes needed? This is a regression from Apache Spark 1.6.x. ```scala scala> sc.version res1: String = 1.6.3 scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show +---+ \|_c0\| +---+ \| v1\| +---+ ``` Apache Spark 2.x ~ 3.0.1 raise`RuntimeException` for the following queries. ```sql CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k SELECT map('k1', 'v1')[k] FROM t GROUP BY 1 SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k] SELECT map('k1', 'v1')[k] a FROM t GROUP BY a ``` BEFORE ```scala Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) ``` AFTER ```sql spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1; v1 Time taken: 1.278 seconds, Fetched 1 row(s) spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]; v1 Time taken: 0.313 seconds, Fetched 1 row(s) spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a; v1 Time taken: 0.265 seconds, Fetched 1 row(s) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #30246 from dongjoon-hyun/SPARK-33338. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-04 08:35:10 -08:00
Erik Krogen	ff724d23b6	[SPARK-33214][TEST][HIVE] Stop HiveExternalCatalogVersionsSuite from using a hard-coded location to store localized Spark binaries ### What changes were proposed in this pull request? This PR changes `HiveExternalCatalogVersionsSuite` to, by default, use a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new System property, `spark.test.cache-dir`, which can be used to define a static location into which the Spark binary will be localized to allow for sharing between test executions. If the System property is used, the downloaded binaries won't be deleted after the test runs. ### Why are the changes needed? In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions: ``` // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to // avoid downloading Spark of different versions in each run. private val sparkTestingDir = new File("/tmp/test-spark") ``` However this doesn't work, since it gets deleted every time: ``` override def afterAll(): Unit = { try { Utils.deleteRecursively(wareHousePath) Utils.deleteRecursively(tmpDataDir) Utils.deleteRecursively(sparkTestingDir) } finally { super.afterAll() } } ``` It's bad that we're hard-coding to a `/tmp` directory, as in some cases this is not the proper place to store temporary files. We're not currently making any good use of it. ### Does this PR introduce _any_ user-facing change? Developer-facing changes only, as this is in a test. ### How was this patch tested? The test continues to execute as expected. Closes #30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix. Authored-by: Erik Krogen <xkrogen@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-04 06:51:54 +00:00
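The directory selection described in SPARK-33214 can be sketched as below. Only the property name `spark.test.cache-dir` comes from the commit message; the object `TestDirSketch`, its method names, and the `"test-spark"` temp-directory prefix are assumptions, and this is not the code in `HiveExternalCatalogVersionsSuite`.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical helper for illustration only.
object TestDirSketch {
  /** Returns the directory to use and whether it should be deleted after the run. */
  def resolve(cacheDir: Option[String]): (File, Boolean) = cacheDir match {
    // A user-supplied cache directory is shared between runs, so keep it.
    case Some(path) => (new File(path), false)
    // Otherwise use a fresh temporary directory and clean it up afterwards.
    case None       => (Files.createTempDirectory("test-spark").toFile, true)
  }

  /** Reads the system property named in the commit message. */
  def fromSystemProperty(): (File, Boolean) =
    resolve(sys.props.get("spark.test.cache-dir"))
}
```

Returning the delete flag together with the directory keeps the "don't delete a shared cache" decision in one place instead of in the teardown code.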
Terry Kim	0ad35ba5f8	[SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `ANALYZE TABLE` is not supported for v2 tables. ### Why are the changes needed? The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior: ```scala sql("create temporary view t as select 1") sql("create database db") sql("create table db.t using csv as select 1") sql("use db") sql("ANALYZE TABLE t compute statistics") // Succeeds ``` With this change, ANALYZE TABLE above fails with the following: ``` org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856) ``` , which is expected since temporary view is resolved first and ANALYZE TABLE doesn't support a temporary view. ### Does this PR introduce _any_ user-facing change? 
After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`. ### How was this patch tested? Updated existing tests. Closes #30229 from imback82/parse_v1table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-04 06:50:37 +00:00
ulysses	1740b29b3f	[SPARK-33323][SQL] Add query resolved check before convert hive relation ### What changes were proposed in this pull request? Add query.resolved before convert hive relation. ### Why are the changes needed? For better error msg. ``` CREATE TABLE t STORED AS PARQUET AS SELECT * FROM ( SELECT c3 FROM ( SELECT c1, c2 from values(1,2) t(c1, c2) ) ) ``` Before this PR, we get such error msg ``` org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: * at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244) at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:392) ``` ### Does this PR introduce _any_ user-facing change? Yes, error msg changed. ### How was this patch tested? Add test. Closes #30230 from ulysses-you/SPARK-33323. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-04 05:01:39 +00:00
Wenchen Fan	034070a23a	Revert "[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size" This reverts commit `0c943cd2fb`.	2020-11-04 12:30:38 +08:00
Chao Sun	d900c6ff49	[SPARK-33293][SQL][FOLLOW-UP] Rename TableWriteExec to TableWriteExecHelper ### What changes were proposed in this pull request? Rename `TableWriteExec` in `WriteToDataSourceV2Exec.scala` to `TableWriteExecHelper`. ### Why are the changes needed? See [discussion](https://github.com/apache/spark/pull/30193#discussion_r516412653). The former is too general. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30235 from sunchao/SPARK-33293-2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-03 14:53:01 -08:00
Max Gekk	bdabf60fb4	[SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for `from_json` ### What changes were proposed in this pull request? Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs. Closes #30201 ### Why are the changes needed? Schemas in JSON format are internal (not documented) and shouldn't be recommended for use. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By linters. Closes #30226 from MaxGekk/from_json-common-schema-parsing-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-02 10:10:24 -08:00
Max Gekk	eecebd0302	[SPARK-33306][SQL][FOLLOWUP] Group DateType and TimestampType together in `needsTimeZone()` ### What changes were proposed in this pull request? In the PR, I propose to group `DateType` and `TimestampType` together in checking time zone needs in the `Cast.needsTimeZone()` method. ### Why are the changes needed? To improve code maintainability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By the existing test `"SPARK-33306: Timezone is needed when cast Date to String"`. Closes #30223 from MaxGekk/WangGuangxin-SPARK-33306-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-02 10:07:18 -08:00
Yuming Wang	789d19cab5	[SPARK-33319][SQL][TEST] Add all built-in SerDes to HiveSerDeReadWriteSuite ### What changes were proposed in this pull request? This pr add all built-in SerDes to `HiveSerDeReadWriteSuite`. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe ### Why are the changes needed? We will upgrade Parquet, ORC and Avro, need to ensure compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30228 from wangyum/SPARK-33319. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-02 08:34:50 -08:00
Cheng Su	e52b858ef7	[SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE ### What changes were proposed in this pull request? As a followup to the comment at https://github.com/apache/spark/pull/29804#issuecomment-700650620 , here we add the physical plan rule DisableUnnecessaryBucketedScan to AQE's AdaptiveSparkPlanExec.queryStagePreparationRules, to make auto bucketed scan work with AQE. The change is mostly in: * `AdaptiveSparkPlanExec.scala`: add physical plan rule `DisableUnnecessaryBucketedScan` * `DisableUnnecessaryBucketedScan.scala`: propagate the logical plan link for the file source scan exec operator, otherwise we lose the logical plan link information when AQE is enabled, and will get an exception [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176). (for example, for the query `SELECT * FROM bucketed_table` with AQE enabled) * `DisableUnnecessaryBucketedScanSuite.scala`: add a new test suite for AQE enabled - `DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and change some of the tests to use `AdaptiveSparkPlanHelper.find/collect`, to make the plan verification work when AQE is enabled. ### Why are the changes needed? It's reasonable to support disabling unnecessary bucketed scans when AQE is enabled; this helps optimize the query in that case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `DisableUnnecessaryBucketedScanSuite`. Closes #30200 from c21/auto-bucket-aqe. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-02 06:44:07 +00:00
Prashant Sharma	6226ccc092	[SPARK-33095] Follow up, support alter table column rename ### What changes were proposed in this pull request? Support rename column for mysql dialect. ### Why are the changes needed? At the moment, it does not work for mysql version 5.x. So, we should throw proper exception for that case. ### Does this PR introduce _any_ user-facing change? Yes, `column rename` with mysql dialect should work correctly. ### How was this patch tested? Added tests for rename column. Ran the tests to pass with both versions of mysql. * `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31` * `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0` Closes #30142 from ScrapCodes/mysql-dialect-rename. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-02 05:03:41 +00:00
Takuya UESHIN	b8a440f098	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-01 20:28:12 +09:00
wangguangxin.cn	69c27f49ac	[SPARK-33306][SQL] Timezone is needed when cast date to string ### What changes were proposed in this pull request? When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark casts a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in `72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309)`. However, the timezone may not be set because `CastBase.needsTimeZone` returns false for this kind of cast. A simple way to reproduce this is ``` spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true ``` when we execute the following sql, ``` select a.d1 from (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a join (select concat('2000-01-0', id) as d2 from range(1, 2)) b on a.d1 = b.d2 ``` it will throw ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) ``` ### Why are the changes needed? As described above, it's a bug here. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add more UT Closes #30213 from WangGuangxin/SPARK-33306. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-31 15:14:46 -07:00
Chao Sun	c51e5fc14b	[SPARK-33293][SQL] Refactor WriteToDataSourceV2Exec and reduce code duplication ### What changes were proposed in this pull request? Refactor `WriteToDataSourceV2Exec` via removing code duplication around write to table logic: - renamed `AtomicTableWriteExec` to `TableWriteExec` so that the table write logic in this trait can be modified and shared with `CreateTableAsSelectExec`, `ReplaceTableAsSelectExec`, `AtomicCreateTableAsSelectExec ` and `AtomicReplaceTableAsSelectExec`. - similar to the above, renamed `writeToStagedTable` to `writeToTable` in `TableWriteExec`. - extended `writeToTable` so that it can handle both staged table as well as non-staged table. ### Why are the changes needed? Simplify the logic and remove duplication, to make this piece of code easier to maintain. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CIs with the existing test coverage. Closes #30193 from sunchao/SPARK-33293. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-31 10:01:31 -07:00
Chao Sun	32b78d3795	[SPARK-33290][SQL] REFRESH TABLE should invalidate cache even though the table itself may not be cached ### What changes were proposed in this pull request? In `CatalogImpl.refreshTable`, this moves the `uncacheQuery` call out of the condition `if (cache.nonEmpty)` so that it will be called whether the table itself is cached or not. ### Why are the changes needed? In a case like the following: ```sql CREATE TABLE t ...; CREATE VIEW t1 AS SELECT * FROM t; REFRESH TABLE t; ``` If the table `t` is refreshed, the view `t1`, which depends on `t`, will not be invalidated. This could lead to incorrect results and is similar to [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). On the other hand, if we have: ```sql CREATE TABLE t ...; CACHE TABLE t; CREATE VIEW t1 AS SELECT * FROM t; REFRESH TABLE t; ``` Then the view `t1` will be refreshed. The behavior is somewhat inconsistent. ### Does this PR introduce _any_ user-facing change? Yes. With this change, any cache that depends on the refreshed table is invalidated. Previously this happened only if the table itself was cached. ### How was this patch tested? Added a new UT for the case. Closes #30187 from sunchao/SPARK-33290. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-31 09:49:18 -07:00
ulysses	d59f6a7095	[SPARK-33294][SQL] Add query resolved check before analyze InsertIntoDir ### What changes were proposed in this pull request? Add `query.resolved` before analyze `InsertIntoDir`. ### Why are the changes needed? For better error msg. ``` INSERT OVERWRITE DIRECTORY '/tmp/file' USING PARQUET SELECT * FROM ( SELECT c3 FROM ( SELECT c1, c2 from values(1,2) t(c1, c2) ) ) ``` Before this PR, we get such error msg ``` org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: * at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244) at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:392) ``` ### Does this PR introduce _any_ user-facing change? Yes, error msg changed. ### How was this patch tested? New test. Closes #30197 from ulysses-you/SPARK-33294. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-30 08:18:10 +00:00
angerszhu	0c943cd2fb	[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size ### What changes were proposed in this pull request? Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size, since we can't decide whether it's a bug and some users need the behavior to be the same as Hive. ### Why are the changes needed? Provides a compatible choice between the historical behavior and Hive. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #30156 from AngersZhuuuu/SPARK-33284. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-30 14:11:25 +09:00
Max Gekk	343e0bb3ad	[SPARK-33286][SQL] Improve the error message about schema parsing by `from_json/from_csv` # What changes were proposed in this pull request? In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers: - DataType.fromJson (except CSV) - CatalystSqlParser.parseDataType - CatalystSqlParser.parseTableSchema Before the changes, `from_json` does not show error messages from the first parser in the chain that could mislead users. ### Why are the changes needed? Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example: ```scala val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}""" df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show() ``` The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out: ``` mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0) == SQL == {"fields": [{"a":123}], "type": "struct"} ^^^ org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '{' expecting {'ADD', 'AFTER', ... }(line 1, pos 0) == SQL == {"fields": [{"a":123}], "type": "struct"} ^^^ ``` ### Does this PR introduce _any_ user-facing change? Yes, after the changes for the example above: ``` Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field. Failed fallback parsing: Cannot parse the data type: mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0) == SQL == {"fields": [{"a":123}], "type": "struct"} ^^^ Failed fallback parsing: mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0) == SQL == {"fields": [{"a":123}], "type": "struct"} ^^^ ``` ### How was this patch tested? - By existing tests suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`. - Add new test to `JsonFunctionsSuite`. - Re-gen results for `json-functions.sql`. 
Closes #30183 from MaxGekk/fromDDL-error-msg. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-30 11:18:47 +09:00
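The error-combining behavior described in SPARK-33286 can be pictured as a small fallback chain. The following is a plain-Python sketch, not Spark code: the three parser functions are hypothetical stand-ins for `DataType.fromJson`, `CatalystSqlParser.parseDataType` and `CatalystSqlParser.parseTableSchema`, and the message wording only imitates the example in the commit message.

```python
def parse_with_fallbacks(text, parsers):
    """Try each (label, parser) in order; if all fail, report every error."""
    errors = []
    for label, parser in parsers:
        try:
            return parser(text)
        except ValueError as exc:
            errors.append(f"{label}: {exc}")
    # The first parser's message stays first, so fallbacks cannot hide it.
    raise ValueError("\n".join(errors))


# Hypothetical stand-ins for the real schema parsers; each one always fails.
def from_json(text):
    raise ValueError("Failed to convert the JSON string to a field.")


def parse_data_type(text):
    raise ValueError("mismatched input '{'")


def parse_table_schema(text):
    raise ValueError("mismatched input '{'")


PARSERS = [
    ("Cannot parse the schema in JSON format", from_json),
    ("Failed fallback parsing: Cannot parse the data type", parse_data_type),
    ("Failed fallback parsing", parse_table_schema),
]
```

Calling `parse_with_fallbacks(schema_text, PARSERS)` raises one `ValueError` whose message lists all three failures in order, which is the shape of the improved message.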
Dongjoon Hyun	838791bf0b	[SPARK-33292][SQL] Make Literal ArrayBasedMapData string representation disambiguous ### What changes were proposed in this pull request? This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`. ### Why are the changes needed? Literal ArrayBasedMapData has inconsistent string representation from `LogicalPlan` to `Optimized Logical Plan/Physical Plan`. Also, the representation at `Optimized Logical Plan` and `Physical Plan` is ambiguous like `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`. BEFORE ```scala scala> spark.version res0: String = 2.4.7 scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true) == Parsed Logical Plan == 'Project [1 AS a#0, 'map(key1, value1) AS b#1] +- OneRowRelation == Analyzed Logical Plan == a: int, b: map<string,string> Project [1 AS a#0, map(key1, value1) AS b#1] +- OneRowRelation == Optimized Logical Plan == Project [1 AS a#0, keys: [key1], values: [value1] AS b#1] +- OneRowRelation == Physical Plan == (1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1] +- Scan OneRowRelation[] ``` AFTER* ```scala scala> spark.version res0: String = 3.1.0-SNAPSHOT scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true) == Parsed Logical Plan == 'Project [1 AS a#4, 'map(key1, value1) AS b#5] +- OneRowRelation == Analyzed Logical Plan == a: int, b: map<string,string> Project [1 AS a#4, map(key1, value1) AS b#5] +- OneRowRelation == Optimized Logical Plan == Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5] +- OneRowRelation == Physical Plan == (1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5] +- (1) Scan OneRowRelation[] ``` ### Does this PR introduce _any_ user-facing change? Yes. This changes the query plan's string representation in `explain` command and UI. However, this is a bug fix. ### How was this patch tested? Pass the CI with the newly added test case. Closes #30190 from dongjoon-hyun/SPARK-33292. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-29 19:10:01 -07:00
luluorta	cbd3fdea62	[SPARK-33008][SQL] Division by zero on divide-like operations returns incorrect result ### What changes were proposed in this pull request? In ANSI mode, when a division by zero occurs performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard. ### Why are the changes needed? Result corrupt. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? added UT + existing UTs (improved) Closes #29882 from luluorta/SPARK-33008. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-29 16:44:17 +00:00
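The intended semantics of SPARK-33008 can be sketched in a few lines of plain Python. This is an illustration only: `divide` and its `ansi_enabled` flag are hypothetical names, SQL NULL is modeled as `None`, and returning NULL on division by zero outside ANSI mode is stated here as an assumption about the pre-existing behavior.

```python
def divide(dividend, divisor, ansi_enabled):
    """Divide two numbers; SQL NULL is modeled as None."""
    if dividend is None or divisor is None:
        return None
    if divisor == 0:
        if ansi_enabled:
            # ANSI mode: surface the error instead of returning a value.
            raise ArithmeticError("divide by zero")
        return None  # assumed non-ANSI behavior: keep returning NULL
    return dividend / divisor
```

The same branch would apply to the other divide-like operations named in the commit (IntegralDivide, Remainder, Pmod).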
Liang-Chi Hsieh	056b62264b	[SPARK-33263][SS] Configurable StateStore compression codec ### What changes were proposed in this pull request? This patch proposes to make StateStore compression codec configurable. ### Why are the changes needed? Currently the compression codec of StateStore is not configurable and hard-coded to be lz4. It is better if we can follow Spark other modules to configure the compression codec of StateStore. For example, we can choose zstd codec and zstd is configurable with different compression level. ### Does this PR introduce _any_ user-facing change? Yes, after this change users can config different codec for StateStore. ### How was this patch tested? Unit test. Closes #30162 from viirya/SPARK-33263. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-29 07:44:44 -07:00
Max Gekk	b409025641	[SPARK-33281][SQL] Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression. ### Why are the changes needed? To unify output of the `schema_of_json()` and `schema_of_csv()`. ### Does this PR introduce _any_ user-facing change? Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the format of the schema shouldn't matter much. Before: ``` > SELECT schema_of_csv('1,abc'); struct<_c0:int,_c1:string> ``` After: ``` > SELECT schema_of_csv('1,abc'); STRUCT<`_c0`: INT, `_c1`: STRING> ``` ### How was this patch tested? By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`. Closes #30180 from MaxGekk/schema_of_csv-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 21:02:10 +09:00
Max Gekk	9d5e48ea95	[SPARK-33270][SQL] Return SQL schema instead of Catalog string from the `SchemaOfJson` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression. ### Why are the changes needed? In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`. Here is the example: ```scala val in = Seq("""{"a b": 1}""").toDS() in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed") ``` raises the exception: ``` == SQL == struct<a b:bigint> ------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131) at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537) at org.apache.spark.sql.functions$.from_json(functions.scala:4141) ``` ### Does this PR introduce _any_ user-facing change? Yes. For example, `schema_of_json` for the input `{"col":0}`. Before: `struct<col:bigint>` After: `STRUCT<`col`: BIGINT>` ### How was this patch tested? By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`. Closes #30172 from MaxGekk/schema_of_json-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 10:30:41 +09:00
Nathan Wreggit	c592ae6ed8	[SQL][MINOR] Update from_unixtime doc ### What changes were proposed in this pull request? This PR fixes from_unixtime documentation to show that fmt is optional parameter. ### Does this PR introduce _any_ user-facing change? Yes, documentation update. Before change: ![image](https://user-images.githubusercontent.com/4176173/97497659-18c6cc80-1928-11eb-93d8-453ef627ac7c.png) After change: ![image](https://user-images.githubusercontent.com/4176173/97496153-c5537f00-1925-11eb-8102-457e85e019d5.png) ### How was this patch tested? Style check using: ./dev/run-tests Manual check and screenshotting with: ./sql/create-docs.sh Manual verification of behavior with latest spark-sql binary. Closes #30176 from Obbay2/from_unixtime_doc. Authored-by: Nathan Wreggit <obbay2@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 10:28:50 +09:00
Wenchen Fan	2639ad43cb	[SPARK-33272][SQL] prune the attributes mapping in QueryPlan.transformUpWithNewOutput ### What changes were proposed in this pull request? For complex query plans, `QueryPlan.transformUpWithNewOutput` will keep accumulating the attributes mapping to be propagated, which may hurt performance. This PR prunes the attributes mapping before propagating. ### Why are the changes needed? A simple perf improvement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #30173 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-29 07:37:16 +09:00
Jungtaek Lim (HeartSaVioR)	a744fea3be	[SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values contains null ### What changes were proposed in this pull request? This PR proposes to fix the NPE issue on `In` filter when one of values contain null. In real case, you can trigger this issue when you try to push down the filter with `in (..., null)` against V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in HashMap, which leverages hash code on the key, hence it could trigger the NPE issue. ### Why are the changes needed? This is an obvious bug as `In` filter doesn't care about null value when calculating hash code. ### Does this PR introduce _any_ user-facing change? Yes, previously the query with having `null` in "in" condition against data source V2 source table supporting push down filter failed with NPE, whereas after the PR the query will not fail. ### How was this patch tested? UT added. The new UT fails without the PR and passes with the PR. Closes #30170 from HeartSaVioR/SPARK-33267. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-28 10:00:29 -07:00
Takeshi Yamamuro	a6216e2446	[SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType ### What changes were proposed in this pull request? This PR intends to fix bugs for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows: ``` >>> from pyspark.sql import Row >>> from pyspark.sql.functions import col >>> from pyspark.sql.types import * >>> from pyspark.testing.sqlutils import * >>> >>> row = Row(point=ExamplePoint(1.0, 2.0)) >>> df = spark.createDataFrame([row]) >>> df.select(col("point").cast(PythonOnlyUDT())) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select jdf = self._jdf.select(self._jcols(cols)) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco return f(a, **kw) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o44.select. : java.lang.NullPointerException at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84) at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96) at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290) ``` The root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. 
To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`. ### Why are the changes needed? Bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30169 from maropu/FixPythonUDTCast. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-28 08:33:02 -07:00
zky.zhoukeyong	b26ae98407	[SPARK-33208][SQL] Update the document of SparkSession#sql Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b ### What changes were proposed in this pull request? Update the document of SparkSession#sql to mention that this API eagerly runs DDL/DML commands, but not SELECT queries. ### Why are the changes needed? To clarify the behavior of SparkSession#sql. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Not needed. Closes #30168 from waitinfuture/master. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-28 13:17:28 +00:00
gengjiaan	3c3ad5f7c0	[SPARK-32934][SQL] Improve the performance for NTH_VALUE and refactor the OffsetWindowFunction ### What changes were proposed in this pull request? Spark SQL supports some window functions like `NTH_VALUE`. If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations. For example, if we execute the SQL shown below: ``` SELECT NTH_VALUE(col, 2) OVER(ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM tab; ``` For rows whose row number is greater than 1, the output is a fixed value; otherwise it is null. So we only need to calculate the value once and check whether the row number is less than 2. `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` is simpler. ### Why are the changes needed? Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #29800 from beliefer/optimize-nth_value. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-28 06:40:23 +00:00
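The shortcut described in SPARK-32934 for a running frame can be sketched as follows. This is a plain-Python illustration rather than the Spark implementation; `nth_value_running` is a hypothetical name, and the sketch assumes a row-based frame, a 1-based `n`, and no IGNORE NULLS handling.

```python
def nth_value_running(values, n):
    """NTH_VALUE(col, n) over a frame of UNBOUNDED PRECEDING .. CURRENT ROW.

    `values` holds one partition's column values, already in window order.
    """
    # The n-th value of the partition is looked up exactly once.
    nth = values[n - 1] if len(values) >= n else None
    # Rows before the n-th see a frame that is too short, so they get None.
    return [nth if row_number >= n else None
            for row_number in range(1, len(values) + 1)]
```

For `n = 2`, every row from the second onward receives the same precomputed value, which is why the frame does not need to be rescanned per row.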
allisonwang-db	9fb45361fd	[SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts ### What changes were proposed in this pull request? This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix. ### Why are the changes needed? A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts. ``` Sort(orders, global = True, ...) Sort(orders, global = False, ...) ``` Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Unit tests Closes #30093 from allisonwang-db/fix-sort. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-28 05:51:47 +00:00
Terry Kim	528160f001	[SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? The current behavior is not consistent between v1 and v2 commands when resolving a temp view. In v2, the `t` in the following example is resolved to a table: ```scala sql("CREATE TABLE testcat.ns.t (id bigint) USING foo") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE testcat.ns") sql("DROP TABLE t") // 't' is resolved to testcat.ns.t ``` whereas in v1, the `t` is resolved to a temp view: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("DROP TABLE t") // 't' is resolved to a temp view ``` ### Does this PR introduce _any_ user-facing change? After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`, consistent with v1 behavior. ### How was this patch tested? Added a new test Closes #30079 from imback82/drop_table_consistent. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-28 05:44:55 +00:00
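The "temp view first" resolution order that SPARK-33174 applies to `DROP TABLE` can be sketched with a toy resolver. Everything here is hypothetical (the function name, the set-based catalogs, the error text); it only illustrates the lookup order, not Spark's analyzer.

```python
def resolve_for_drop(name, temp_views, current_namespace, catalog_tables):
    """Resolve a single-part name: temp views first, then the current namespace."""
    if name in temp_views:
        return ("temp view", name)
    qualified = f"{current_namespace}.{name}"
    if qualified in catalog_tables:
        return ("table", qualified)
    raise LookupError(f"Table or view not found: {name}")
```

With a temp view `t` registered and `testcat.ns` as the current namespace, `resolve_for_drop("t", {"t"}, "testcat.ns", {"testcat.ns.t"})` picks the temp view, matching the v1 behavior the commit adopts for v2.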
Jungtaek Lim (HeartSaVioR)	fcf8aa59b5	[SPARK-33240][SQL] Fail fast when fails to instantiate configured v2 session catalog ### What changes were proposed in this pull request? This patch proposes to change the behavior on failing fast when Spark fails to instantiate configured v2 session catalog. ### Why are the changes needed? The Spark behavior is against the intention of the end users - if end users configure session catalog which Spark would fail to initialize, Spark would swallow the error with only logging the error message and silently use the default catalog implementation. This follows the voices on [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) in dev mailing list. ### Does this PR introduce _any_ user-facing change? Yes. After the PR Spark will fail immediately if Spark fails to instantiate configured session catalog. ### How was this patch tested? New UT added. Closes #30147 from HeartSaVioR/SPARK-33240. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-28 03:31:11 +00:00
Ankur Dave	3f2a2b5fe6	[SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream ### What changes were proposed in this pull request? The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`. ```sql SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string)) FROM table_4 GROUP BY bigint_col_1, bigint_col_9, bigint_col_9 ``` When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later. Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680. The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes. Closes #30160 from ankurdave/SPARK-33260. Authored-by: Ankur Dave <ankurdave@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-27 13:20:22 -07:00
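The hazard described in SPARK-33260 (a lazy collection deferring side effects on a shared context) can be reproduced in miniature. This is a Python analogue, not the Scala code: `Context` and `gen_code` are hypothetical stand-ins for the codegen context and `genCode(ctx)`, and Python's lazy `map` defers every element, whereas a Scala `Stream` evaluates its head eagerly and only the tail lazily.

```python
class Context:
    """Hypothetical stand-in for a codegen context that hands out names."""

    def __init__(self):
        self.counter = 0

    def fresh_name(self, prefix):
        self.counter += 1
        return f"{prefix}_{self.counter}"


def gen_code(expr, ctx):
    # Side effect: every call advances the context's counter.
    return ctx.fresh_name(expr)


ordering = ["a", "b"]

lazy_ctx = Context()
lazy_codes = map(lambda e: gen_code(e, lazy_ctx), ordering)  # nothing runs yet
counter_before_use = lazy_ctx.counter   # still 0: side effects are deferred
lazy_result = list(lazy_codes)          # side effects happen only now

eager_ctx = Context()
eager_codes = [gen_code(e, eager_ctx) for e in ordering]  # forced immediately
counter_after_eager = eager_ctx.counter
```

Forcing the collection at the point where it is built, as the list comprehension does, corresponds to the fix of materializing `ordering` before the context is used elsewhere.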
Huaxin Gao	f284218dae	[SPARK-33137][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (Postgres dialect) ### What changes were proposed in this pull request? Override the default SQL strings in Postgres Dialect for: - ALTER TABLE UPDATE COLUMN TYPE - ALTER TABLE UPDATE COLUMN NULLABILITY Add new docker integration test suite `jdbc/v2/PostgreSQLIntegrationSuite.scala` ### Why are the changes needed? supports Postgres specific ALTER TABLE syntax. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test `PostgreSQLIntegrationSuite` Closes #30089 from huaxingao/postgres_docker. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-27 15:04:53 +00:00
tanel.kiis@gmail.com	281f99c70b	[SPARK-33225][SQL] Extract AliasHelper trait ### What changes were proposed in this pull request? Extract methods related to handling Aliases to a trait. ### Why are the changes needed? Avoid code duplication ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs cover this Closes #30134 from tanelk/SPARK-33225_aliasHelper. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-27 22:53:05 +09:00
xuewei.linxuewei	537a49fc09	[SPARK-33140][SQL] remove SQLConf and SparkSession in all sub-class of Rule[QueryPlan] ### What changes were proposed in this pull request? Since Issue [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, and SQLConf.get and SparkSession.active are more reliable. We are trying to refine the existing code usage of passing SQLConf and SparkSession into sub-class of Rule[QueryPlan]. In this PR. * remove SQLConf from ctor-parameter of all sub-class of Rule[QueryPlan]. * using SQLConf.get to replace the original SQLConf instance. * remove SparkSession from ctor-parameter of all sub-class of Rule[QueryPlan]. * using SparkSession.active to replace the original SparkSession instance. ### Why are the changes needed? Code refine. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing test Closes #30097 from leanken/leanken-SPARK-33140. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-27 12:40:57 +00:00
angerszhu	e43cd8ccef	[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive ### What changes were proposed in this pull request? In the current Spark script transformation with Hive serde mode, the result in the schema-less case differs from Hive. This PR keeps the result the same as Hive's script transform serde. #### Hive Script Transform with serde in schema-less mode ``` hive> create table t (c0 int, c1 int, c2 int); hive> INSERT INTO t VALUES (1, 1, 1); hive> INSERT INTO t VALUES (2, 2, 2); hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; hive> DESCRIBE v; key string value string hive> SELECT * FROM v; 1 1 1 2 2 2 hive> SELECT key FROM v; 1 2 hive> SELECT value FROM v; 1 1 2 2 ``` #### Spark script transform with Hive serde in schema-less mode ``` hive> create table t (c0 int, c1 int, c2 int); hive> INSERT INTO t VALUES (1, 1, 1); hive> INSERT INTO t VALUES (2, 2, 2); hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; hive> SELECT * FROM v; 1 1 2 2 ``` No serde mode in Hive (ROW FORMAT DELIMITED) ![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png) ### Why are the changes needed? Keep the same behavior as Hive script transform. ### Does this PR introduce _any_ user-facing change? Before this PR, with Hive serde script transform ``` select transform() USING 'cat' from ( select 1, 2, 3, 4 ) tmp key value 1 2 ``` After ``` select transform() USING 'cat' from ( select 1, 2, 3, 4 ) tmp key value 1 2 3 4 ``` ### How was this patch tested? UT Closes #29421 from AngersZhuuuu/SPARK-32388. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 09:25:53 +09:00
Steve Loughran	02fa19f102	[SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID" ### What changes were proposed in this pull request? This reinstates the old option `spark.sql.sources.writeJobUUID` to set a unique jobId in the jobconf so that hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter match what JobID requires, but as they are generated per-thread, they may not always be unique within a cluster. ### Why are the changes needed? If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash. ### Does this PR introduce _any_ user-facing change? Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away" ### How was this patch tested? Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost. Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID Closes #30141 from steveloughran/SPARK-33230-jobId. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-26 12:31:05 -07:00
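The clash described in SPARK-33230 comes from job IDs with one-second resolution. The following plain-Python sketch contrasts such an ID with a UUID; both helper names and the timestamp format are hypothetical and are not taken from Hadoop or Spark.

```python
import time
import uuid


def timestamp_job_id(now=None):
    """Job ID with one-second resolution (illustrative format only)."""
    now = time.time() if now is None else now
    return time.strftime("%Y%m%d%H%M%S", time.gmtime(now))


def uuid_job_id():
    """Random UUID, independent of the clock."""
    return str(uuid.uuid4())
```

Two jobs started at `1000.1` and `1000.9` seconds get the same `timestamp_job_id`, while two calls to `uuid_job_id()` yield different values, so a committer that builds its temporary paths from the UUID avoids sharing a directory.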
Cheng Su	1042d49bf9	[SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query) ### What changes were proposed in this pull request? This PR enables auto bucketed table scan by default, except for cached queries, where it stays disabled (similar to AQE). Auto scan is disabled for cached queries because the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregates. ### Why are the changes needed? Enabling auto bucketed table scan by default is useful as it can optimize queries automatically under the hood, without user interaction. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also changed a bunch of unit tests which should disable auto bucketed scan to make them work. Closes #30138 from c21/enable-auto-bucket. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-26 20:23:24 +09:00
Yuning Zhang	a21945ce6c	[SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations take effect at runtime ### What changes were proposed in this pull request? Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime. ### Why are the changes needed? `spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect. ### Does this PR introduce _any_ user-facing change? Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect. ### How was this patch tested? modified unit test Closes #30108 from yuningzh-db/dynamic-analyzer-max-iterations. Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-26 16:19:06 +09:00
Cheng Su	d87a0bb2ca	[SPARK-32862][SS] Left semi stream-stream join ### What changes were proposed in this pull request? This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`): * For left side input row, check if there's a match on right side state store. * if there's a match, output the left side row, but do not put the row in left side state store (no need to put in state store). * if there's no match, output nothing, but put the row in left side state store (with "matched" field set to false in state store). * For right side input row, check if there's a match on left side state store. * For all matched left rows in state store, output the rows with "matched" field as false. Set all left rows with "matched" field to be true. Only output the left side rows matched for the first time to guarantee left semi join semantics. * State store eviction: evict rows from left/right side state store below watermark, same as inner join. Note a followup optimization can be to evict matched left side rows from state store earlier, even when the rows are still above watermark. However this needs more change in `SymmetricHashJoinStateManager`, so will leave this as a followup. ### Why are the changes needed? Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we see a lot of users using left semi stream-stream join (not Spark Structured Streaming), e.g. I want to get the ad impression (join left side) which has a click (join right side), but I don't care how many clicks per ad (left semi semantics). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`. Closes #30076 from c21/stream-join. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-10-26 13:33:06 +09:00
HyukjinKwon	369cc614f3	Revert "[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive" This reverts commit `56ab60fb7a`.	2020-10-26 11:38:48 +09:00
angerszhu	56ab60fb7a	[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive ### What changes were proposed in this pull request? In the current Spark script transformation with Hive serde mode, the result in the schema-less case differs from Hive. This PR keeps the result the same as Hive script transform with serde. #### Hive Script Transform with serde in schema-less mode ``` hive> create table t (c0 int, c1 int, c2 int); hive> INSERT INTO t VALUES (1, 1, 1); hive> INSERT INTO t VALUES (2, 2, 2); hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; hive> DESCRIBE v; key string value string hive> SELECT * FROM v; 1 1 1 2 2 2 hive> SELECT key FROM v; 1 2 hive> SELECT value FROM v; 1 1 2 2 ``` #### Spark script transform with Hive serde in schema-less mode ``` hive> create table t (c0 int, c1 int, c2 int); hive> INSERT INTO t VALUES (1, 1, 1); hive> INSERT INTO t VALUES (2, 2, 2); hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; hive> SELECT * FROM v; 1 1 2 2 ``` No serde mode in Hive (ROW FORMAT DELIMITED) ![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png) ### Why are the changes needed? Keep the same behavior as Hive script transform. ### Does this PR introduce _any_ user-facing change? Before this PR, with Hive serde script transform: ``` select transform() USING 'cat' from ( select 1, 2, 3, 4 ) tmp key value 1 2 ``` After: ``` select transform() USING 'cat' from ( select 1, 2, 3, 4 ) tmp key value 1 2 3 4 ``` ### How was this patch tested? UT Closes #29421 from AngersZhuuuu/SPARK-32388. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-26 11:20:29 +09:00
Takeshi Yamamuro	87b498462b	[SPARK-33228][SQL] Don't uncache data when replacing a view having the same logical plan ### What changes were proposed in this pull request? SPARK-30494 updated the `CreateViewCommand` code to implicitly drop cache when replacing an existing view. But this change drops cache even when replacing a view having the same logical plan. The following sequence of queries reproduces this: ``` // Spark v2.4.6+ scala> val df = spark.range(1).selectExpr("id a", "id b") scala> df.cache() scala> df.explain() == Physical Plan == *(1) ColumnarToRow +- InMemoryTableScan [a#2L, b#3L] +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] +- *(1) Range (0, 1, step=1, splits=4) scala> df.createOrReplaceTempView("t") scala> sql("select * from t").explain() == Physical Plan == *(1) ColumnarToRow +- InMemoryTableScan [a#2L, b#3L] +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] +- *(1) Range (0, 1, step=1, splits=4) // If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's swept away scala> df.createOrReplaceTempView("t") scala> sql("select * from t").explain() == Physical Plan == *(1) Project [id#0L AS a#2L, id#0L AS b#3L] +- *(1) Range (0, 1, step=1, splits=4) // Until v2.4.6 scala> val df = spark.range(1).selectExpr("id a", "id b") scala> df.cache() scala> df.createOrReplaceTempView("t") scala> sql("select * from t").explain() 20/10/23 22:33:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException == Physical Plan == *(1) InMemoryTableScan [a#2L, b#3L] +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] +- *(1) Range (0, 1, step=1, splits=4) scala> df.createOrReplaceTempView("t") scala> sql("select * from t").explain() == Physical Plan == *(1) InMemoryTableScan [a#2L, b#3L] 
+- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] +- *(1) Range (0, 1, step=1, splits=4) ``` ### Why are the changes needed? bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30140 from maropu/FixBugInReplaceView. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-25 16:15:55 -07:00
Jungtaek Lim (HeartSaVioR)	0c66a88d1d	[SPARK-29438][SS][FOLLOWUP] Add regression tests for Streaming Aggregation and flatMapGroupsWithState ### What changes were proposed in this pull request? This patch adds new UTs to prevent SPARK-29438 for streaming aggregation as well as flatMapGroupsWithState, as agreed in the review comment quoted here: https://github.com/apache/spark/pull/26162#issuecomment-576929692 > LGTM for this PR. But on a additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations? For now they're not broken, but even better if we have UTs to prevent the case for the future. ### Why are the changes needed? New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when they are placed on the right side of UNION and the number of partitions changes on the left side of UNION. Please refer to SPARK-29438 for more details. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UTs. Closes #27333 from HeartSaVioR/SPARK-29438-add-regression-test. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-10-24 15:36:41 -07:00
Kent Yao	82d500a05c	[SPARK-33193][SQL][TEST] Hive ThriftServer JDBC Database MetaData API Behavior Auditing ### What changes were proposed in this pull request? Add a test case to audit all JDBC metadata behaviors to check and prevent potential APIs silent changing from both the upstream hive-jdbc module or the Spark thrift server side. Forked from my kyuubi project here https://github.com/yaooqinn/kyuubi/blob/master/externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/operation/SparkOperationSuite.scala ### Why are the changes needed? Make the SparkThriftServer safer to evolve. ### Does this PR introduce _any_ user-facing change? dev only ### How was this patch tested? new tests Closes #30101 from yaooqinn/SPARK-33193. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-23 13:34:33 -07:00
Kent Yao	e21bb710e5	[SPARK-32991][SQL] Use conf in shared state as the original configuraion for RESET ### What changes were proposed in this pull request? #### case the case here covers the static and dynamic SQL configs behavior in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir` the case can be found here - https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala ```scala import java.lang.reflect.Field import org.apache.spark.sql.SparkSession import org.apache.spark.{SparkConf, SparkContext} object WarehouseSCBeforeSS extends App { val wh = "spark.sql.warehouse.dir" val td = "spark.sql.globalTempDatabase" val custom = "spark.sql.custom" val conf = new SparkConf() .setMaster("local") .setAppName("SPARK-32991") .set(wh, "./data1") .set(td, "bob") val sc = new SparkContext(conf) val spark = SparkSession.builder() .config(wh, "./data2") .config(td, "alice") .config(custom, "kyao") .getOrCreate() val confField: Field = spark.sharedState.getClass.getDeclaredField("conf") confField.setAccessible(true) private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf] println() println(s"=====> SharedState: $wh=${shared.get(wh)}") println(s"=====> SharedState: $td=${shared.get(td)}") println(s"=====> SharedState: $custom=${shared.get(custom, "")}") println(s"=====> SessionState: $wh=${spark.conf.get(wh)}") println(s"=====> SessionState: $td=${spark.conf.get(td)}") println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}") val spark2 = SparkSession.builder().config(td, "fred").getOrCreate() println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}") println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}") println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}") SparkSession.setActiveSession(spark) spark.sql("RESET") println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}") println(s"=====> 
SessionState RESET: $td=${spark.conf.get(td)}") println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}") val spark3 = SparkSession.builder().getOrCreate() println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}") println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}") println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}") } ``` #### outputs and analysis ``` // 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession //=====> SharedState: spark.sql.warehouse.dir=./data1 // 2. ⏬ //=====> SharedState: spark.sql.globalTempDatabase=alice //=====> SharedState: spark.sql.custom=kyao //=====> SessionState: spark.sql.warehouse.dir=./data2 //=====> SessionState: spark.sql.globalTempDatabase=alice //=====> SessionState: spark.sql.custom=kyao //=====> SessionState 2: spark.sql.warehouse.dir=./data2 //=====> SessionState 2: spark.sql.globalTempDatabase=alice //=====> SessionState 2: spark.sql.custom=kyao // 2'.🔼 OK until here // 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed //=====> SessionState RESET: spark.sql.warehouse.dir=./data1 //=====> SessionState RESET: spark.sql.globalTempDatabase=bob //=====> SessionState RESET: spark.sql.custom= // 4. Then the SparkSessions created after RESET will be corrected. //=====> SessionState 3: spark.sql.warehouse.dir=./data1 //=====> SessionState 3: spark.sql.globalTempDatabase=bob //=====> SessionState 3: spark.sql.custom= ``` In this PR, we gather all valid config to the cloned conf of `sharedState` during being constructed, well, actually only `spark.sql.warehouse.dir` is missing. Then we use this conf as defaults for `RESET` Command. `SparkSession.clearActiveSession/clearDefaultSession` will make the shared state invisible and unsharable. They will be internal only soon (confirmed with Wenchen), so cases with them called will not be a problem. ### Why are the changes needed? 
bugfix for programming API to call RESET while users creating SparkContext first and config SparkSession later. ### Does this PR introduce _any_ user-facing change? yes, before this change when you use programming API and call RESET, all configs will be reset to SparkContext.conf, now they go to SparkSession.sharedState.conf ### How was this patch tested? new tests Closes #30045 from yaooqinn/SPARK-32991. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-23 05:52:38 +00:00
Liang-Chi Hsieh	87b32f65ef	[MINOR][DOCS][TESTS] Fix PLAN_CHANGE_LOG_LEVEL document ### What changes were proposed in this pull request? `PLAN_CHANGE_LOG_LEVEL` config document is wrong. This is to fix it. ### Why are the changes needed? Fix wrong doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Only doc change. Closes #30136 from viirya/minor-sqlconf. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-23 13:35:46 +09:00
Ankit Srivastava	3819d39607	[SPARK-32998][BUILD] Add ability to override default remote repos with internal one ### What changes were proposed in this pull request? - Building Spark internally in orgs where access to the outside internet is not allowed takes a long time, because unsuccessful attempts are made to download artifacts from repositories which are not accessible. The unsuccessful attempts unnecessarily add a significant amount of time to the build. I have seen a difference of up to 1hr for some runs. - Add one environment variable that should be present at the start of the build; if it exists, it overrides the default repos defined in the code and scripts. envVariables: - DEFAULT_ARTIFACT_REPOSITORY=https://artifacts.internal.com/libs-release/ ### Why are the changes needed? To allow orgs to build Spark internally without relying on external repositories for artifact downloads. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Multiple builds with and without the env variable set. Closes #29874 from ankits/SPARK-32998. Authored-by: Ankit Srivastava <ankit_srivastava@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-22 16:35:55 -07:00
Max Gekk	a03d77d326	[SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96` ### What changes were proposed in this pull request? 1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`. 2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`. 3. Change handling the metadata key in read: - If there is no the key in parquet metadata, take the rebase mode from the SQL config: `spark.sql.legacy.parquet.int96RebaseModeInRead` - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for INT96 type. - For files written by Spark >= 3.1.0, if the `org.apache.spark.legacyINT96` presents in metadata, perform rebasing otherwise don't. ### Why are the changes needed? - To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121. - To have the implementation similar to `org.apache.spark.legacyDateTime` - To minimise impact on other subsystems that are based on file sizes like gathering statistics. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified test in `ParquetIOSuite` Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 15:57:03 +00:00
yangjie01	b38f3a5557	[SPARK-32978][SQL] Make sure the number of dynamic part metric is correct ### What changes were proposed in this pull request? The purpose of this PR is to resolve SPARK-32978. The root cause of the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions for each task, which makes it impossible to remove duplicates on the driver side. The main change in this PR is to report partitionValues to the driver and remove duplicates there, so that the number of dynamic part metric is correct. ### Why are the changes needed? The number of dynamic part metric displayed on the UI should be correct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test case based on the scenario described in SPARK-32978. Closes #30026 from LuciferYang/SPARK-32978. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 14:01:07 +00:00
angerszhu	a1629b4a57	[SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location ### What changes were proposed in this pull request? Support using an HDFS location for `spark.sql.hive.metastore.jars`. To set the Hive metastore jars by path, set `spark.sql.hive.metastore.jars=path` and put the real paths in `spark.sql.hive.metastore.jars.path`. This is needed because the existing code uses `File.pathSeparator` to split the path, and `File.pathSeparator` is `:` on Unix, which would split an HDFS location such as `hdfs://nameservice/xx`. The new config `spark.sql.hive.metastore.jars.path` therefore takes comma-separated paths. Both ways remain supported. ### Why are the changes needed? All Spark apps can fetch the internal-version Hive jars from an HDFS location, with no need to distribute them to every node. ### Does this PR introduce _any_ user-facing change? Users can use an HDFS location to store Hive metastore jars. ### How was this patch tested? Manually tested. Closes #29881 from AngersZhuuuu/SPARK-32852. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 13:53:01 +00:00
Prashant Sharma	8cae7f88b0	[SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect) ### What changes were proposed in this pull request? Override the default SQL strings for: ALTER TABLE UPDATE COLUMN TYPE ALTER TABLE UPDATE COLUMN NULLABILITY in the following MySQL JDBC dialect according to official documentation. Write MySQL integration tests for JDBC. ### Why are the changes needed? Improved code coverage and support mysql dialect for jdbc. ### Does this PR introduce _any_ user-facing change? Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect) ### How was this patch tested? Added tests. Closes #30025 from ScrapCodes/mysql-dialect. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 13:51:42 +00:00
Xuedong Luan	d9ee33cfb9	[SPARK-26533][SQL] Support query auto timeout cancel on thriftserver ### What changes were proposed in this pull request? Support automatically cancelling queries that run too long on thriftserver. This is a rework of #28991 and the credit should go to the original author, leoluan2009. Closes #28991 ### Why are the changes needed? In some cases, we use thriftserver as a long-running application. Sometimes we want no query to run longer than a given time. In these cases, we can enable auto cancel for time-consuming queries, which lets us release resources for other queries to run. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #29933 from maropu/pr28991. Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Luan <luanxuedong2009@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-22 17:23:10 +09:00
gengjiaan	eb33bcb4b2	[SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE ### What changes were proposed in this pull request? `REGEXP_REPLACE` could replace all substrings of string that match regexp with replacement string. But `REGEXP_REPLACE` lost some flexibility. such as: converts camel case strings to a string containing lower case words separated by an underscore: AddressLine1 -> address_line_1 If we support the parameter position, we can do like this(e.g. Oracle): ``` WITH strings as ( SELECT 'AddressLine1' s FROM dual union all SELECT 'ZipCode' s FROM dual union all SELECT 'Country' s FROM dual ) SELECT s "STRING", lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING" FROM strings; ``` The output: ``` STRING MODIFIED_STRING -------------------- -------------------- AddressLine1 address_line_1 ZipCode zip_code Country country ``` There are some mainstream database support the syntax. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490 Vertica https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace Redshift https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html ### Why are the changes needed? The parameter position for `REGEXP_REPLACE` is very useful. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? Jenkins test. Closes #29891 from beliefer/add-position-for-regex_replace. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 07:59:49 +00:00
Chao Sun	cb3fa6c936	[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile ### What changes were proposed in this pull request? This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client. In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties: ``` hadoop-client-api.artifact hadoop-client-runtime.artifact hadoop-client-minicluster.artifact ``` which default to: ``` hadoop-client-api hadoop-client-runtime hadoop-client-minicluster ``` but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`. Besides the above, there are the following changes: - explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars. - removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API. - modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests). ### Why are the changes needed? This serves two purposes: - to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach of switching to shaded client jars provided by Hadoop. - avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts. ### Does this PR introduce _any_ user-facing change? 
When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts. ### How was this patch tested? Relying on existing tests. Closes #29843 from sunchao/SPARK-29250. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-10-22 03:21:34 +00:00
Max Gekk	ba13b94f6b	[SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to `EXCEPTION` by default ### What changes were proposed in this pull request? 1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`. 2. Update the SQL migration guide. ### Why are the changes needed? Current default value `LEGACY` may lead to shifting timestamps in read or in write. We should leave the decision about rebasing to users. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By existing test suites like `ParquetIOSuite`. Closes #30121 from MaxGekk/int96-exception-by-default. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-22 03:04:29 +00:00
Max Gekk	bbf2d6f6df	[SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing ### What changes were proposed in this pull request? 1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata. 2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| ### Why are the changes needed? To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By updating benchmark results: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` Closes #30118 from MaxGekk/int96-rebase-benchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-22 10:03:41 +09:00
Gabor Somogyi	fbb6843620	[SPARK-32229][SQL] Fix PostgresConnectionProvider and MSSQLConnectionProvider by accessing wrapped driver ### What changes were proposed in this pull request? Postgres and MSSQL connection providers are not able to get custom `appEntry` because under some circumstances the driver is wrapped with `DriverWrapper`. Such case is not handled in the mentioned providers. In this PR I've added this edge case handling by passing unwrapped `Driver` from `JdbcUtils`. ### Why are the changes needed? `DriverWrapper` is not considered. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing + additional unit tests. Closes #30024 from gaborgsomogyi/SPARK-32229. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-20 15:14:38 +09:00
Max Gekk	a44e008de3	[SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing ### What changes were proposed in this pull request? 1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamps rebasing in saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0. 2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` isn't set to `LEGACY`. 3. Add the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` to control loading INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about parquet writer - either INT96 was written by Proleptic Gregorian system or some Julian one. 4. Modified Vectorized and Parquet-mr Readers to support loading/saving INT96 timestamps w/o rebasing depending on SQL config and the metadata tag: - No rebasing in testing when the SQL config `spark.test.forceNoRebase` is set to `true` - No rebasing if parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`. - With rebasing if parquet files saved by Spark (any versions) without the metadata tag `org.apache.spark.int96NoRebase`. - Rebasing depend on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`. 
New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` for the following reasons: - To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED. - To have different modes for INT96 and DATE in load (or in save). - To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default. ### Why are the changes needed? 1. Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar. So, rebasing from Proleptic Gregorian to Julian calendar may not be needed. 2. Rebasing from/to Julian calendar can lose information because dates in one calendar don't exist in another one. Like 1582-10-04..1582-10-15 exist in Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, Julian date 1000-02-29 doesn't exist in Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date). 3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps. ### Does this PR introduce _any_ user-facing change? It can, when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a value other than the default `LEGACY`. ### How was this patch tested? - Added a test to check the metadata key `org.apache.spark.int96NoRebase` - By `ParquetIOSuite` Closes #30056 from MaxGekk/parquet-rebase-int96. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-20 14:58:59 +09:00
Nan Zhu	35133901f7	[SPARK-32351][SQL] Show partially pushed down partition filters in explain() ### What changes were proposed in this pull request? Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so the filters are mainly used for explain output in this case. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters with PruneFileSourcePartitions and then PartitionFilters is not shown in explain output. This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c again, then It will show the partially pushed down partition filter in explain(). ### Why are the changes needed? 
Without the patch, the explained plan is inconsistent with what is actually executed. <b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and hive tables look like the following, respectively (missing pushed down partition filters): ``` == Physical Plan == *(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1))) +- *(1) ColumnarToRow +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int> ``` ``` == Physical Plan == *(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1))) +- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]] ``` <b>With the change</b>, the plans look like the following (the actually executed partition filters are shown): ``` == Physical Plan == *(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1))) +- *(1) ColumnarToRow +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int> ``` ``` == Physical Plan == *(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1))) +- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #29831 from CodingCat/SPARK-32351. 
Lead-authored-by: Nan Zhu <nanzhu@uber.com> Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-20 11:13:16 +09:00
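A minimal sketch of the idea behind extracting a partition-only filter from a predicate such as `p = '1' OR (p = '2' AND i = 1)`. It uses a toy predicate tree, not Spark's `Expression` classes; `PartitionFilterSketch`, `Pred`, `Eq` and `extractWithin` are hypothetical names for illustration, and the rules shown (an AND keeps whichever side can be extracted, an OR needs both sides) are an assumption about how such an extraction can work rather than a copy of `extractPredicatesWithinOutputSet`.

```scala
object PartitionFilterSketch {
  // Toy predicate tree (hypothetical); Spark's real Expression hierarchy is far richer.
  sealed trait Pred
  final case class Eq(column: String, value: String) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Returns a predicate implied by `p` that references only `allowed` columns, if any.
  def extractWithin(p: Pred, allowed: Set[String]): Option[Pred] = p match {
    case e @ Eq(c, _) => if (allowed(c)) Some(e) else None
    case And(l, r) =>
      // Dropping one conjunct only weakens the predicate, so a partial result is safe.
      (extractWithin(l, allowed), extractWithin(r, allowed)) match {
        case (Some(a), Some(b)) => Some(And(a, b))
        case (Some(a), None)    => Some(a)
        case (None, Some(b))    => Some(b)
        case (None, None)       => None
      }
    case Or(l, r) =>
      // Both branches must contribute, otherwise rows from the missing branch would be lost.
      for {
        a <- extractWithin(l, allowed)
        b <- extractWithin(r, allowed)
      } yield Or(a, b)
  }
}
```

With `allowed = Set("p")`, the query predicate from the commit message reduces to `(p = 1) OR (p = 2)`, which is the shape of the `PartitionFilters` entry shown in the "with change" plan.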
Liang-Chi Hsieh	66c5e01322	[SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase ### What changes were proposed in this pull request? This patch proposes to add more optimization to `UpdateFields` expression chain. And optimize `UpdateFields` early in analysis phase. ### Why are the changes needed? `UpdateFields` can manipulate complex nested data, but using `UpdateFields` can easily create inefficient expression chain. We should optimize it further. Because when manipulating deeply nested schema, the `UpdateFields` expression tree could be too complex to analyze, this change optimizes `UpdateFields` early in analysis phase. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #29812 from viirya/SPARK-32941. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-19 10:35:34 -07:00
Max Gekk	26b13c70c3	[SPARK-33169][SQL][TESTS] Check propagation of datasource options to underlying file system for built-in file-based datasources ### What changes were proposed in this pull request? 1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources. 2. Add a test to `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs. 3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`. 4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`. ### Why are the changes needed? To improve test coverage and test all built-in file-based datasources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites. Closes #30067 from MaxGekk/ds-options-common-test. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-19 17:47:49 +09:00
angerszhu	f8277d3aa3	[SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory ### What changes were proposed in this pull request? Improve error message on reading unexpected directory ### Why are the changes needed? Improve error message on reading unexpected directory ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT. Closes #30027 from AngersZhuuuu/SPARK-32069. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-18 19:02:21 -07:00
tanel.kiis@gmail.com	ce498943d2	[SPARK-33177][SQL] CollectList and CollectSet should not be nullable ### What changes were proposed in this pull request? Mark `CollectList` and `CollectSet` as non-nullable. ### Why are the changes needed? `CollectList` and `CollectSet` SQL expressions never return null value. Marking them as non-nullable can have some performance benefits, because some optimizer rules apply only to non-nullable expressions ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Did not find any existing tests on the nullability of aggregate functions. Closes #30087 from tanelk/SPARK-33177_collect. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-19 09:50:59 +09:00
Liang-Chi Hsieh	3010e9044e	[SPARK-33170][SQL] Add SQL config to control fast-fail behavior in FileFormatWriter ### What changes were proposed in this pull request? This patch proposes to add a config to control the fast-fail behavior in FileFormatWriter and set it to false by default. ### Why are the changes needed? In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry. Following the latest discussion, it is important to be able to keep the original behavior, which is to retry tasks even if `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases. We are going to add a config to control this behavior and set it to false for fast-fail by default. ### Does this PR introduce _any_ user-facing change? Yes. By default the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. Users can control the fast-fail behavior by enabling it. ### How was this patch tested? Unit test. Closes #30073 from viirya/SPARK-33170. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-17 21:02:25 -07:00
Liang-Chi Hsieh	2c4599db4b	[MINOR][SS][DOCS] Update Structured Streaming guide doc and update code typo ### What changes were proposed in this pull request? This is a minor change to update structured-streaming-programming-guide and typos in code. ### Why are the changes needed? Keep the user-facing document correct and updated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #30074 from viirya/ss-minor. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-16 22:18:12 -07:00
Liang-Chi Hsieh	e574fcd230	[SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns ### What changes were proposed in this pull request? SPARK-29358 added support for `unionByName` to work when the two datasets didn't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds the support to work with struct columns. The behavior before this PR: ```scala scala> val df1 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2, 'a', id + 3) c1") scala> val df2 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2) c1") scala> df1.unionByName(df2, true).printSchema org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c:bigint,b:bigint> <> struct<c:bigint,b:bigint,a:bigint> at the second column of the second table;; 'Union false, false :- Project [id#0L AS c0#2L, named_struct(c, (id#0L + cast(1 as bigint)), b, (id#0L + cast(2 as bigint)), a, (id#0L + cast(3 as bigint))) AS c1#3] : +- Range (0, 1, step=1, splits=Some(12)) +- Project [c0#8L, c1#9] +- Project [id#6L AS c0#8L, named_struct(c, (id#6L + cast(1 as bigint)), b, (id#6L + cast(2 as bigint))) AS c1#9] +- Range (0, 1, step=1, splits=Some(12)) ``` The behavior after this PR: ```scala scala> df1.unionByName(df2, true).printSchema root \|-- c0: long (nullable = false) \|-- c1: struct (nullable = false) \| \|-- a: long (nullable = true) \| \|-- b: long (nullable = false) \| \|-- c: long (nullable = false) scala> df1.unionByName(df2, true).show() +---+-------------+ \| c0\| c1\| +---+-------------+ \| 0\| {3, 2, 1}\| \| 0\|{ null, 2, 1}\| +---+-------------+ ``` ### Why are the changes needed? The `allowMissingColumns` of `unionByName` is a feature allowing merging different schema from two datasets when unioning them together. Nested column support makes the feature more general and flexible for usage. 
### Does this PR introduce _any_ user-facing change? Yes, after this change users can union two datasets with different schema with different structs. ### How was this patch tested? Unit tests. Closes #29587 from viirya/SPARK-32376. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-10-16 14:48:14 -07:00
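A rough sketch of the null-filling idea for struct columns, using plain Scala maps in place of Spark rows. `StructFillSketch` and `alignFields` are hypothetical names. Sorting the merged field names is inferred from the sample output above (`a`, `b`, `c`) and is an assumption, not a statement about how the patch orders fields.

```scala
object StructFillSketch {
  // Each struct value is modelled as field name -> value (hypothetical representation).
  // Fields missing from a struct are filled with None, mirroring the null fill above.
  def alignFields(structs: Seq[Map[String, Long]]): Seq[Seq[(String, Option[Long])]] = {
    // Union of all field names; sorted to match the ordering seen in the sample output.
    val names = structs.flatMap(_.keys).distinct.sorted
    structs.map(s => names.map(n => n -> s.get(n)))
  }
}
```

Applied to the two structs from the example (`c=1, b=2, a=3` and `c=1, b=2`), the second result carries `None` for `a`, corresponding to the `{ null, 2, 1}` row.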
Max Gekk	acb79f52db	[MINOR][SQL] Re-use `binaryToSQLTimestamp()` in `ParquetRowConverter` ### What changes were proposed in this pull request? The function `binaryToSQLTimestamp()` is used by Parquet Vectorized reader. Parquet MR reader has similar code for de-serialization of INT96 timestamps. In this PR, I propose to de-duplicate code and re-use `binaryToSQLTimestamp()`. ### Why are the changes needed? This should improve maintenance, and should allow to avoid errors while changing Vectorized and regular parquet readers. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites, for instance `ParquetIOSuite`. Closes #30069 from MaxGekk/int96-common-serde. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-16 14:27:27 -07:00
Dongjoon Hyun	ab0bad9544	[SPARK-33171][INFRA] Mark ParquetVFilterSuite/ParquetVSchemaPruningSuite as ExtendedSQLTest ### What changes were proposed in this pull request? This PR aims to mark ParquetV1FilterSuite and ParquetV2FilterSuite as `ExtendedSQLTest`. - ParquetV1FilterSuite/ParquetV2FilterSuite - ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite ### Why are the changes needed? Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job. BEFORE - https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes) AFTER - https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Github Action with the reduced time. Closes #30068 from dongjoon-hyun/MOVE3. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-16 12:52:45 -07:00
Kent Yao	2507301705	[SPARK-33159][SQL] Use hive-service-rpc as dependency instead of inlining the generated code ### What changes were proposed in this pull request? Hive's `hive-service-rpc` module has existed since hive-2.1.0 and it contains only the thrift IDL file and the code generated by it. Removing the inlined code will help maintain and upgrade builtin hive versions. ### Why are the changes needed? To simplify the code. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing CI Closes #30055 from yaooqinn/SPARK-33159. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-10-16 09:37:54 -07:00
neko	e029e891ab	[SPARK-33145][WEBUI] Fix when `Succeeded Jobs` has many child url elements,they will extend over the edge of the page ### What changes were proposed in this pull request? In the Execution web page, when `Succeeded Jobs` (or `Failed Jobs`) has many child url elements, they will extend over the edge of the page. ### Why are the changes needed? To make the page more friendly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test result shows as below: ![fixed](https://user-images.githubusercontent.com/52202080/95977319-50734600-0e4b-11eb-93c0-b8deb565bcd8.png) Closes #30035 from akiyamaneko/sql_execution_job_overflow. Authored-by: neko <echohlne@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-10-16 23:13:22 +08:00
ulysses	3ae1520185	[SPARK-33131][SQL] Fix grouping sets with having clause can not resolve qualified col name ### What changes were proposed in this pull request? Correct the resolution of having clause. ### Why are the changes needed? Grouping sets construct a new aggregate that loses the qualified name of the grouping expression. Here is an example: ``` -- Works resolved by `ResolveReferences` select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1 -- Works because of the extra expression c1 select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1 -- Failed select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1 ``` It works with `Aggregate` without grouping sets through `ResolveReferences`, but grouping sets do not work since the exprId has been changed. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? add test. Closes #30029 from ulysses-you/SPARK-33131. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-16 11:26:27 +00:00
gengjiaan	b69e0651fe	[SPARK-33126][SQL] Simplify offset window function(Remove direction field) ### What changes were proposed in this pull request? The current `Lead`/`Lag` extends `OffsetWindowFunction`. `OffsetWindowFunction` contains the field `direction` and uses `direction` to calculate the `boundary`. We can use a single literal expression to unify the two properties. For example: 3 means `direction` is Asc and `boundary` is 3. -3 means `direction` is Desc and `boundary` is -3. ### Why are the changes needed? Improve the current implementation of `Lead`/`Lag`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #30023 from beliefer/SPARK-33126. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-16 11:11:57 +00:00
xuewei.linxuewei	306872eefa	[SPARK-33139][SQL] protect setActionSession and clearActiveSession ### What changes were proposed in this pull request? This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession Context by calling setActiveSession and clearActiveSession. Changes in this PR: * add legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs. * by default, if users call these two APIs, an exception is thrown * add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage * change all internal references to the new internal APIs except for SQLContext.setActive and SQLContext.clearActive ### Why are the changes needed? Make SQLConf.get reliable and stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? * Add UT in SparkSessionBuilderSuite to test the legacy config * Existing test Closes #30042 from leanken/leanken-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-16 06:05:17 +00:00
Takeshi Yamamuro	a5c17de241	[SPARK-33165][SQL][TEST] Remove dependencies(scalatest,scalactic) from Benchmark ### What changes were proposed in this pull request? This PR proposes to remove `assert` from `Benchmark` for making it easier to run benchmark codes via `spark-submit`. ### Why are the changes needed? Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we need to pass the proper jars of `scalatest` and `scalactic`; - scalatest-core_2.12-3.2.0.jar - scalatest-compatible-3.2.0.jar - scalactic_2.12-3.0.jar ``` ./bin/spark-submit --jars scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1 ``` This update can make developers submit benchmark codes without these dependencies; ``` ./bin/spark-submit --jars ./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked. Closes #30064 from maropu/RemoveDepInBenchmark. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-16 11:39:09 +09:00
Huaxin Gao	bf594a9788	[SPARK-32402][SQL][FOLLOW-UP] Add case sensitivity tests for column resolution in ALTER TABLE ### What changes were proposed in this pull request? Add case sensitivity tests for column resolution in ALTER TABLE ### Why are the changes needed? To make sure `spark.sql.caseSensitive` works for `ResolveAlterTableChanges` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test Closes #30063 from huaxingao/caseSensitivity. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-16 11:04:35 +09:00
Max Gekk	38c05af1d5	[SPARK-33163][SQL][TESTS] Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files ### What changes were proposed in this pull request? Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs: - spark.sql.legacy.avro.datetimeRebaseModeInWrite - spark.sql.legacy.parquet.datetimeRebaseModeInWrite This is a follow up https://github.com/apache/spark/pull/28137. ### Why are the changes needed? 1. To improve test coverage 2. To make sure that the metadata key is actually saved to Avro/Parquet files ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the added tests: ``` $ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite" ``` Closes #30061 from MaxGekk/parquet-test-metakey. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-16 10:28:15 +09:00
Denis Pyshev	ba69d68d91	[SPARK-33080][BUILD] Replace fatal warnings snippet ### What changes were proposed in this pull request? Current solution in build file to enable build failure on compilation warnings with exclusion of deprecation ones is not portable after SBT version 1.3.13 (build import fails with compilation error with SBT 1.4) and could be replaced with more robust and maintainable, especially since Scala 2.13.2 with similar built-in functionality. Additionally, warnings were fixed to pass the build, with as few changes as possible: warnings in 2.12 compilation fixed in code, warnings in 2.13 compilation covered by configuration to be addressed separately ### Why are the changes needed? Unblocks upgrade to SBT after 1.3.13. Enhances build file maintainability. Allows fine tune of warnings configuration in scope of Scala 2.13 compilation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `build/sbt`'s `compile` and `Test/compile` for both Scala 2.12 and 2.13 profiles. Closes #29995 from gemelen/feature/warnings-reporter. Authored-by: Denis Pyshev <git@gemelen.net> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-15 14:49:43 -05:00
Liang-Chi Hsieh	9e3746469c	[SPARK-33078][SQL] Add config for json expression optimization ### What changes were proposed in this pull request? This proposes to add a config for json expression optimization. ### Why are the changes needed? For the new Json expression optimization rules, it is safer if we can disable it using SQL config. ### Does this PR introduce _any_ user-facing change? Yes, users can disable json expression optimization rule. ### How was this patch tested? Unit test Closes #30047 from viirya/SPARK-33078. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-10-15 12:38:10 -07:00
Huaxin Gao	31f7097ce0	[SPARK-32402][SQL][FOLLOW-UP] Use quoted column name for JDBCTableCatalog.alterTable ### What changes were proposed in this pull request? I currently have unquoted column names in alter table, e.g. ```ALTER TABLE "test"."alt_table" DROP COLUMN c1``` should change to quoted column name ```ALTER TABLE "test"."alt_table" DROP COLUMN "c1"``` ### Why are the changes needed? We should always use quoted identifiers in JDBC SQLs, e.g. ```CREATE TABLE "test"."abc" ("col" INTEGER ) ``` or ```INSERT INTO "test"."abc" ("col") VALUES (?)```. Using unquoted column name in alterTable causes problems, for example: ``` sql("CREATE TABLE h2.test.alt_table (c1 INTEGER, c2 INTEGER) USING _") sql("ALTER TABLE h2.test.alt_table DROP COLUMN c1") org.apache.spark.sql.AnalysisException: Failed table altering: test.alt_table; ...... Caused by: org.h2.jdbc.JdbcSQLException: Column "C1" not found; SQL statement: ALTER TABLE "test"."alt_table" DROP COLUMN c1 [42122-195] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #30041 from huaxingao/alter_table_followup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-15 15:33:23 +00:00
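A small sketch of why quoting matters when generating the statement above. `QuoteSketch`, `quote` and `dropColumnSql` are hypothetical helpers written for illustration, not the `JDBCTableCatalog` or dialect code. Doubling embedded double quotes is standard SQL escaping added here as an assumption; the commit message does not discuss it.

```scala
object QuoteSketch {
  // Hypothetical helper: wraps an identifier in double quotes so the database
  // keeps its case (unquoted c1 is looked up as C1 in the H2 error above).
  // Embedded double quotes are doubled, which is an assumption of this sketch.
  def quote(identifier: String): String =
    "\"" + identifier.replace("\"", "\"\"") + "\""

  // Builds the DROP COLUMN statement with every identifier quoted.
  def dropColumnSql(schema: String, table: String, column: String): String =
    s"ALTER TABLE ${quote(schema)}.${quote(table)} DROP COLUMN ${quote(column)}"
}
```

Calling `QuoteSketch.dropColumnSql("test", "alt_table", "c1")` yields the quoted form given in the commit message, `ALTER TABLE "test"."alt_table" DROP COLUMN "c1"`.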
manuzhang	77a8efbc05	[SPARK-32932][SQL] Do not use local shuffle reader at final stage on write command ### What changes were proposed in this pull request? Do not use local shuffle reader at final stage if the root node is write command. ### Why are the changes needed? Users usually repartition with partition column on dynamic partition overwrite. AQE could break it by removing physical shuffle with local shuffle reader. That could lead to a large number of output files, even exceeding the file system limit. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #29797 from manuzhang/spark-32932. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-15 05:53:32 +00:00
Dongjoon Hyun	ec34a001ad	[SPARK-33153][SQL][TESTS] Ignore Spark 2.4 in HiveExternalCatalogVersionsSuite on Python 3.8/3.9 ### What changes were proposed in this pull request? This PR aims to ignore Apache Spark 2.4.x distribution in HiveExternalCatalogVersionsSuite if Python version is 3.8 or 3.9. ### Why are the changes needed? Currently, `HiveExternalCatalogVersionsSuite` is broken on the latest OS like `Ubuntu 20.04` because its default Python version is 3.8. PySpark 2.4.x doesn't work on Python 3.8 due to SPARK-29536. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ``` $ python3 --version Python 3.8.5 $ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite" ... [info] All tests passed. [info] Passed: Total 1, Failed 0, Errors 0, Passed 1 ``` Closes #30044 from dongjoon-hyun/SPARK-33153. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-14 20:48:13 -07:00
Wenchen Fan	f3ad32f4b6	[SPARK-33026][SQL][FOLLOWUP] metrics name should be numOutputRows ### What changes were proposed in this pull request? Follow the convention and rename the metrics `numRows` to `numOutputRows` ### Why are the changes needed? `FilterExec`, `HashAggregateExec`, etc. all use `numOutputRows` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #30039 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-14 16:17:28 +00:00
Jungtaek Lim (HeartSaVioR)	8e5cb1d276	[SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved ### What changes were proposed in this pull request? This PR proposes to fix a bug on calling `DataType.equalsIgnoreCompatibleNullability` with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The order of parameters for `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, which says that the right order of matching variables are `inAttr` and `outAttr`. ### Why are the changes needed? Spark throws AnalysisException due to unresolved operator in v2 write, while the operator is unresolved due to a bug that parameters to call `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped. ### Does this PR introduce _any_ user-facing change? Yes, end users no longer suffer on unresolved operator in v2 write if they're trying to write dataframe containing non-nullable complex types against table matching complex types as nullable. ### How was this patch tested? New UT added. Closes #30033 from HeartSaVioR/SPARK-33136. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-14 08:30:03 -07:00
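A simplified sketch of why argument order matters in a `from`/`to` nullability check. It models only a single nullable flag per column; `NullabilitySketch`, `Col` and `compatible` are hypothetical names, and the real `DataType.equalsIgnoreCompatibleNullability` also recurses into nested types, which is omitted here.

```scala
object NullabilitySketch {
  // Hypothetical stand-in for an attribute: just a name and a nullable flag.
  final case class Col(name: String, nullable: Boolean)

  // Writing `from` into `to` is acceptable unless `from` may contain nulls
  // while `to` forbids them.
  def compatible(from: Col, to: Col): Boolean = to.nullable || !from.nullable
}
```

With a non-nullable input and a nullable table column, `compatible(in, out)` holds but `compatible(out, in)` does not, so swapping the parameters rejects a write that should be allowed.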
Richard Penney	d8c4a47ea1	[SPARK-33061][SQL] Expose inverse hyperbolic trig functions through sql.functions API This patch is a small extension to change-request SPARK-28133, which added inverse hyperbolic functions to the SQL interpreter, but did not include those methods within the Scala `sql.functions._` API. This patch makes `acosh`, `asinh` and `atanh` functions available through the Scala API. Unit-tests have been added to `sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala`. Manual testing has been done via `spark-shell`, using the following recipe: ``` val df = spark.range(0, 11) .toDF("x") .withColumn("x", ($"x" - 5) / 2.0) val hyps = df.withColumn("tanh", tanh($"x")) .withColumn("sinh", sinh($"x")) .withColumn("cosh", cosh($"x")) val invhyps = hyps.withColumn("atanh", atanh($"tanh")) .withColumn("asinh", asinh($"sinh")) .withColumn("acosh", acosh($"cosh")) invhyps.show ``` which produces the following output: ``` +----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+ \| x\| tanh\| sinh\| cosh\| atanh\| asinh\| acosh\| +----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+ \|-2.5\| -0.9866142981514303\|-6.0502044810397875\| 6.132289479663686\| -2.500000000000001\|-2.4999999999999956\| 2.5\| \|-2.0\| -0.9640275800758169\| -3.626860407847019\|3.7621956910836314\|-2.0000000000000004\|-1.9999999999999991\| 2.0\| \|-1.5\| -0.9051482536448664\|-2.1292794550948173\| 2.352409615243247\|-1.4999999999999998\|-1.4999999999999998\| 1.5\| \|-1.0\| -0.7615941559557649\|-1.1752011936438014\| 1.543080634815244\| -1.0\| -1.0\| 1.0\| \|-0.5\|-0.46211715726000974\|-0.5210953054937474\|1.1276259652063807\| -0.5\|-0.5000000000000002\|0.4999999999999998\| \| 0.0\| 0.0\| 0.0\| 1.0\| 0.0\| 0.0\| 0.0\| \| 0.5\| 0.46211715726000974\| 0.5210953054937474\|1.1276259652063807\| 0.5\| 0.5\|0.4999999999999998\| \| 1.0\| 
0.7615941559557649\| 1.1752011936438014\| 1.543080634815244\| 1.0\| 1.0\| 1.0\| \| 1.5\| 0.9051482536448664\| 2.1292794550948173\| 2.352409615243247\| 1.4999999999999998\| 1.5\| 1.5\| \| 2.0\| 0.9640275800758169\| 3.626860407847019\|3.7621956910836314\| 2.0000000000000004\| 2.0\| 2.0\| \| 2.5\| 0.9866142981514303\| 6.0502044810397875\| 6.132289479663686\| 2.500000000000001\| 2.5\| 2.5\| +----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+ ``` Closes #29938 from rwpenney/fix/inverse-hyperbolics. Authored-by: Richard Penney <rwp@rwpenney.uk> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-14 08:48:55 -05:00
Max Gekk	05a62dcada	[SPARK-33134][SQL] Return partial results only for root JSON objects ### What changes were proposed in this pull request? In the PR, I propose to restrict the partial result feature only by root JSON objects. JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects. ### Why are the changes needed? 1. To not raise exception to users in the PERMISSIVE mode 2. To fix a regression and to have the same behavior as Spark 2.4.x has 3. Current implementation of partial result is supposed to work only for root (top-level) JSON objects, and not tested for bad nested complex JSON fields. ### Does this PR introduce _any_ user-facing change? Yes. Before the changes, the code below: ```scala val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events") val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType))) val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event")) pokerhand_events.show ``` throws the exception even in the default PERMISSIVE mode: ```java java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195) ``` After the changes: ``` +-----+ \|event\| +-----+ \| null\| +-----+ ``` ### How was this patch tested? Added a test to `JsonFunctionsSuite`. Closes #30031 from MaxGekk/json-skip-row-wrong-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-14 12:13:54 +09:00
Prashant Sharma	304ca1ec93	[SPARK-33129][BUILD][DOCS] Updating the build/sbt references to test-only with testOnly for SBT 1.3.x ### What changes were proposed in this pull request? Replace `test-only` with `testOnly` in docs across the project. ### Why are the changes needed? Since the sbt version is updated, the older way of running, i.e. `test-only`, is no longer valid. ### Does this PR introduce _any_ user-facing change? docs update. ### How was this patch tested? Manually. Closes #30028 from ScrapCodes/fix-build/sbt-sample. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-13 09:21:06 -07:00
xuewei.linxuewei	dc697a8b59	[SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero ### What changes were proposed in this pull request? As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas SparkSQL returns Double.NaN, which causes the wrong result. Add an extra legacy config to fall back to the NaN logic, and return null by default to align with the TPCDS standard. ### Why are the changes needed? SQL correctness issue. ### Does this PR introduce any user-facing change? Yes. See sql-migration-guide In Spark 3.1, statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp` and `corr`, will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` is applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such a case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`. ### How was this patch tested? Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both default and legacy behavior. Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to the default return-null behavior. Closes #29983 from leanken/leanken-SPARK-13860. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-13 13:21:45 +00:00
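A plain-Scala sketch of the behavior change for `stddev_samp`, with `None` standing in for SQL `NULL`. `StddevSketch` and `stddevSamp` are hypothetical names, and the `legacy` parameter only mimics the effect described for `spark.sql.legacy.statisticalAggregate`; this is not Spark's aggregate implementation.

```scala
object StddevSketch {
  // Sample standard deviation; None models SQL NULL.
  // `legacy = true` mimics the pre-3.1 result for a single-element input.
  def stddevSamp(xs: Seq[Double], legacy: Boolean = false): Option[Double] = {
    val n = xs.size
    if (n == 0) None
    else if (n == 1) {
      // The divisor n - 1 is zero here: new default is NULL, legacy is NaN.
      if (legacy) Some(Double.NaN) else None
    } else {
      val mean = xs.sum / n
      val sumSquares = xs.map(x => (x - mean) * (x - mean)).sum
      Some(math.sqrt(sumSquares / (n - 1)))
    }
  }
}
```

For a single value the sketch yields `None` by default and `Some(NaN)` with `legacy = true`, matching the two behaviors described in the migration note.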
gengjiaan	2b7239edfb	[SPARK-33125][SQL] Improve the error when Lead and Lag are not allowed to specify window frame ### What changes were proposed in this pull request? Except for Postgresql, other data sources (for example: vertica, oracle, redshift, mysql, presto) are not allowed to specify window frame for the Lead and Lag functions. But the current error message is not clear enough. `Window Frame $f must match the required frame` This PR will use the following error message. `Cannot specify window frame for lead function` ### Why are the changes needed? Make clear error message. ### Does this PR introduce _any_ user-facing change? Yes Users will see the clearer error message. ### How was this patch tested? Jenkins test. Closes #30021 from beliefer/SPARK-33125. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-13 13:12:17 +00:00
Huaxin Gao	af3e2f7d58	[SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) ### What changes were proposed in this pull request? - Override the default SQL strings in the DB2 Dialect for: * ALTER TABLE UPDATE COLUMN TYPE * ALTER TABLE UPDATE COLUMN NULLABILITY - Add new docker integration test suite jdbc/v2/DB2IntegrationSuite.scala ### Why are the changes needed? In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some ALTER TABLE at the moment. This PR supports DB2 specific ALTER TABLE. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running new integration test suite: $ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite" Closes #29972 from huaxingao/db2_docker. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-13 12:57:54 +00:00
Chao Sun	feee8da14b	[SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types ### What changes were proposed in this pull request? In SPARK-24994 we implemented unwrapping cast for integral types. This extends it to support numeric types such as float/double/decimal, so that filters involving these types can be better pushed down to data sources. Unlike the case of integral types, conversions between numeric types can result in rounding up or down. Consider the following case: ```sql cast(e as double) < 1.9 ``` Assume the type of `e` is short. Since 1.9 is not representable in that type, the cast will either truncate or round. Now suppose the literal is truncated; we cannot convert the expression to: ```sql e < cast(1.9 as short) ``` as in the previous implementation, since if `e` is 1, the original expression evaluates to true, but the converted expression evaluates to false. To resolve the above, this PR first finds out whether casting from the wider type to the narrower type results in truncation or rounding, by comparing a _roundtrip value_, derived from converting the literal first to the narrower type and then back to the wider type, against the original literal value. For instance, in the above, we first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9. Now in the case of truncation, we convert the original expression to: ```sql e <= cast(1.9 as short) ``` instead, so that the conversion is also valid when `e` is 1. For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto, which offers a very good explanation of how it works. ### Why are the changes needed? 
For queries such as: ```sql SELECT * FROM tbl WHERE short_col < 100.5 ``` the predicate `short_col < 100.5` can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #29792 from sunchao/SPARK-32858. Lead-authored-by: Chao Sun <sunchao@apple.com> Co-authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-13 12:44:20 +00:00
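The roundtrip comparison described in the commit above can be sketched in plain Scala for the single case `cast(e as double) < lit` with `e` of type short. This is an illustrative sketch only, not the optimizer rule itself: `UnwrapCastSketch` and `unwrapLessThan` are hypothetical names, and the literal is assumed to lie within the range of `Short`.

```scala
// Hypothetical sketch, not Spark's UnwrapCastInBinaryComparison rule.
object UnwrapCastSketch {
  // Rewrites `cast(e as double) < lit` into `e <op> shortLiteral`.
  // Assumption: `lit` lies within the range of Short.
  def unwrapLessThan(lit: Double): (String, Short) = {
    val narrowed = lit.toShort         // e.g. (double) 1.9 -> (short) 1
    val roundtrip = narrowed.toDouble  // e.g. (short) 1 -> (double) 1.0
    if (roundtrip < lit) ("<=", narrowed) // literal was rounded down: use <=
    else ("<", narrowed)                  // exact, or rounded up: keep <
  }

  // Evaluates the rewritten predicate for a concrete short value.
  def evalRewritten(e: Short, lit: Double): Boolean = {
    val (op, bound) = unwrapLessThan(lit)
    if (op == "<=") e <= bound else e < bound
  }
}
```

For the literal 1.9 the sketch yields `e <= 1`, matching the example in the commit message; the rewritten predicate can be compared against `e.toDouble < lit` over a range of values to check that the two agree.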
tanel.kiis@gmail.com	17eebd7209	[SPARK-32295][SQL] Add not null and size > 0 filters before inner explode/inline to benefit from predicate pushdown ### What changes were proposed in this pull request? Add `And(IsNotNull(e), GreaterThan(Size(e), Literal(0)))` filter before Explode, PosExplode and Inline, when `outer = false`. Removed unused `InferFiltersFromConstraints` from `operatorOptimizationRuleSet` to avoid confusion that happened during the review process. ### Why are the changes needed? Predicate pushdown will be able to move this new filter down through joins and into data sources for performance improvement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #29092 from tanelk/SPARK-32295. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-13 20:11:04 +09:00
Yuming Wang	e34f2d8df2	[SPARK-33119][SQL] ScalarSubquery should return the first two rows to avoid Driver OOM ### What changes were proposed in this pull request? `ScalarSubquery` should return the first two rows. ### Why are the changes needed? To avoid Driver OOM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test: `d6f3138352/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (L147-L154)` Closes #30016 from wangyum/SPARK-33119. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-13 17:41:55 +09:00
Pablo	819f12ee2f	[SPARK-33118][SQL] CREATE TEMPORARY TABLE fails with location ### What changes were proposed in this pull request? We have a problem when you use CREATE TEMPORARY TABLE with LOCATION ```scala spark.range(3).write.parquet("/tmp/testspark1") sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/testspark1')") sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/tmp/testspark1'") ``` ```scala org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616) at 
org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) ``` This bug was introduced by SPARK-30507. The call chain sparksqlparser --> visitCreateTable --> visitCreateTableClauses --> cleanTableOptions extracts the path from the options, but in this case CreateTempViewUsing needs the path in the options map. ### Why are the changes needed? To fix the problem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit testing and manual testing Closes #30014 from planga82/bugfix/SPARK-33118_create_temp_table_location. Authored-by: Pablo <pablo.langa@stratio.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-12 14:18:34 -07:00
xuewei.linxuewei	b27a287ff2	[SPARK-33016][SQL] Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on ### What changes were proposed in this pull request? In the following scenario with AQE on, SQLMetrics could be incorrect. 1. Stage A and B are created, and the UI is updated through event onAdaptiveExecutionUpdate. 2. Stage A and B are running. A subquery in stage A keeps updating metrics through event onAdaptiveSQLMetricUpdate. 3. Stage B completes while stage A's subquery is still running and updating metrics. 4. Completion of stage B triggers new stage creation and a UI update through event onAdaptiveExecutionUpdate again (just like step 1). So this PR makes the trade-off of keeping more duplicate SQLMetrics rather than deleting them when AQE updates with a new plan. ### Why are the changes needed? Make SQLMetrics behavior 100% correct. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated SQLAppStatusListenerSuite. Closes #29965 from leanken/leanken-SPARK-33016. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-12 14:48:40 +00:00
Takeshi Yamamuro	a0e324460e	[SPARK-32704][SQL][FOLLOWUP] Corrects version values of plan logging configs in SQLConf ### What changes were proposed in this pull request? This PR intends to correct version values (`3.0.0` -> `3.1.0`) of three configs below in `SQLConf`: - spark.sql.planChangeLog.level - spark.sql.planChangeLog.rules - spark.sql.planChangeLog.batches This PR comes from https://github.com/apache/spark/pull/29544#discussion_r503049350. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #30015 from maropu/pr29544-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-12 22:54:31 +09:00
Liang-Chi Hsieh	78c0967bbe	[SPARK-33092][SQL] Support subexpression elimination in ProjectExec ### What changes were proposed in this pull request? This patch proposes to add subexpression elimination support into `ProjectExec`. It can be controlled by `spark.sql.subexpressionElimination.enabled` config. Before this change: ```scala val df = spark.read.option("header", true).csv("/tmp/test.csv") df.withColumn("my_map", expr("str_to_map(foo, '&', '=')")).select(col("my_map")("foo"), col("my_map")("bar"), col("my_map")("baz")).debugCodegen ``` L27-40: first `str_to_map`. L68-81: second `str_to_map`. L109-122: third `str_to_map`. ``` /* 024 */ private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException { /* 025 */ boolean project_isNull_0 = true; /* 026 */ UTF8String project_value_0 = null; /* 027 */ boolean project_isNull_1 = true; /* 028 */ MapData project_value_1 = null; /* 029 */ /* 030 */ if (!project_exprIsNull_0_0) { /* 031 */ project_isNull_1 = false; // resultCode could change nullability. /* 032 */ /* 033 */ UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1); /* 034 */ for(UTF8String kvEntry: project_kvs_0) { /* 035 */ UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2); /* 036 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null); /* 037 */ } /* 038 */ project_value_1 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build(); /* 039 */ /* 040 */ } /* 041 */ if (!project_isNull_1) { /* 042 */ project_isNull_0 = false; // resultCode could change nullability. 
/* 043 */ /* 044 */ final int project_length_0 = project_value_1.numElements(); /* 045 */ final ArrayData project_keys_0 = project_value_1.keyArray(); /* 046 */ final ArrayData project_values_0 = project_value_1.valueArray(); /* 047 */ /* 048 */ int project_index_0 = 0; /* 049 */ boolean project_found_0 = false; /* 050 */ while (project_index_0 < project_length_0 && !project_found_0) { /* 051 */ final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0); /* 052 */ if (project_key_0.equals(((UTF8String) references[3] /* literal */))) { /* 053 */ project_found_0 = true; /* 054 */ } else { /* 055 */ project_index_0++; /* 056 */ } /* 057 */ } /* 058 */ /* 059 */ if (!project_found_0 || project_values_0.isNullAt(project_index_0)) { /* 060 */ project_isNull_0 = true; /* 061 */ } else { /* 062 */ project_value_0 = project_values_0.getUTF8String(project_index_0); /* 063 */ } /* 064 */ /* 065 */ } /* 066 */ boolean project_isNull_6 = true; /* 067 */ UTF8String project_value_6 = null; /* 068 */ boolean project_isNull_7 = true; /* 069 */ MapData project_value_7 = null; /* 070 */ /* 071 */ if (!project_exprIsNull_0_0) { /* 072 */ project_isNull_7 = false; // resultCode could change nullability. /* 073 */ /* 074 */ UTF8String[] project_kvs_1 = project_expr_0_0.split(((UTF8String) references[5] /* literal */), -1); /* 075 */ for(UTF8String kvEntry: project_kvs_1) { /* 076 */ UTF8String[] kv = kvEntry.split(((UTF8String) references[6] /* literal */), 2); /* 077 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null); /* 078 */ } /* 079 */ project_value_7 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).build(); /* 080 */ /* 081 */ } /* 082 */ if (!project_isNull_7) { /* 083 */ project_isNull_6 = false; // resultCode could change nullability. 
/* 084 */ /* 085 */ final int project_length_1 = project_value_7.numElements(); /* 086 */ final ArrayData project_keys_1 = project_value_7.keyArray(); /* 087 */ final ArrayData project_values_1 = project_value_7.valueArray(); /* 088 */ /* 089 */ int project_index_1 = 0; /* 090 */ boolean project_found_1 = false; /* 091 */ while (project_index_1 < project_length_1 && !project_found_1) { /* 092 */ final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1); /* 093 */ if (project_key_1.equals(((UTF8String) references[7] /* literal */))) { /* 094 */ project_found_1 = true; /* 095 */ } else { /* 096 */ project_index_1++; /* 097 */ } /* 098 */ } /* 099 */ /* 100 */ if (!project_found_1 || project_values_1.isNullAt(project_index_1)) { /* 101 */ project_isNull_6 = true; /* 102 */ } else { /* 103 */ project_value_6 = project_values_1.getUTF8String(project_index_1); /* 104 */ } /* 105 */ /* 106 */ } /* 107 */ boolean project_isNull_12 = true; /* 108 */ UTF8String project_value_12 = null; /* 109 */ boolean project_isNull_13 = true; /* 110 */ MapData project_value_13 = null; /* 111 */ /* 112 */ if (!project_exprIsNull_0_0) { /* 113 */ project_isNull_13 = false; // resultCode could change nullability. /* 114 */ /* 115 */ UTF8String[] project_kvs_2 = project_expr_0_0.split(((UTF8String) references[9] /* literal */), -1); /* 116 */ for(UTF8String kvEntry: project_kvs_2) { /* 117 */ UTF8String[] kv = kvEntry.split(((UTF8String) references[10] /* literal */), 2); /* 118 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null); /* 119 */ } /* 120 */ project_value_13 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).build(); /* 121 */ /* 122 */ } ... ``` After this change: L27-40 evaluates the common map variable. 
``` /* 024 */ private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException { /* 025 */ // common sub-expressions /* 026 */ /* 027 */ boolean project_isNull_0 = true; /* 028 */ MapData project_value_0 = null; /* 029 */ /* 030 */ if (!project_exprIsNull_0_0) { /* 031 */ project_isNull_0 = false; // resultCode could change nullability. /* 032 */ /* 033 */ UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1); /* 034 */ for(UTF8String kvEntry: project_kvs_0) { /* 035 */ UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2); /* 036 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null); /* 037 */ } /* 038 */ project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build(); /* 039 */ /* 040 */ } /* 041 */ /* 042 */ boolean project_isNull_4 = true; /* 043 */ UTF8String project_value_4 = null; /* 044 */ /* 045 */ if (!project_isNull_0) { /* 046 */ project_isNull_4 = false; // resultCode could change nullability. 
/* 047 */ /* 048 */ final int project_length_0 = project_value_0.numElements(); /* 049 */ final ArrayData project_keys_0 = project_value_0.keyArray(); /* 050 */ final ArrayData project_values_0 = project_value_0.valueArray(); /* 051 */ /* 052 */ int project_index_0 = 0; /* 053 */ boolean project_found_0 = false; /* 054 */ while (project_index_0 < project_length_0 && !project_found_0) { /* 055 */ final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0); /* 056 */ if (project_key_0.equals(((UTF8String) references[3] /* literal */))) { /* 057 */ project_found_0 = true; /* 058 */ } else { /* 059 */ project_index_0++; /* 060 */ } /* 061 */ } /* 062 */ /* 063 */ if (!project_found_0 || project_values_0.isNullAt(project_index_0)) { /* 064 */ project_isNull_4 = true; /* 065 */ } else { /* 066 */ project_value_4 = project_values_0.getUTF8String(project_index_0); /* 067 */ } /* 068 */ /* 069 */ } /* 070 */ boolean project_isNull_6 = true; /* 071 */ UTF8String project_value_6 = null; /* 072 */ /* 073 */ if (!project_isNull_0) { /* 074 */ project_isNull_6 = false; // resultCode could change nullability. 
/* 075 */ /* 076 */ final int project_length_1 = project_value_0.numElements(); /* 077 */ final ArrayData project_keys_1 = project_value_0.keyArray(); /* 078 */ final ArrayData project_values_1 = project_value_0.valueArray(); /* 079 */ /* 080 */ int project_index_1 = 0; /* 081 */ boolean project_found_1 = false; /* 082 */ while (project_index_1 < project_length_1 && !project_found_1) { /* 083 */ final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1); /* 084 */ if (project_key_1.equals(((UTF8String) references[4] /* literal */))) { /* 085 */ project_found_1 = true; /* 086 */ } else { /* 087 */ project_index_1++; /* 088 */ } /* 089 */ } /* 090 */ /* 091 */ if (!project_found_1 || project_values_1.isNullAt(project_index_1)) { /* 092 */ project_isNull_6 = true; /* 093 */ } else { /* 094 */ project_value_6 = project_values_1.getUTF8String(project_index_1); /* 095 */ } /* 096 */ /* 097 */ } /* 098 */ boolean project_isNull_8 = true; /* 099 */ UTF8String project_value_8 = null; /* 100 */ ... ``` When the code is split into a separate method: ``` /* 026 */ private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException { /* 027 */ // common sub-expressions /* 028 */ /* 029 */ MapData project_subExprValue_0 = project_subExpr_0(project_exprIsNull_0_0, project_expr_0_0); /* 030 */ ... /* 140 */ private MapData project_subExpr_0(boolean project_exprIsNull_0_0, org.apache.spark.unsafe.types.UTF8String project_expr_0_0) { /* 141 */ boolean project_isNull_0 = true; /* 142 */ MapData project_value_0 = null; /* 143 */ /* 144 */ if (!project_exprIsNull_0_0) { /* 145 */ project_isNull_0 = false; // resultCode could change nullability. 
/* 146 */ /* 147 */ UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1); /* 148 */ for(UTF8String kvEntry: project_kvs_0) { /* 149 */ UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2); /* 150 */ ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null); /* 151 */ } /* 152 */ project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build(); /* 153 */ /* 154 */ } /* 155 */ project_subExprIsNull_0 = project_isNull_0; /* 156 */ return project_value_0; /* 157 */ } ``` ### Why are the changes needed? Users occasionally write repeated expressions in a projection. It is also possible that the query optimizer rewrites a query so that the same expression is evaluated many times in a Project. Currently, ProjectExec does not support subexpression elimination in whole-stage codegen. Supporting it reduces redundant evaluation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `spark.sql.subexpressionElimination.enabled` is enabled by default, so all existing tests should pass with this change. Closes #29975 from viirya/SPARK-33092. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-12 16:54:21 +09:00
Yuming Wang	543d59dfbf	[SPARK-33107][BUILD][FOLLOW-UP] Remove com.twitter:parquet-hadoop-bundle:1.6.0 and orc.classifier ### What changes were proposed in this pull request? This pr removes `com.twitter:parquet-hadoop-bundle:1.6.0` and `orc.classifier`. ### Why are the changes needed? To make code more clear and readable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30005 from wangyum/SPARK-33107. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-11 21:54:56 -07:00
Gabor Somogyi	4af1ac9384	[SPARK-32047][SQL] Add JDBC connection provider disable possibility ### What changes were proposed in this pull request? At the moment there is no way to turn off JDBC authentication providers that exist on the classpath. This can be problematic because service providers are loaded with the service loader. This PR adds the `spark.sql.sources.disabledJdbcConnProviderList` configuration (default: empty). ### Why are the changes needed? There is no way to turn off JDBC authentication providers. ### Does this PR introduce _any_ user-facing change? Yes, it introduces a new configuration option. ### How was this patch tested? * Existing + newly added unit tests. * Existing integration tests. Closes #29964 from gaborgsomogyi/SPARK-32047. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-12 12:24:54 +09:00
Yuming Wang	5e170140b0	[SPARK-33107][SQL] Remove hive-2.3 workaround code ### What changes were proposed in this pull request? This PR removes the `hive-2.3` workaround code. ### Why are the changes needed? To make the code clearer and more readable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #29996 from wangyum/SPARK-33107. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-10 16:41:42 -07:00
Gabor Somogyi	1e63dcc8f0	[SPARK-33102][SQL] Use stringToSeq on SQL list typed parameters ### What changes were proposed in this pull request? While I was implementing the JDBC provider disable functionality, it came up [here](https://github.com/apache/spark/pull/29964#discussion_r501786746) that `Utils.stringToSeq` must be used when handling SQL parameters of string list type. This PR fixes the affected parameters. ### Why are the changes needed? `Utils.stringToSeq` must be used when handling SQL parameters of string list type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #29989 from gaborgsomogyi/SPARK-33102. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-10 13:53:09 +09:00
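As a rough illustration of why a shared helper matters for list-typed string parameters, the plain-Scala sketch below parses a comma-separated value by trimming each entry and dropping empty ones. It is a stand-in, not Spark code: `ListParamSketch` and `splitList` are hypothetical names, and the trim-and-drop-empty behavior is an assumption about what such a helper would reasonably do rather than a statement of what `Utils.stringToSeq` guarantees.

```scala
// Hypothetical stand-in for a list-parameter parser; not Spark's Utils.stringToSeq.
object ListParamSketch {
  // Splits on commas, trims whitespace, and drops empty entries.
  def splitList(value: String): Seq[String] =
    value.split(",").map(_.trim).filter(_.nonEmpty).toSeq
}
```

A naive `value.split(",")` would instead keep the surrounding whitespace and empty entries, so `" a, b ,,c "` would yield four elements rather than three.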
HyukjinKwon	2e07ed3041	[SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script ### What changes were proposed in this pull request? This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script. - `test-hive1.2` title is not used anymore in Jenkins - Remove some comments related to Hive 1.2 - Remove unused codes in `OrcFilters.scala` Hive - Test `spark.sql.hive.convertMetastoreOrc` disabled case for the tests added at SPARK-19809 and SPARK-22267 ### Why are the changes needed? To remove unused codes & improve test coverage ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually ran the unit tests. Also It will be tested in CI in this PR. Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 03:04:26 -07:00
Jungtaek Lim (HeartSaVioR)	edb140eb5c	[SPARK-32896][SS] Add DataStreamWriter.table API ### What changes were proposed in this pull request? This PR proposes to add `DataStreamWriter.table` to specify the output "table" to write from the streaming query. ### Why are the changes needed? For now, there is no way to write to a table (especially a catalog table) even if the table is capable of handling streaming writes. So even with Spark 3, writing to a catalog table via SS has to go through `DataStreamWriter.format(provider)` and hope the provider handles it the same way as a catalog table. With the new API, we can directly point to a catalog table which supports streaming write. Some usages are covered by tests. Simply put, end users can do the following: ```scala // assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo") val query = inputDF .writeStream .table("testcat.ns.table1") .option(...) .start() ``` ### Does this PR introduce _any_ user-facing change? Yes, as this adds a new public API in DataStreamWriter. It does not introduce a backward-incompatible change. ### How was this patch tested? New unit tests. Closes #29767 from HeartSaVioR/SPARK-32896. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 03:01:54 -07:00
ulysses	a9077299d7	[SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString ### What changes were proposed in this pull request? Add distinct info at `UnresolvedFunction.toString`. ### Why are the changes needed? Make `UnresolvedFunction` info complete. ``` create table test (c1 int, c2 int); explain extended select sum(distinct c1) from test; -- before this pr == Parsed Logical Plan == 'Project [unresolvedalias('sum('c1), None)] +- 'UnresolvedRelation [test] -- after this pr == Parsed Logical Plan == 'Project [unresolvedalias('sum(distinct 'c1), None)] +- 'UnresolvedRelation [test] ``` ### Does this PR introduce _any_ user-facing change? Yes, get distinct info during sql parse. ### How was this patch tested? manual test. Closes #29586 from ulysses-you/SPARK-32743. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-09 09:25:22 +09:00
Max Gekk	c5f6af9f17	[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system ### What changes were proposed in this pull request? Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource. ### Why are the changes needed? There is a bug that when running: ```scala spark.read.format("orc").options(conf).load(path) ``` The underlying file system will not receive the conf options. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added UT to `OrcSourceSuite`. Closes #29976 from MaxGekk/orc-option-propagation. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-08 11:59:30 -07:00
HyukjinKwon	5effa8ea26	[SPARK-33091][SQL] Use foreach instead of map to avoid potential side effects at callers of OrcUtils.readCatalystSchema ### What changes were proposed in this pull request? This is a kind of followup of SPARK-32646. A new JIRA was filed to control the fixed versions properly. When you use `map`, it might be lazily evaluated and not executed. To avoid this, it is better to use `foreach`. See also SPARK-16694. The current code does not appear to cause any bug for now, but it is best to fix it to avoid potential issues. ### Why are the changes needed? To avoid potential issues from `map` being lazy and not executed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran related tests. CI in this PR should verify. Closes #29974 from HyukjinKwon/SPARK-32646. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-08 16:29:15 +09:00
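The laziness concern in the commit above can be shown with a small plain-Scala sketch. It uses an `Iterator` as an example of a lazily evaluated collection; this is an illustration of the general pitfall, not the code touched by the commit, and `LazyMapSketch` and `run` are hypothetical names.

```scala
// Hypothetical illustration of side effects under lazy `map` versus strict `foreach`.
object LazyMapSketch {
  // Returns (side effects observed via map, side effects observed via foreach).
  def run(): (Int, Int) = {
    var viaMap = 0
    var viaForeach = 0
    val input = Seq(1, 2, 3)
    // Iterator.map is lazy: the function body does not run until the
    // resulting iterator is consumed, and here it never is.
    input.iterator.map { x => viaMap += 1; x }
    // foreach is strict: the function body runs for every element right away.
    input.iterator.foreach { _ => viaForeach += 1 }
    (viaMap, viaForeach)
  }
}
```

Because the mapped iterator is discarded without being consumed, its side effect never happens, which is the kind of silent no-op that `foreach` rules out.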
Max Gekk	7d6e3fb998	[SPARK-33074][SQL] Classify dialect exceptions in JDBC v2 Table Catalog ### What changes were proposed in this pull request? 1. Add new method to the `JdbcDialect` class - `classifyException()`. It converts dialect specific exception to Spark's `AnalysisException` or its sub-classes. 2. Replace H2 exception `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` by `AnalysisException`. 3. Add `H2Dialect` ### Why are the changes needed? Currently JDBC v2 Table Catalog implementation throws dialect specific exception and ignores exceptions defined in the `TableCatalog` interface. This PR adds new method for converting dialect specific exception, and assumes that follow up PRs will implement `classifyException()`. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`. Closes #29952 from MaxGekk/jdbcv2-classify-exception. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-08 05:28:33 +00:00
Terry Kim	1c781a4354	[SPARK-32282][SQL] Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as PartitioningCollection ### What changes were proposed in this pull request? This PR proposes to improve `EnsureRequirements.reorderJoinKeys` to handle the following scenarios: 1. If the keys cannot be reordered to match the left-side `HashPartitioning`, consider the right-side `HashPartitioning`. 2. Handle `PartitioningCollection`, which may contain `HashPartitioning` ### Why are the changes needed? 1. For the scenario 1), the current behavior matches either the left-side `HashPartitioning` or the right-side `HashPartitioning`. This means that if both sides are `HashPartitioning`, it will try to match only the left side. The following will not consider the right-side `HashPartitioning`: ``` val df1 = (0 until 10).map(i => (i % 5, i % 13)).toDF("i1", "j1") val df2 = (0 until 10).map(i => (i % 7, i % 11)).toDF("i2", "j2") df1.write.format("parquet").bucketBy(4, "i1", "j1").saveAsTable("t1") df2.write.format("parquet").bucketBy(4, "i2", "j2").saveAsTable("t2") val t1 = spark.table("t1") val t2 = spark.table("t2") val join = t1.join(t2, t1("i1") === t2("j2") && t1("i1") === t2("i2")) join.explain == Physical Plan == (5) SortMergeJoin [i1#26, i1#26], [j2#31, i2#30], Inner :- (2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#69] : +- (1) Project [i1#26, j1#27] : +- (1) Filter isnotnull(i1#26) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4 +- (4) Sort [j2#31 ASC NULLS FIRST, i2#30 ASC NULLS FIRST], false, 0. +- Exchange hashpartitioning(j2#31, i2#30, 4), true, [id=#79]. 
<===== This can be removed +- (3) Project [i2#30, j2#31] +- (3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30)) +- (3) ColumnarToRow +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4 ``` 2. For the scenario 2), the current behavior does not handle `PartitioningCollection`: ``` val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1") val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("i2", "j2") val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3") val join = df1.join(df2, df1("i1") === df2("i2") && df1("j1") === df2("j2")) // PartitioningCollection val join2 = join.join(df3, join("j1") === df3("j3") && join("i1") === df3("i3")) join2.explain == Physical Plan == (9) SortMergeJoin [j1#8, i1#7], [j3#30, i3#29], Inner :- (6) Sort [j1#8 ASC NULLS FIRST, i1#7 ASC NULLS FIRST], false, 0. <===== This can be removed : +- Exchange hashpartitioning(j1#8, i1#7, 5), true, [id=#58] <===== This can be removed : +- (5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner : :- (2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#45] : : +- (1) Project [_1#2 AS i1#7, _2#3 AS j1#8] : : +- (1) LocalTableScan [_1#2, _2#3] : +- (4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#51] : +- (3) Project [_1#13 AS i2#18, _2#14 AS j2#19] : +- (3) LocalTableScan [_1#13, _2#14] +- (8) Sort [j3#30 ASC NULLS FIRST, i3#29 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(j3#30, i3#29, 5), true, [id=#64] +- (7) Project [_1#24 AS i3#29, _2#25 AS j3#30] +- (7) LocalTableScan [_1#24, _2#25] ``` ### Does this PR introduce _any_ user-facing change? 
Yes. In the above examples, the shuffle/sort nodes marked with `This can be removed` are now removed: 1. Scenario 1): ``` == Physical Plan == (4) SortMergeJoin [i1#26, i1#26], [i2#30, j2#31], Inner :- (2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#67] : +- (1) Project [i1#26, j1#27] : +- (1) Filter isnotnull(i1#26) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4 +- (3) Sort [i2#30 ASC NULLS FIRST, j2#31 ASC NULLS FIRST], false, 0 +- (3) Project [i2#30, j2#31] +- (3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30)) +- (3) ColumnarToRow +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4 ``` 2. 
Scenario 2): ``` == Physical Plan == (8) SortMergeJoin [i1#7, j1#8], [i3#29, j3#30], Inner :- (5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner : :- (2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#43] : : +- (1) Project [_1#2 AS i1#7, _2#3 AS j1#8] : : +- (1) LocalTableScan [_1#2, _2#3] : +- (4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#49] : +- (3) Project [_1#13 AS i2#18, _2#14 AS j2#19] : +- (3) LocalTableScan [_1#13, _2#14] +- (7) Sort [i3#29 ASC NULLS FIRST, j3#30 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(i3#29, j3#30, 5), true, [id=#58] +- (6) Project [_1#24 AS i3#29, _2#25 AS j3#30] +- (6) LocalTableScan [_1#24, _2#25] ``` ### How was this patch tested? Added tests. Closes #29074 from imback82/reorder_keys. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-08 04:58:41 +00:00
Karen Feng	39510b0e9b	[SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true ## What changes were proposed in this pull request? Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field. `raise_error` is exposed in SQL, Python, Scala, and R. `assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R. ### Why are the changes needed? Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`. ### Does this PR introduce _any_ user-facing change? Yes: - Adds `raise_error` function to the SQL, Python, Scala, and R APIs. - Adds `assert_true` function to the SQL, Python and R APIs. ### How was this patch tested? Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`. Closes #29947 from karenfeng/spark-32793. Lead-authored-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:05:39 +09:00
Max Gekk	23afc930ae	[SPARK-26499][SQL][FOLLOWUP] Print the loading provider exception starting from the INFO level ### What changes were proposed in this pull request? 1. Don't print the exception in the error message while loading a built-in provider. 2. Print the exception starting from the INFO level. Up to the INFO level, the output is: ``` 17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider. ``` and starting from the INFO level: ``` 17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider. 17:48:32.342 INFO org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception: java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41) ``` ### Why are the changes needed? To avoid "noise" in logs while running tests. 
Currently, logs are blown up: ``` org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception: java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41) ... at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalArgumentException: Intentional Exception at org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider.<init>(IntentionallyFaultyConnectionProvider.scala:26) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite" ``` Closes #29968 from MaxGekk/gaborgsomogyi-SPARK-32001-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 13:50:15 -07:00
Dongjoon Hyun	a127387a53	[SPARK-33082][SQL] Remove hive-1.2 workaround code ### What changes were proposed in this pull request? This PR removes old Hive-1.2 profile related workaround code. ### Why are the changes needed? To simplify the code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI. Closes #29961 from dongjoon-hyun/SPARK-HIVE12. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 12:27:23 -07:00
Takeshi Yamamuro	94d648dff5	[SPARK-33036][SQL] Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner ### What changes were proposed in this pull request? This PR intends to refactor code in `RewriteCorrelatedScalarSubquery` for replacing `ExprId`s in a bottom-up manner instead of doing in a top-down one. This PR comes from the talk with cloud-fan in https://github.com/apache/spark/pull/29585#discussion_r490371252. ### Why are the changes needed? To improve code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29913 from maropu/RefactorRewriteCorrelatedScalarSubquery. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-07 20:16:40 +09:00
Terry Kim	7e99fcd64e	[SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? The current behavior is not consistent between v1 and v2 commands when resolving a temp view. In v2, the `t` in the following example is resolved to a table: ```scala sql("CREATE TABLE testcat.ns.t (id bigint) USING foo") sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i") sql("USE testcat.ns") sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t Describing columns is not supported for v2 tables.; org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.; ``` whereas in v1, the `t` is resolved to a temp view: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv") sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i") sql("USE spark_catalog.test") sql("DESCRIBE t i").show // 't' is resolved to a temp view +---------+----------+ \|info_name\|info_value\| +---------+----------+ \| col_name\| i\| \|data_type\| int\| \| comment\| NULL\| +---------+----------+ ``` ### Does this PR introduce _any_ user-facing change? After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`. ### How was this patch tested? Added a new test Closes #29880 from imback82/describe_column_consistent. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-07 06:33:20 +00:00
Max Gekk	aea78d2c8c	[SPARK-33034][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect) ### What changes were proposed in this pull request? 1. Override the default SQL strings in the Oracle Dialect for: - ALTER TABLE ADD COLUMN - ALTER TABLE UPDATE COLUMN TYPE - ALTER TABLE UPDATE COLUMN NULLABILITY 2. Add new docker integration test suite `jdbc/v2/OracleIntegrationSuite.scala` ### Why are the changes needed? In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some `ALTER TABLE` at the moment. This PR supports Oracle specific `ALTER TABLE`. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running new integration test suite: ``` $ ./build/sbt -Pdocker-integration-tests "test-only *.OracleIntegrationSuite" ``` Closes #29912 from MaxGekk/jdbcv2-oracle-alter-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-07 04:48:57 +00:00
Max Gekk	584f90c82e	[SPARK-33067][SQL][TESTS][FOLLOWUP] Check error messages in JDBCTableCatalogSuite ### What changes were proposed in this pull request? Get error message from the expected exception, and check that they are reasonable. ### Why are the changes needed? To improve tests by expecting particular error messages. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `JDBCTableCatalogSuite`. Closes #29957 from MaxGekk/jdbcv2-negative-tests-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 09:29:30 +09:00
Liang-Chi Hsieh	57ed5a829b	[SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain ### What changes were proposed in this pull request? This proposes to simplify named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`. ### Why are the changes needed? Simplify complex expression tree that could be produced by query optimization or user. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #29942 from viirya/SPARK-33007. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-06 16:59:23 -07:00
Kousuke Saruta	3b2a38d735	[SPARK-32511][SQL][FOLLOWUP] Fix the broken build for Scala 2.13 with Maven ### What changes were proposed in this pull request? This PR fixes the broken build for Scala 2.13 with Maven. https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966 #29795 was merged though it doesn't successfully finish the build for Scala 2.13 ### Why are the changes needed? To fix the build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package` Closes #29954 from sarutak/hotfix-seq. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-06 09:40:16 -07:00
Kent Yao	17d309dfac	[SPARK-32963][SQL] empty string should be consistent for schema name in SparkGetSchemasOperation ### What changes were proposed in this pull request? This PR makes the empty string for the schema name pattern match the global temp view, the same as it works for other databases. This PR also adds new tests covering different kinds of wildcards to verify the SparkGetSchemasOperation ### Why are the changes needed? When the schema name is an empty string, it is considered as ".*" and can match all databases in the catalog. But it cannot match the global temp view, as it is not converted to ".*" there. ### Does this PR introduce _any_ user-facing change? Yes, a JDBC operation like `statement.getConnection.getMetaData.getSchemas(null, "")` now also provides the global temp view in the result set. ### How was this patch tested? New tests Closes #29834 from yaooqinn/SPARK-32963. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 16:01:10 +00:00
Wenchen Fan	ec6fccb922	[SPARK-32243][SQL][FOLLOWUP] Fix compilation in HiveSessionCatalog Fix a mistake when merging https://github.com/apache/spark/pull/29054 Closes #29955 from cloud-fan/hot-fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 14:33:34 +00:00
angerszhu	ddc7012b3d	[SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error ### What changes were proposed in this pull request? When a UDAF is created from a class that extends `UserDefinedAggregateFunction` and is called with Hive support enabled, `HiveSessionCatalog` calls `super.makeFunctionExpression`. If that call fails, for example because the function needs 2 parameters and only 1 is given, the error is caught and the exception that is thrown shows only ``` No handler for UDF/UDAF/UDTF xxxxxxxx ``` This is confusing for developers; we should show the error thrown by the super method too. For this PR's UT: before the change, the exception is ``` No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7 ``` After this PR, the exception is ``` Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1; Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7 ``` ### Why are the changes needed? To show a more detailed error message when defining a UDAF. ### Does this PR introduce _any_ user-facing change? People will see a more detailed error message when using Spark SQL's UDAF in Hive support mode. ### How was this patch tested? Added UT Closes #29054 from AngersZhuuuu/SPARK-32243. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 09:09:19 +00:00
fqaiser94@gmail.com	2793347972	[SPARK-32511][SQL] Add dropFields method to Column class ### What changes were proposed in this pull request? 1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`). 2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ \|a \| +---------------------------------+ \|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]\| +---------------------------------+ ``` Currently, to drop the missing 
value users would have to do something like this: ``` val result = data.withColumn("a", struct( $"a.a", struct( struct( $"a.b.a.a", $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this to get the same result: ``` val result = data.withColumn("a", 'a.dropFields("b.a.b")) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`. However, this should be added in a separate PR. ### Does this PR introduce _any_ user-facing change? The documentation for `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Column directly. ### How was this patch tested? New unit tests were added. Jenkins must pass them. 
### Related JIRAs: More discussion on this topic can be found here: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try. Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 08:53:30 +00:00
Takeshi Yamamuro	4adc2822a3	[SPARK-33035][SQL] Updates the obsoleted entries of attribute mapping in QueryPlan#transformUpWithNewOutput ### What changes were proposed in this pull request? This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this; ``` case class TestRule extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput { case p @ Project(projList, _) => val newPlan = p.copy(projectList = projList.map { _.transform { // Assigns a new `ExprId` for references case a: AttributeReference => Alias(a, a.name)() }}.asInstanceOf[Seq[NamedExpression]]) val attrMapping = p.output.zip(newPlan.output) newPlan -> attrMapping } } ``` Then, this rule is applied to the plan below; ``` (3) Project [a#5, b#6] +- (2) Project [a#5, b#6] +- (1) Project [a#5, b#6] +- LocalRelation <empty>, [a#5, b#6] ``` In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project` first by using the attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using an incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modifies the code to update the attribute mapping entries that are obsoleted by entries generated in a given rule. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests in `QueryPlanSuite`. Closes #29911 from maropu/QueryPlanBug. 
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 08:32:55 +00:00
Max Gekk	9870cf9c08	[SPARK-33067][SQL][TESTS] Add negative checks to JDBC v2 Table Catalog tests ### What changes were proposed in this pull request? Add checks for the cases when JDBC v2 Table Catalog commands fail. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `JDBCTableCatalogSuite`. Closes #29945 from MaxGekk/jdbcv2-negative-tests. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-06 13:01:57 +09:00
Dongjoon Hyun	008a2ad1f8	[SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1) ### What changes were proposed in this pull request? As of today, - SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency on the forked Hive 1.2.1 in the maven repository. - SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions. This PR(SPARK-20202) aims to completely remove the following usage of the unofficial Apache Hive fork from Apache Spark master for Apache Spark 3.1.0. ``` <hive.group>org.spark-project.hive</hive.group> <hive.version>1.2.1.spark2</hive.version> ``` For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it. ### Why are the changes needed? - First, the Apache Spark community should not use the unofficial forked release of another Apache project. - Second, Apache Hive 1.2.1 was released on 2015-06-26 and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache because the forked `1.2.1.spark2` is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and it has been used with fewer bugs compared with `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile and new Apache Spark unit tests are added with the `HiveUtils.isHive23` condition so far. ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`. ### How was this patch tested? 1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366) 2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382) 3. SBT/Hadoop 3.2/Hive 1.2 (This combination is already unsupported because Hive 1.2 doesn't work with Hadoop 3.2.) 4. 
SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected) Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-05 15:29:56 -07:00
allisonwang-db	14aeab3b27	[SPARK-33038][SQL] Combine AQE initial and current plan string when two plans are the same ### What changes were proposed in this pull request? This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers: Before ```scala AdaptiveSparkPlan isFinalPlan=false +- == Current Plan == SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... +- == Initial Plan == SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... ``` After ```scala AdaptiveSparkPlan isFinalPlan=false +- SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... ``` For SQL `EXPLAIN` output: Before ```scala AdaptiveSparkPlan (8) +- == Current Plan == Sort (7) +- Exchange (6) ... +- == Initial Plan == Sort (7) +- Exchange (6) ... ``` After ```scala AdaptiveSparkPlan (8) +- Sort (7) +- Exchange (6) ... ``` ### Why are the changes needed? To simplify the AQE plan string by removing the redundant plan information. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Modified the existing unit test. Closes #29915 from allisonwang-db/aqe-explain. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-10-05 09:30:27 -07:00
Yuming Wang	023eb482b2	[SPARK-32914][SQL] Avoid constructing dataType multiple times ### What changes were proposed in this pull request? Some expressions' data types are not static values; a new object must be constructed each time the `dataType` function is called, e.g. `CaseWhen`. We should avoid constructing dataType multiple times because it may be used many times, e.g. in [`HyperLogLogPlusPlus.update`](`10edeafc69/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (L122)`). ### Why are the changes needed? To improve query performance. For example: ```scala spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").show ``` Profiling result: ``` -- Execution profile --- Total samples : 18365 Frame buffer usage : 2.6688% --- 58443254327 ns (31.82%), 5844 samples [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int, StarTask&) [ 1] StealTask::do_it(GCTaskManager, unsigned int) [ 2] GCTaskThread::run() [ 3] java_start(Thread) [ 4] start_thread --- 6140668667 ns (3.34%), 614 samples [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::peek() [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator) [ 2] StealTask::do_it(GCTaskManager, unsigned int) [ 3] GCTaskThread::run() [ 4] java_start(Thread) [ 5] start_thread --- 5679994036 ns (3.09%), 568 samples [ 0] scala.collection.generic.Growable.$plus$plus$eq [ 1] scala.collection.generic.Growable.$plus$plus$eq$ [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1 [ 5] scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply [ 6] scala.collection.immutable.List.foreach [ 7] scala.collection.generic.GenericTraversableTemplate.flatten [ 8] 
scala.collection.generic.GenericTraversableTemplate.flatten$ [ 9] scala.collection.AbstractTraversable.flatten [10] org.apache.spark.internal.config.ConfigEntry.readString [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom [12] org.apache.spark.sql.internal.SQLConf.getConf [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis [14] org.apache.spark.sql.types.DataType.sameType [15] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1 [16] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted [17] org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl [19] scala.collection.IndexedSeqOptimized.forall [20] scala.collection.IndexedSeqOptimized.forall$ [21] scala.collection.mutable.ArrayBuffer.forall [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType [23] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck [24] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$ [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck [26] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType [27] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$ [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType [29] org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update [30] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2 [31] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted [32] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply [33] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7 [34] 
org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted [35] org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test and benchmark test: Benchmark code \| Before this PR(Milliseconds) \| After this PR(Milliseconds) --- \| --- \| --- spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").collect() \| 56462 \| 3794 Closes #29790 from wangyum/SPARK-32914. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 22:00:42 +09:00
Yuning Zhang	0fb2574d4e	[SPARK-33042][SQL][TEST] Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime ### What changes were proposed in this pull request? Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime. ### Why are the changes needed? Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156 However, this test case only checks the value of the conf can be changed at runtime. It does not check the updated value is actually used by the Optimizer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? unit test Closes #29919 from yuningzh-db/add_optimizer_test. Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 20:25:57 +09:00
Liang-Chi Hsieh	37c806af2b	[SPARK-32958][SQL] Prune unnecessary columns from JsonToStructs ### What changes were proposed in this pull request? This patch proposes to do column pruning for the `JsonToStructs` expression if we only require some fields from it. ### Why are the changes needed? `JsonToStructs` takes a schema parameter that tells `JacksonParser` which fields need to be parsed. If `JsonToStructs` is followed by `GetStructField`, we can prune the schema to parse only the required field. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #29900 from viirya/SPARK-32958. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-03 14:55:02 -07:00
Takeshi Yamamuro	82721ce00b	[SPARK-32741][SQL][FOLLOWUP] Run plan integrity check only for effective plan changes ### What changes were proposed in this pull request? (This is a followup PR of #29585) The PR modified `RuleExecutor#isPlanIntegral` code for checking if a plan has globally-unique attribute IDs, but this check made Jenkins maven test jobs much longer (See [the Dongjoon comment](https://github.com/apache/spark/pull/29585#issuecomment-702461314) and thanks, dongjoon-hyun !). To recover running time for the Jenkins tests, this PR intends to update the code to run plan integrity check only for effective plans. ### Why are the changes needed? To recover running time for Jenkins tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29928 from maropu/PR29585-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-02 22:16:19 +09:00
Yuming Wang	9996e252ad	[SPARK-33026][SQL] Add numRows to metric of BroadcastExchangeExec ### What changes were proposed in this pull request? This PR adds `numRows` to the metric and runtimeStatistics of `BroadcastExchangeExec`. ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) needs the row count. [ShuffleExchangeExec](`1c6dff7b5f/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (L127)`) already reports the row count, but `BroadcastExchangeExec` is missing it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29904 from wangyum/SPARK-33026. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-01 23:01:31 -07:00
Gabor Somogyi	991f7e81d4	[SPARK-32001][SQL] Create JDBC authentication provider developer API ### What changes were proposed in this pull request? At the moment only the baked in JDBC connection providers can be used but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API name `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider). The PR contains the following changes: * Added connection provider developer API * Made JDBC connection providers constructor to noarg => needed to load them w/ service loader * Connection providers are now loaded w/ service loader * Added tests to load providers independently * Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config ### Why are the changes needed? No custom authentication possibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Existing + additional unit tests * Docker integration tests * Tested manually the newly created external JDBC connection provider Closes #29024 from gaborgsomogyi/SPARK-32001. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-02 13:04:40 +09:00
Cheng Su	d6f3138352	[SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically ### What changes were proposed in this pull request? This PR is to add support to decide bucketed table scan dynamically based on actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we will use bucket table scan (all input files per the bucket will be read by same task). This has the drawback that if the bucket table scan is not benefitting at all (no join/groupby/etc in the query), we don't need to use bucket table scan as it would restrict the # of tasks to be # of buckets and might hurt parallelism. The feature is to add a physical plan rule right after `EnsureRequirements`: The rule goes through plan nodes. For all operators which has "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for operator has `Exchange` and bucketed table scan (and only allow certain operators in plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc`.), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, disabling bucketed table scan if there's operator with interesting partition along the sub-plan. Why the algorithm works is that if there's a shuffle between the bucketed table scan and operator with interesting partition, then bucketed table scan partitioning will be destroyed by the shuffle operator in the middle, and we don't need bucketed table scan for sure. The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan . ### Why are the changes needed? 
To avoid unnecessary bucketed scans in the query. This is also a prerequisite for https://github.com/apache/spark/pull/29625 (deciding bucketed sorted scan dynamically will be added later in that PR). ### Does this PR introduce _any_ user-facing change? A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in or out by enabling or disabling the config. As we found in prod, some users rely on the assumption that # of tasks == # of buckets when reading a bucketed table to precisely control # of tasks. This is a bad assumption, but it does happen on our side, so the config is left here to allow them to opt out of the feature. ### How was this patch tested? Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala` Closes #29804 from c21/bucket-rule. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-02 09:01:15 +09:00
ulysses	e62d24717e	[SPARK-32585][SQL] Support scala enumeration in ScalaReflection ### What changes were proposed in this pull request? Add code in `ScalaReflection` to support Scala enumerations and map the enumeration type to string type in Spark. ### Why are the changes needed? We support Java enums but fail on Scala enumerations; it's better to keep the behavior consistent. Here is an example. ``` package test object TestEnum extends Enumeration { type TestEnum = Value val E1, E2, E3 = Value } import TestEnum._ case class TestClass(i: Int, e: TestEnum) { } import test._ Seq(TestClass(1, TestEnum.E1)).toDS ``` Before this PR ``` Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum - field (class: "scala.Enumeration.Value", name: "e") - root class: "test.TestClass" at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881) ``` After this PR `org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]` ### Does this PR introduce _any_ user-facing change? Yes, users can create a Dataset from a case class that includes a Scala enumeration field. ### How was this patch tested? Add test. Closes #29403 from ulysses-you/SPARK-32585. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2020-10-01 15:58:01 -04:00
yangjie01	0963fcd848	[SPARK-33024][SQL] Fix CodeGen fallback issue of UDFSuite in Scala 2.13 ### What changes were proposed in this pull request? After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` that `SharedSparkSessionBase` uses to construct the `SparkSession` in tests, the test `SPARK-32459: UDF should not fail on WrappedArray` in s.sql.UDFSuite exposed a codegen fallback issue in Scala 2.13, as follows: ``` - SPARK-32459: UDF should not fail on WrappedArray * FAILED * Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)" ``` The root cause is that `WrappedArray` represents `mutable.ArraySeq` in Scala 2.13 and has a different `newBuilder` method signature. The main change of this PR is to add a Scala 2.13-only code path to deal with `case match WrappedArray` in Scala 2.13. ### Why are the changes needed? We need to support a Scala 2.13 build ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: All tests passed. Do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am mvn test -pl sql/core -Pscala-2.13 ``` Before ``` Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0 * 1 TEST FAILED * ``` After ``` Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0 All tests passed. 
``` Closes #29903 from LuciferYang/fix-udfsuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-01 08:37:07 -05:00

10945 commits