Commit graph

5629 commits

Author SHA1 Message Date
PengLei 87d49cbcb1 [SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking whether a column name exists when altering a table
### What changes were proposed in this pull request?
Add a `Resolver` to `checkColumnNotExists` so that the check for an existing column name respects case sensitivity.

### Why are the changes needed?
Currently the resolver used by `findNestedField` (which is called by `checkColumnNotExists`) is `_ == _`, i.e. always case sensitive.
Pass `alter.conf.resolver` to it instead.
[SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381)
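For reference, a small sketch of the analyzer's `Resolver` type and its two built-in resolutions (modeled after `org.apache.spark.sql.catalyst.analysis`; this is not the PR's actual diff):

```scala
// Sketch of the Resolver type and the two resolutions a conf-driven check chooses between.
type Resolver = (String, String) => Boolean
val caseSensitiveResolution: Resolver = (a, b) => a == b
val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

caseSensitiveResolution("ID", "id")    // false: a case-sensitive check treats these as different columns
caseInsensitiveResolution("ID", "id")  // true:  a case-insensitive check treats them as the same column
```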
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes #33618 from Peng-Lei/sensitive-cloumn-name.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 10:04:13 +09:00
Yuming Wang 4a6afb4875 [SPARK-36280][SQL] Remove redundant aliases after RewritePredicateSubquery
### What changes were proposed in this pull request?

Remove redundant aliases after `RewritePredicateSubquery`. For example:
```scala
sql("CREATE TABLE t1 USING parquet AS SELECT id AS a, id AS b, id AS c FROM range(10)")
sql("CREATE TABLE t2 USING parquet AS SELECT id AS x, id AS y FROM range(8)")
sql(
  """
    |SELECT *
    |FROM  t1
    |WHERE  a IN (SELECT x
    |  FROM  (SELECT x AS x,
    |           Rank() OVER (partition BY x ORDER BY Sum(y) DESC) AS ranking
    |    FROM   t2
    |    GROUP  BY x) tmp1
    |  WHERE  ranking <= 5)
    |""".stripMargin).explain
```
Before this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#7L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#68]
      +- Project [x#7L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#62]
                     +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                        +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                           +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                              +- FileScan parquet default.t2[x#15L,y#16L]
```

After this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#15L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#67]
      +- Project [x#15L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                     +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                        +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                           +- FileScan parquet default.t2[x#15L,y#16L]
```

### Why are the changes needed?

Reduce shuffle to improve query performance. This change can benefit TPC-DS q70.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33509 from wangyum/SPARK-36280.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-03 13:56:59 -07:00
Wenchen Fan 7cb9c1c241 [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN
### What changes were proposed in this pull request?

This is a followup of recent work such as https://github.com/apache/spark/pull/33200

For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule for the `ALTER TABLE ... COLUMN` commands.

This PR also moves these AlterTable commands to an individual file and gives them a base trait.

### Why are the changes needed?

name simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33609 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 10:43:00 +03:00
Yuming Wang c20af53580 [SPARK-36373][SQL] DecimalPrecision only add necessary cast
### What changes were proposed in this pull request?

This PR makes `DecimalPrecision` add only the necessary casts, similar to [`ImplicitTypeCasts`](96c2919988/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L675-L678)). For example:
```
EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))())
```
Currently it adds a useless cast to _d1_:
```
(cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
```

### Why are the changes needed?

1. Avoid adding unnecessary casts, even though they would be removed by `SimplifyCasts` later.
2. I'm trying to add an extended rule similar to `PullOutGroupingExpressions`. The current behavior would introduce additional aliases, for example: `cast(d1 as decimal(5,2)) as cast_d1`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33602 from wangyum/SPARK-36373.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-08-03 08:12:54 +08:00
Chao Sun 7a27f8a07f [SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even if directSQL is enabled in remote HMS
### What changes were proposed in this pull request?

Change `HiveShim.getPartitionsByFilter` to always fall back to `getAllPartitionsMethod`, even if `hive.metastore.try.direct.sql` is set to true in the remote HMS.
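A rough sketch of the fallback pattern described here (not the real `HiveShim` code; the parameter names below are made up for the sketch):

```scala
import scala.util.control.NonFatal

// Illustrative fallback pattern only.
def getPartitionsByFilterWithFallback[T](
    listByFilter: => Seq[T],          // pushes the predicate down to the remote HMS
    listAllPartitions: => Seq[T],     // the getAllPartitionsMethod-style full listing
    pruneOnClient: Seq[T] => Seq[T]): Seq[T] = {
  try {
    listByFilter
  } catch {
    case NonFatal(_) =>
      // The remote HMS can still fail even when hive.metastore.try.direct.sql is true
      // (e.g. HIVE-21497), so list everything and prune on the client side instead.
      pruneOnClient(listAllPartitions)
  }
}
```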

### Why are the changes needed?

At the moment `getPartitionsByFilter` in `HiveShim` only falls back to `getAllPartitionsMethod` when `hive.metastore.try.direct.sql` is disabled in the remote HMS, and fails the query otherwise. However, in certain cases the remote HMS falls back to ORM (which only supports string types for partition columns) to query the underlying RDBMS **even if this config is set to true**. In this scenario, Spark currently cannot recover from the exception and just fails the query.

For instance, we encountered this bug [HIVE-21497](https://issues.apache.org/jira/browse/HIVE-21497) in an HMS running Hive 3.1.2, and Spark was not able to push down a filter on a date column.

### Does this PR introduce _any_ user-facing change?

Yes. Now, if Spark queries partitions from a remote HMS which throws an exception even though `hive.metastore.try.direct.sql` is set to true, Spark falls back to listing all partitions and pruning on the client side, instead of failing the query.

### How was this patch tested?

Tested locally with an HMS instance running Hive 3.1.2. It's pretty hard to add a unit test for this since we don't have a mock HMS.

Closes #33382 from sunchao/SPARK-36137-direct-sql.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-02 16:48:43 -07:00
Hyukjin Kwon 0bbcbc6508 [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
### What changes were proposed in this pull request?

This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode.
Previously, we passed `null`s through as-is, the root `InternalRow`s became `null`, and that caused the query to fail even with permissive mode on.
Now we fail explicitly when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### Why are the changes needed?

To make the permissive mode proceed and parse without throwing an exception.

### Does this PR introduce _any_ user-facing change?

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Note that this behaviour is consistent with the case where a JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### How was this patch tested?

Manually tested, and unit test was added.

Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 10:01:12 -07:00
Angerszhuuuu f3173956cb [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
### What changes were proposed in this pull request?
Without this patch, the added UT fails as below:
```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces aliases, it should use the original column name.
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33576 from AngersZhuuuu/SPARK-36086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 00:08:13 +08:00
Linhong Liu 2f700773c2 [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change `NullType.simpleString` to "void" to make "void" the formal type name of `NullType`.

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. in schema strings, error messages and documentation. Hence it's not possible to hide it from users, and we have to choose a proper name.
2. "void" is widely used as the type name of "NULL", e.g. in Hive and pgSQL.
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e. makes `from_json(col, schema.toDDL)` work); see the sketch below.
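A small sketch of the round trip mentioned in point 3 (expected behavior after this change; the rendered DDL string is an assumption and not verified output):

```scala
import org.apache.spark.sql.types._

// With "void" as the formal name, a schema containing NullType can be rendered to DDL
// and parsed back.
val schema = StructType(Seq(StructField("a", NullType), StructField("b", IntegerType)))
val ddl = schema.toDDL               // e.g. something like "a VOID,b INT" (exact formatting is an assumption)
val parsed = StructType.fromDDL(ddl) // should give back an equivalent schema
```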

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". For example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:19:54 +08:00
Terry Kim 3b713e7f61 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)")
```
doesn't fail analysis, and it's up to the catalog implementation to handle it. For the v1 command, the duplication is checked before invoking the catalog.

### Why are the changes needed?

To check for duplicate columns during analysis and be consistent with the v1 command.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33600 from imback82/alter_add_duplicate_columns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 17:54:50 +08:00
Sean Owen 72615bc551 [SPARK-36362][CORE][SQL][TESTS] Omnibus Java code static analyzer warning fixes
### What changes were proposed in this pull request?

Fix up some minor Java issues:

- Some int*int multiplications that are widened to long but could overflow before widening (see the sketch after this list)
- Unnecessarily non-static inner classes
- Some tests "catch (AssertionError)" and do nothing
- Manual array iteration vs very slightly faster/simpler foreach
- Incorrect generic types that just happen to not cause a runtime error
- Missed opportunities for try-with-resources
- Mutable enums
- .. and a few other minor things
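For the first category, a small illustration (shown in Scala for brevity; the patch itself touches Java code, and this snippet is not taken from the patch):

```scala
// int * int is evaluated in Int arithmetic first, so the product can overflow
// before it is ever widened to Long.
val bytesPerMb: Int = 1024 * 1024
val broken: Long = 3000 * bytesPerMb   // overflows past ~2.1 billion, then widens the wrong value
val fixed: Long  = 3000L * bytesPerMb  // force the multiplication to happen in Long
```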

### Why are the changes needed?

Some are minor but clear fixes; some may have a marginal perf impact or avoid a bug later. Also: maybe avoid future PRs to address these one by one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #33594 from srowen/SPARK-36362.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-31 22:35:57 -07:00
Wenchen Fan 387a251a68 [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33352 , to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it ourselves. This can simplify the code a lot, and is also more robust: the data type of SUM may vary across databases, so it's fragile to assume it is always the same as Spark's.
2. Because of 1, we can now remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL-like.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-30 00:26:32 -07:00
Kousuke Saruta db18866742 [SPARK-36323][SQL] Support ANSI interval literals for TimeWindow
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.

### Why are the changes needed?

Watermarks already support ANSI interval literals, so it's natural to support them for `TimeWindow` as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-29 08:51:51 +03:00
Linhong Liu ed0e351f05 [SPARK-36286][SQL] Block some invalid datetime string
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks such invalid datetime strings.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, below strings will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 09:16:46 +08:00
dgd-contributor e1c50ff779 [SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and returns a wrong value on overflow
### What changes were proposed in this pull request?
1/ `conv()` behaves inconsistently: the returned value differs above the 64-character threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ `conv()` should return the maximum unsigned long value, expressed in base `toBase`, when there is an overflow.

```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix. This pull request aims to make the `conv` function behave similarly to MySQL's `conv` function.
### Does this PR introduce _any_ user-facing change?
Changes the result of the `conv()` function in the cases above.
### How was this patch tested?
add test

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 00:19:04 +08:00
Pablo Langa f12793de20 [SPARK-35320][SQL] Improve error message for unsupported key types in MapType in from_json expression
### What changes were proposed in this pull request?

Currently, when a map is parsed in a `from_json` function, only StringType keys are supported. If you try to parse another key type, it results in a cast exception.
For example:
```scala
Seq((s"""{"2021-05-05T20:05:08": "sampleValue"}"""))
  .toDF("value")
  .withColumn("value1", from_json(col("value"),  MapType(TimestampType, StringType)))
  .show
```
```
Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
	at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
```
This PR proposes to improve the error message.
```
org.apache.spark.sql.AnalysisException: cannot resolve 'entries' due to data type mismatch: Input schema map<timestamp,string> can only contain StringType as a key type for a MapType.;
'Project [unresolvedalias(from_json(MapType(TimestampType,StringType,true), value#1, Some(America/Los_Angeles)), Some(org.apache.spark.sql.Column$$Lambda$1496/54693608710e5bf9c))]
+- LocalRelation [value#1]
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:197)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:182)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
...
```
In https://github.com/apache/spark/pull/32599 we decided to improve the error message instead of supporting this.

### Why are the changes needed?

Avoid confusion in the interpretation of the error

### Does this PR introduce _any_ user-facing change?

Yes, the error message returned in this case changes.

### How was this patch tested?

Unit testing and manual testing

Closes #33525 from planga82/feature/spark35320_improve_error_message.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:46:06 +08:00
Terry Kim 809b88a162 [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33200 from imback82/alter_add_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:00:29 +08:00
allisonwang-db 23a6ffa5dc [SPARK-36275][SQL] ResolveAggregateFunctions should work with nested fields
### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.

### Why are the changes needed?
To fix an analyzer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33498 from allisonwang-db/spark-36275-resolve-agg-func.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:35:17 +08:00
Huaxin Gao c8dd97d456 [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
### What changes were proposed in this pull request?
Update the Java doc and JDBC data source doc, and address follow-up comments.

### Why are the changes needed?
update doc and address follow up comments

### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.

### How was this patch tested?
manually checked

Closes #33526 from huaxingao/aggPD_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:52:42 +08:00
Linhong Liu 8e7e14dc0d [SPARK-36241][SQL] Support creating tables with null column
### What changes were proposed in this pull request?
Previously, we blocked creating tables with null-type columns to follow the Hive behavior, in PR #28833.
In this PR, I propose to restore the earlier behavior and support null-type columns in a table.

### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens to the input query of
CTAS, the query will fail because Spark doesn't allow creating a table with a null-type column. From the user's perspective,
it's hard to figure out why the null-type column is produced in a complicated query and how to fix it. So removing
this constraint is more friendly to users.

### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833. For example, the command below will succeed after this PR:
```sql
CREATE TABLE t (col_1 void, col_2 int)
```

### How was this patch tested?
newly added and existing test cases

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 17:31:52 +08:00
Wenchen Fan 068f8d434a [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command
### What changes were proposed in this pull request?

We added the char/varchar support in 3.1, but the string length check is only applied to INSERT, not UPDATE/MERGE. This PR fixes it. This PR also adds the missing type coercion for UPDATE/MERGE.

### Why are the changes needed?

complete the char/varchar support and make UPDATE/MERGE easier to use by doing type coercion.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

New UT. No built-in source supports UPDATE/MERGE, so an end-to-end test is not applicable here.

Closes #33468 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 13:57:05 +08:00
Huaxin Gao c561ee6865 [SPARK-34952][SQL] DSv2 Aggregate push down APIs
### What changes were proposed in this pull request?
Add interfaces and APIs to push down Aggregates to V2 Data Source

### Why are the changes needed?
improve performance

### Does this PR introduce _any_ user-facing change?
SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If it is set to true, aggregates are pushed down to the data source.
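A hedged usage sketch (the config key string below is my assumption for the `PARQUET_AGGREGATE_PUSHDOWN_ENABLED` entry, and the path is a placeholder):

```scala
import org.apache.spark.sql.functions.max

// Assumed key for SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED; off by default.
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// A simple aggregate over a Parquet source is a candidate for pushdown.
spark.read.parquet("/path/to/data").agg(max("id")).explain()
```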

### How was this patch tested?
New tests were added to test aggregates push down in https://github.com/apache/spark/pull/32049.  The original PR is split into two PRs. This PR doesn't contain new tests.

Closes #33352 from huaxingao/aggPushDownInterface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 16:01:22 +08:00
Kousuke Saruta 07fa38e2c1 [SPARK-35815][SQL] Allow delayThreshold for watermark to be represented as ANSI interval literals
### What changes were proposed in this pull request?

This PR extends the way to represent `delayThreshold` with ANSI interval literals for watermark.

### Why are the changes needed?

A `delayThreshold` is semantically an interval value, so it should be representable as an ANSI interval literal as well as in the conventional `1 second` form.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33456 from sarutak/delayThreshold-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-22 17:36:22 +03:00
Gengliang Wang ae9f6126fb [SPARK-36257][SQL] Updated the version of TimestampNTZ related changes as 3.3.0
### What changes were proposed in this pull request?

As we decided to release TimestampNTZ type in Spark 3.3, we should update the versions of TimestampNTZ related changes as 3.3.0.

### Why are the changes needed?

Correct the versions in documentation/code comment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33478 from gengliangwang/updateVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-22 21:01:29 +08:00
gengjiaan 900b72a9cd [SPARK-35088][SQL][FOLLOWUP] Add test case for TimestampNTZ sequence with default step
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/33360 and add test case for `TimestampNTZ` sequence with default step.

### Why are the changes needed?
Improve test coverage.

### Does this PR introduce _any_ user-facing change?
'No'.
Just add test cases.

### How was this patch tested?
New tests.

Closes #33462 from beliefer/SPARK-36090-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-22 17:53:22 +08:00
allisonwang-db de8e4be92c [SPARK-36063][SQL] Optimize OneRowRelation subqueries
### What changes were proposed in this pull request?
This PR adds optimization for scalar and lateral subqueries with OneRowRelation as leaf nodes. It inlines such subqueries before decorrelation to avoid rewriting them as left outer joins. It also introduces a flag to turn on/off this optimization: `spark.sql.optimizer.optimizeOneRowRelationSubquery` (default: True).

For example:
```sql
select (select c1) from t
```
Analyzed plan:
```
Project [scalar-subquery#17 [c1#18] AS scalarsubquery(c1)#22]
:  +- Project [outer(c1#18)]
:     +- OneRowRelation
+- LocalRelation [c1#18, c2#19]
```

Optimized plan before this PR:
```
Project [c1#18#25 AS scalarsubquery(c1)#22]
+- Join LeftOuter, (c1#24 <=> c1#18)
   :- LocalRelation [c1#18]
   +- Aggregate [c1#18], [c1#18 AS c1#18#25, c1#18 AS c1#24]
      +- LocalRelation [c1#18]
```

Optimized plan after this PR:
```
LocalRelation [scalarsubquery(c1)#22]
```
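As a usage note, the new optimization can be toggled with the flag mentioned above (a sketch; `t` is the table from the example):

```scala
// Disable the rule to get the pre-PR left-outer-join rewrite back, for comparison.
spark.conf.set("spark.sql.optimizer.optimizeOneRowRelationSubquery", "false")
spark.sql("select (select c1) from t").explain(true)
spark.conf.set("spark.sql.optimizer.optimizeOneRowRelationSubquery", "true")
```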

### Why are the changes needed?
To optimize query plans.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new unit tests.

Closes #33284 from allisonwang-db/spark-36063-optimize-subquery-one-row-relation.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-22 10:48:32 +08:00
Rahul Mahadev efcce23b91 [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState
### What changes were proposed in this pull request?
Adding support for accepting an initial state with flatMapGroupsWithState in batch mode.

### Why are the changes needed?
SPARK-35897 added support for accepting an initial state for streaming queries using flatMapGroupsWithState. The code flow is separate for batch and streaming, so it required a different PR.

### Does this PR introduce _any_ user-facing change?

Yes. As discussed above, flatMapGroupsWithState in batch mode can accept an initial state; previously this would throw an UnsupportedOperationException.

### How was this patch tested?

Added relevant unit tests in FlatMapGroupsWithStateSuite and modified the tests in `JavaDatasetSuite`.

Closes #33336 from rahulsmahadev/flatMapGroupsWithStateBatch.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2021-07-21 01:48:58 -04:00
Liang-Chi Hsieh 2653201b0a [SPARK-36030][SQL] Support DS v2 metrics at writing path
### What changes were proposed in this pull request?

We added the interface for DS v2 metrics in SPARK-34366, but only for the reading path. This patch extends the metrics interface to the writing path.

### Why are the changes needed?

Complete DS v2 metrics interface support in writing path.

### Does this PR introduce _any_ user-facing change?

No. For developers, yes, as this adds metrics support to the DS v2 writing path.

### How was this patch tested?

Added test.

Closes #33239 from viirya/v2-write-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-20 20:20:35 -07:00
gengjiaan c0d84e6cf1 [SPARK-36222][SQL] Step by days in the Sequence expression for dates
### What changes were proposed in this pull request?
The current implementation of the `Sequence` expression does not support stepping by days for dates.
```
spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' day);
Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch:
sequence uses the wrong parameter type. The parameter type must conform to:
1. The start and stop expressions must resolve to the same type.
2. If start and stop expressions resolve to the 'date' or 'timestamp' type
then the step expression must resolve to the 'interval' or
'interval year to month' or 'interval day to second' type,
otherwise to the same type as the start and stop expressions.
         ; line 1 pos 7;
'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' DAY), Some(Europe/Moscow)), None)]
+- OneRowRelation
```

### Why are the changes needed?
A `DayTimeInterval` with day granularity should be usable as the step for dates.

### Does this PR introduce _any_ user-facing change?
'Yes'.
The `Sequence` expression will support stepping dates by a `DayTimeInterval` with day granularity.
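A sketch of the expected behavior after this change (the result is hand-computed, not taken from the PR):

```scala
// Day-granularity interval steps for date sequences are expected to resolve after this change.
spark.sql("select sequence(date'2021-07-01', date'2021-07-10', interval '3' day)").show(false)
// expected: [2021-07-01, 2021-07-04, 2021-07-07, 2021-07-10]
```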

### How was this patch tested?
New tests.

Closes #33439 from beliefer/SPARK-36222.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-20 19:16:56 +03:00
Karen Feng ddc61e62b9 [SPARK-36079][SQL] Null-based filter estimate should always be in the range [0, 1]
### What changes were proposed in this pull request?

Forces the selectivity estimate for null-based filters to be in the range `[0,1]`.

### Why are the changes needed?

I noticed in a few TPC-DS query tests that the column-statistic null count can be higher than the table-statistic row count. In the current implementation, the selectivity estimate for `IsNotNull` then becomes negative.
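A minimal sketch of the clamping (not the actual `FilterEstimation` code; treating the `IsNotNull` estimate as `1 - nullCount/rowCount` is my assumption):

```scala
// If the column stats report more nulls than the table has rows, the raw estimate
// goes negative, so clamp it into [0, 1].
def isNotNullSelectivity(nullCount: BigInt, rowCount: BigInt): Double = {
  val raw = 1.0 - nullCount.toDouble / rowCount.toDouble
  raw.max(0.0).min(1.0)
}

isNotNullSelectivity(nullCount = 150, rowCount = 100)  // 0.0 instead of -0.5
```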

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33286 from karenfeng/bound-selectivity-est.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-20 21:32:13 +08:00
gengjiaan 033a5731b4 [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/33299 and implement `prettyName` for `MakeTimestampNTZ` and `MakeTimestampLTZ` based on the discussion show below
https://github.com/apache/spark/pull/33299/files#r668423810

### Why are the changes needed?
This PR fixes the incorrect alias use case.

### Does this PR introduce _any_ user-facing change?
'No'.
Modifications are transparent to users.

### How was this patch tested?
Jenkins test.

Closes #33430 from beliefer/SPARK-36046-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-20 21:31:00 +08:00
Yuming Wang af978c87f1 [SPARK-36183][SQL] Push down limit 1 through Aggregate if it is group only
### What changes were proposed in this pull request?

Push down limit 1 through `Aggregate` and turn the `Aggregate` into a `Project` if it is group-only. For example:
```sql
create table t1 using parquet as select id from range(100000000L);
create table t2 using parquet as select id from range(100000000L);
create view v1 as select * from t1 union select * from t2;
select * from v1 limit 1;
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/125975690-55663515-c4c5-4a04-aedf-f8ba37581ba7.png) | ![image](https://user-images.githubusercontent.com/5399861/126168972-b2675e09-4f93-4026-b1be-af317205e57f.png)

### Why are the changes needed?

Improve query performance. This is a real case from the cluster:
![image](https://user-images.githubusercontent.com/5399861/125976597-18cb68d6-b22a-4d80-b270-01b2b13d1ef5.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33397 from wangyum/SPARK-36183.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-07-20 20:24:07 +08:00
gengjiaan 7aa01798c5 [SPARK-36091][SQL] Support TimestampNTZ type in expression TimeWindow
### What changes were proposed in this pull request?
The current implementation of `TimeWindow` only supports `TimestampType`. Spark added a new type, `TimestampNTZType`, so we should support `TimestampNTZType` in the `TimeWindow` expression.

### Why are the changes needed?
`TimestampNTZType` is similar to `TimestampType`, so we should also support it in the `TimeWindow` expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`TimeWindow` will accept `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33341 from beliefer/SPARK-36091.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-19 19:23:39 +08:00
Angerszhuuuu 313f3c5460 [SPARK-36093][SQL] RemoveRedundantAliases should not change Command's parameter's expression's name
### What changes were proposed in this pull request?
`RemoveRedundantAliases` may change the attribute name of a `DataWritingCommand`'s parameter.
In the added UT's case, the partition column is `CAL_DT` before `RemoveRedundantAliases`; the rule changes it to `cal_dt`, which causes the erroneous results shown below.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
For the SQL case below:
```
sql("create table t1(cal_dt date) using parquet")
sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
sql("create view t1_v as select * from t1")
sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'")
```

Before this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
+----+------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
+----+----------+
```

After this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
|   2|2021-06-29|
|   2|2021-06-30|
+----+------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
|   2|2021-06-29|
|   2|2021-06-30|
+----+----------+
```

### How was this patch tested?
Added UT

Closes #33324 from AngersZhuuuu/SPARK-36093.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 16:22:31 +08:00
Bessenyei Balázs Donát 92d4563124 [MINOR][SQL] Fix typo for config hint in SQLConf.scala
### What changes were proposed in this pull request?

This PR fixes typo for `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` in `SQLConf.scala`.

### Why are the changes needed?

This is a [Broken windows theory](https://en.wikipedia.org/wiki/Broken_windows_theory) change.

### Does this PR introduce _any_ user-facing change?

Yes. After merging this PR, for commands such as
```python
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
```
users will get a typo-free exception.

### How was this patch tested?

This is a trivial change.

Closes #33389 from bessbd/patch-1.

Authored-by: Bessenyei Balázs Donát <9086834+bessbd@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-18 15:33:26 -05:00
gengjiaan 42275bb20d [SPARK-36090][SQL] Support TimestampNTZType in expression Sequence
### What changes were proposed in this pull request?
The current implementation of `Sequence` accepts `TimestampType`, `DateType` and `IntegralType`. This PR lets `Sequence` also accept `TimestampNTZType`.

### Why are the changes needed?
So that we can generate sequences for timestamps without time zone.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR will let `Sequence` accept `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33360 from beliefer/SPARK-36090.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-18 20:46:23 +03:00
Kousuke Saruta 71ea25d4f5 [SPARK-36170][SQL] Change quoted interval literal (interval constructor) to be converted to ANSI interval types
### What changes were proposed in this pull request?

This PR changes the behavior of the quoted interval literals like `SELECT INTERVAL '1 year 2 month'` to be converted to ANSI interval types.

### Why are the changes needed?

The unit-to-unit interval literals and the unit list interval literals are converted to ANSI interval types, but quoted interval literals are still converted to CalendarIntervalType.

```
-- Unit list interval literals
spark-sql> select interval 1 year 2 month;
1-2
-- Quoted interval literals
spark-sql> select interval '1 year 2 month';
1 years 2 months
```

### Does this PR introduce _any_ user-facing change?

Yes but the following sentence in `sql-migration-guide.md` seems to cover this change.
```
  - In Spark 3.2, the unit list interval literals can not mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
For example, `INTERVAL 1 day 1 hour` is invalid in Spark 3.2. In Spark 3.1 and earlier,
there is no such limitation and the literal returns value of `CalendarIntervalType`.
To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
```

### How was this patch tested?

Modified existing tests and add new tests.

Closes #33380 from sarutak/fix-interval-constructor.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-17 12:23:37 +03:00
Chao Sun 37dc3f9ea7 [SPARK-36128][SQL] Apply spark.sql.hive.metastorePartitionPruning for non-Hive tables that uses Hive metastore for partition management
### What changes were proposed in this pull request?

In `CatalogFileIndex.filterPartitions`, check the config `spark.sql.hive.metastorePartitionPruning` and don't pushdown predicates to remote HMS if it is false. Instead, fallback to the `listPartitions` API and do the filtering on the client side.

### Why are the changes needed?

Currently the config `spark.sql.hive.metastorePartitionPruning` is only effective for Hive tables, and for non-Hive tables we'd always use the `listPartitionsByFilter` API from HMS client. On the other hand, by default all data source tables also manage their partitions through HMS, when the config `spark.sql.hive.manageFilesourcePartitions` is turned on. Therefore, it seems reasonable to extend the above config for non-Hive tables as well.

In certain cases the remote HMS service could throw exceptions when using the `listPartitionsByFilter` API, which, on the Spark side, is unrecoverable in the current state. Therefore it would be better to allow users to disable the API by using the above config.

For instance, HMS only allows pushing down date columns when direct SQL (instead of JDO) is used for interacting with the underlying RDBMS, and throws an exception otherwise. Even though the Spark Hive client attempts to recover when the exception happens, it only does so when the config `hive.metastore.try.direct.sql` from the remote HMS is `false`. There could be cases where `hive.metastore.try.direct.sql` is true but the remote HMS still throws an exception.

### Does this PR introduce _any_ user-facing change?

Yes now the config `spark.sql.hive.metastorePartitionPruning` is extended for non-Hive tables which use HMS to manage their partition metadata.

### How was this patch tested?

Added a new unit test:
```
build/sbt "hive/testOnly *PruneFileSourcePartitionsSuite -- -z SPARK-36128"
```

Closes #33348 from sunchao/SPARK-36128-by-filter.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-16 13:32:25 -07:00
Jungtaek Lim f2bf8b051b [SPARK-34893][SS] Support session window natively
Introduction: this PR is the last part of SPARK-10816 (event-time based sessionization, a.k.a. session window). Please refer to #31937 for an overall view of the code change. (Note that the code diff could have diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support session windows natively. Please refer to the comments/design doc in SPARK-10816 for more details on the rationale and design (it could be a bit outdated compared to the PR).

The boundary of a "session window" is defined as [the timestamp of the start event, the timestamp of the last event + gap duration). That said, unlike a time window, a session window is a dynamic window which can expand if a new input row is added to the session. To handle the expansion of session windows, Spark defines a session window per input row, and "merges" windows whose boundaries overlap.
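A tiny hand-worked sketch of that definition and of the merging (plain Scala, not Spark code; a gap of 10 time units):

```scala
// Each event t gets a per-row window [t, t + gap); overlapping windows merge into one session.
val gap = 10L
def mergeSessions(eventTimes: Seq[Long]): Seq[(Long, Long)] =
  eventTimes.sorted.foldLeft(List.empty[(Long, Long)]) {
    case ((start, end) :: rest, t) if t < end => (start, math.max(end, t + gap)) :: rest
    case (sessions, t)                        => (t, t + gap) :: sessions
  }.reverse

mergeSessions(Seq(0L, 7L, 30L))  // List((0,17), (30,40)): the event at 30 starts a new session
mergeSessions(Seq(0L, 7L, 15L))  // List((0,25)): all three events fall into one session
```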

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

The first one is preferable as it outperforms the second, though it can only be used when session-window merging can be applied together with the aggregation. That is not applicable in all cases, so the second approach covers the remaining ones.

This PR also optimizes merging input rows with existing sessions while retaining the order (group keys + session-window start timestamp), leveraging the fact that the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly, which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is a lower-level API and end users have to code all the details of defining session windows and merging them
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is the same as StructuredSessionization (the native session window version is shorter and clearer, even ignoring model classes).

39542bb81f/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (L66-L105)

(Worth noting that the code in StructuredSessionization only works with processing time. It doesn't consider that an old event can update the start time of an old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature to support session window on both batch and streaming query, which adds a new function "session_window" which usage is similar with "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-16 20:38:16 +09:00
Hyukjin Kwon fba61ad68b [SPARK-36169][SQL] Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)
### What changes were proposed in this pull request?

This PR proposes to move `spark.sql.sources.disabledJdbcConnProviderList` from SQLConf to StaticSQLConf which disallows to set in runtime.

### Why are the changes needed?

It's documented as a static configuration, so we should make it a proper static configuration.

### Does this PR introduce _any_ user-facing change?

Previously, the configuration could be set to a different value at runtime but had no effect.
Now it throws an exception if users try to set it at runtime.

### How was this patch tested?

An existing unit test was fixed; that should verify the change.

Closes #33381 from HyukjinKwon/SPARK-36169.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 11:43:22 +09:00
Gengliang Wang 96c2919988 [SPARK-36135][SQL] Support TimestampNTZ type in file partitioning
### What changes were proposed in this pull request?

Support TimestampNTZ type in file partitioning
* When there is no provided schema and the default timestamp type is TimestampNTZ, Spark should infer and parse timestamp partition values as TimestampNTZ.
* When the provided partition schema is TimestampNTZ, Spark should be able to parse the TimestampNTZ-typed partition column.

### Why are the changes needed?

File partitioning is an important feature and Spark should support TimestampNTZ type in it.

### Does this PR introduce _any_ user-facing change?

Yes, Spark supports TimestampNTZ type in file partitioning

### How was this patch tested?

Unit tests

Closes #33344 from gengliangwang/partition.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-16 01:13:32 +08:00
Jungtaek Lim 1ceb753ef5 [SPARK-36157][SQL][SS] TimeWindow expression: apply filter before project
### What changes were proposed in this pull request?

This PR proposes to change the application of the operators for TimeWindow, from project -> filter, to filter -> project.

Currently Spark applies the project and then the filter, although the filter does not depend on the project. That said, if input rows are going to be filtered out by the filter predicate, applying the projection to those rows is simply a waste of time.
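A schematic illustration of the reasoning with plain Scala collections (not the planner change itself):

```scala
// Projection work is only worth doing for rows that survive the filter.
val rows = (1 to 5).toList
def expensiveProjection(i: Int): Int = { Thread.sleep(1); i * 10 } // stand-in for the window projection
def keep(i: Int): Boolean = i % 2 == 0                             // stand-in for the filter predicate

val projectThenFilter = rows.map(expensiveProjection).filter(v => keep(v / 10)) // projects every row
val filterThenProject = rows.filter(keep).map(expensiveProjection)              // projects only kept rows
// Both yield List(20, 40), but the second does 2 projections instead of 5.
```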

### Why are the changes needed?

This is a simple improvement requiring changes to only a couple of lines.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33367 from HeartSaVioR/SPARK-36157.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-15 09:47:25 -07:00
Linhong Liu 4dfd266b27 [SPARK-36148][SQL] Fix input data types check for regexp_replace
### What changes were proposed in this pull request?
`RegExpReplace` overrides `checkInputDataTypes` but doesn't do the basic type check.
This PR adds the type check so that the error message is more readable.

### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
newly added test case

Closes #33357 from linhongliu-db/SPARK-36148-regexp-replace-check.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 12:23:28 +03:00
Gengliang Wang 564d3de7c6 [SPARK-36037][TESTS][FOLLOWUP] Avoid wrong test results on daylight saving time
### What changes were proposed in this pull request?

Only use zone ids that have no daylight saving time when testing `localtimestamp`.

### Why are the changes needed?

In https://github.com/apache/spark/pull/33346#discussion_r670135296, MaxGekk suggests that we should avoid wrong results if possible.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Unit test

Closes #33354 from gengliangwang/FIxDST.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 11:40:51 +03:00
Gengliang Wang 0973397721 [SPARK-36037][SQL][FOLLOWUP] Fix flaky test for datetime function localtimestamp
### What changes were proposed in this pull request?

The threshold of the test case "datetime function localtimestamp" is small, which leads to flaky test results
https://github.com/gengliangwang/spark/runs/3067396143?check_suite_focus=true

This PR increases the threshold for checking the two different current local datetimes from 5ms to 1 second. (The test case for current_timestamp uses 5 seconds.)
### Why are the changes needed?

Fix flaky test
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33346 from gengliangwang/fixFlaky.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-15 11:32:18 +08:00
Karen Feng e92b8ea6f8 [SPARK-36106][SQL][CORE] Label error classes for subset of QueryCompilationErrors
### What changes were proposed in this pull request?

Adds error classes to some of the exceptions in QueryCompilationErrors.

### Why are the changes needed?

Improves auditing for developers and adds useful fields for users (error class and SQLSTATE).

### Does this PR introduce _any_ user-facing change?

Yes, fills in missing error class and SQLSTATE fields.

### How was this patch tested?

Existing tests and new unit tests.

Closes #33309 from karenfeng/group-compilation-errors-1.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-15 11:43:18 +09:00
Geek 1e86345ae3 [SPARK-36069][SQL] Add field info to from_json's exception in the FAILFAST mode
### What changes were proposed in this pull request?

Make the Spark function `from_json` output the field name, field type and field value when the FAILFAST mode throws an exception.

### Why are the changes needed?

This information is very important for developers to find where the erroneous input data is located.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

org/apache/spark/sql/JsonFunctionsSuite.scala:598
test("[SPARK-36069] from_json invalid json schema - check field name and field value")

Closes #33297 from geekyouth/feature/FAILFAST_output_fidelaName_fieldValue_dataType.

Lead-authored-by: Geek <forsupergeeker@gmail.com>
Co-authored-by: 极客青年 <forsupergeeker@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-14 21:28:15 +03:00
Linhong Liu b86645776b [SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range
### What changes were proposed in this pull request?
DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a range that is much larger.
We can add or subtract large intervals from a date/timestamp and the system will happily process and display large negative and positive dates.

Since we obviously cannot put this genie back into the bottle, the only thing we can do is allow matching DATE/TIMESTAMP literals.

### Why are the changes needed?
Make Spark more usable; this is also a bug fix.

### Does this PR introduce _any_ user-facing change?
Yes, after this PR, below SQL will have different results
```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```

```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```

### How was this patch tested?
newly added test cases

Closes #32959 from linhongliu-db/SPARK-35780-full-range-datetime.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 18:11:39 +08:00
Fu Chen 103d16e868 [SPARK-36130][SQL] UnwrapCastInBinaryComparison should skip In expression when in.list contains an expression that is not literal
### What changes were proposed in this pull request?

Fix [comment](https://github.com/apache/spark/pull/32488#issuecomment-879315179)
This PR fixes a bug in the rule `UnwrapCastInBinaryComparison`: the rule should skip an `In` expression when `in.list` contains an expression that is not a literal.

- In

Before this PR, the following example throws an exception:
```scala
  withTable("tbl") {
    sql("CREATE TABLE tbl (d decimal(33, 27)) USING PARQUET")
    sql("SELECT d FROM tbl WHERE d NOT IN (d + 1)")
  }
```
- InSet

The analyzer guarantees that all the elements in `inSet.hset` are literals, so this is not an issue for `InSet`.

fbf53dee37/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L264-L279)

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

New test.

Closes #33335 from cfmcgrady/SPARK-36130.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 15:57:10 +08:00
gengjiaan b4f7758944 [SPARK-36037][SQL] Support ANSI SQL LOCALTIMESTAMP datetime value function
### What changes were proposed in this pull request?
`LOCALTIMESTAMP()` is a datetime value function from ANSI SQL.
The syntax show below:
```
<datetime value function> ::=
    <current date value function>
  | <current time value function>
  | <current timestamp value function>
  | <current local time value function>
  | <current local timestamp value function>
<current date value function> ::=
CURRENT_DATE
<current time value function> ::=
CURRENT_TIME [ <left paren> <time precision> <right paren> ]
<current local time value function> ::=
LOCALTIME [ <left paren> <time precision> <right paren> ]
<current timestamp value function> ::=
CURRENT_TIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
<current local timestamp value function> ::=
LOCALTIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
```

`LOCALTIMESTAMP()` returns the current timestamp at the start of query evaluation as TIMESTAMP WITHOUT TIME ZONE. This is similar to `CURRENT_TIMESTAMP()`.
Note we need to update the optimization rule `ComputeCurrentTime` so that Spark returns the same result in a single query if the function is called multiple times.

### Why are the changes needed?
`CURRENT_TIMESTAMP()` returns the current timestamp at the start of query evaluation.
`LOCALTIMESTAMP()` returns the current timestamp without time zone at the start of query evaluation.
The `LOCALTIMESTAMP` function is an ANSI SQL.
The `LOCALTIMESTAMP` function is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'. Support new function `LOCALTIMESTAMP()`.
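A quick usage sketch of the new function (output omitted since it depends on the session clock and time zone):

```scala
// localtimestamp() yields TIMESTAMP WITHOUT TIME ZONE; current_timestamp() yields Spark's
// regular TimestampType with local-time-zone semantics.
spark.sql("SELECT localtimestamp(), current_timestamp()").show(false)
```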

### How was this patch tested?
New tests.

Closes #33258 from beliefer/SPARK-36037.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 15:38:46 +08:00
Wenchen Fan 583173b7cc [SPARK-36033][SQL][TEST] Validate partitioning requirements in TPCDS tests
### What changes were proposed in this pull request?

Make sure all physical plans of TPCDS queries are valid (satisfy the partitioning requirement).

### Why are the changes needed?

improve test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33248 from cloud-fan/aqe2.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-13 21:17:13 +08:00