Commit graph

5079 commits

gengjiaan cef28c2c51 [SPARK-32670][SQL][FOLLOWUP] Group exception messages in Catalyst Analyzer in one file
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/29497.
https://github.com/apache/spark/pull/29497 only gave an example of grouping some of the `AnalysisException`s thrown in the Analyzer into `QueryCompilationErrors`.
This PR groups the remaining `AnalysisException`s in the Analyzer into `QueryCompilationErrors`.
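
A rough sketch of the pattern, for illustration only (the method name and message below are made up; the real object lives in `org.apache.spark.sql.errors` inside Spark's source tree):

```scala
package org.apache.spark.sql.errors

import org.apache.spark.sql.AnalysisException

// Each error gets a dedicated method, so the message text lives in one place.
object QueryCompilationErrors {
  def cannotResolveColumnError(name: String): Throwable =
    new AnalysisException(s"Cannot resolve column '$name'")
}

// Call sites in the Analyzer then become:
//   throw QueryCompilationErrors.cannotResolveColumnError(name)
```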

### Why are the changes needed?
It will largely help standardize error messages and make them easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests; all original tests pass, confirming that the change doesn't break any existing behavior.

Closes #30564 from beliefer/SPARK-32670-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 08:38:24 +00:00
Terry Kim b112e2bfa6 [SPARK-33714][SQL] Migrate ALTER VIEW ... SET/UNSET TBLPROPERTIES commands to use UnresolvedView to resolve the identifier
### What changes were proposed in this pull request?

This PR adds `allowTemp` flag to `UnresolvedView` so that `Analyzer` can check whether to resolve temp views or not.

This PR also migrates `ALTER VIEW ... SET/UNSET TBLPROPERTIES` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedView` for view resolution.

One benefit is that the exception message is better for `ALTER VIEW ... SET/UNSET TBLPROPERTIES`. Before, if a temp view is passed, you just get a `NoSuchTableException` with `Table or view 'tmpView' not found in database 'default'`. With this PR, you get a more descriptive exception message: `tmpView is a temp view. ALTER VIEW ... SET TBLPROPERTIES expects a permanent view`.
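
A short reproduction of the message change described above (the property name is arbitrary):

```scala
spark.sql("CREATE TEMPORARY VIEW tmpView AS SELECT 1")
spark.sql("ALTER VIEW tmpView SET TBLPROPERTIES ('p' = 'v')")
// Before: NoSuchTableException: Table or view 'tmpView' not found in database 'default'
// After:  AnalysisException: tmpView is a temp view. ALTER VIEW ... SET TBLPROPERTIES
//         expects a permanent view.
```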

### Does this PR introduce _any_ user-facing change?

The exception message changes as described above.

### How was this patch tested?

Updated existing tests.

Closes #30676 from imback82/alter_view_set_unset_properties.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 05:18:34 +00:00
Max Gekk af37c7f411 [SPARK-33558][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. ADD PARTITION` parsing tests to `AlterTableAddPartitionParserSuite`
2. Place the v1 tests for `ALTER TABLE .. ADD PARTITION` from `DDLSuite` and the v2 tests from `AlterTablePartitionV2SQLSuite` into the common trait `AlterTableAddPartitionSuiteBase`, so that the tests run for V1, Hive V1 and V2 data sources.

### Why are the changes needed?
- The unification allows running common `ALTER TABLE .. ADD PARTITION` tests for DSv1, Hive DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes #30685 from MaxGekk/unify-alter-table-add-partition-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 04:54:52 +00:00
Anton Okolnychyi fa9ce1d4e8 [SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate
### What changes were proposed in this pull request?

This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize delete conditions like we optimize filters.
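
A hedged sketch of what this enables, assuming a V2 catalog named `testcat` and a table `t` with an integer column `id`:

```scala
// The nullable NULL branch in the delete condition can now be replaced with FALSE,
// just as it already is for WHERE filters, letting the whole condition fold to FALSE.
spark.sql(
  "DELETE FROM testcat.ns.t WHERE CASE WHEN id > 100 THEN NULL ELSE FALSE END")
```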

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `DeleteFromTable`.

Closes #30688 from aokolnychyi/spark-33722.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-09 11:42:54 -08:00
HyukjinKwon b5399d4ef1 [SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unit tests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also run them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-09 20:26:18 +09:00
Terry Kim 29fed23ba1 [SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `MSCK REPAIR TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `MSCK REPAIR TABLE` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("MSCK REPAIR TABLE t") // works fine
```
But after this PR:
```
sql("MSCK REPAIR TABLE t")
org.apache.spark.sql.AnalysisException: t is a temp view. 'MSCK REPAIR TABLE' expects a table; line 1 pos 0
```
This is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `MSCK REPAIR TABLE t` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30664 from imback82/repair_table_V2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-09 05:06:37 +00:00
Wenchen Fan 6fd234503c [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
### What changes were proposed in this pull request?

Currently, Spark treats 0.0 and -0.0 as semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set.

The comparison expressions in Spark take care of the special floating-point numbers and implement the correct semantics. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating-point numbers before comparing them in these places:
1. GROUP BY
2. join keys
3. window partition keys

This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++
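
A minimal illustration of the issue, as we read it (run in spark-shell, where the implicits are in scope):

```scala
Seq(0.0, -0.0).toDF("d").selectExpr("approx_count_distinct(d)").show()
// Before this PR: 2, since 0.0 and -0.0 have different bit patterns and hash
// differently inside HyperLogLog++.
// After this PR:  1, since -0.0 is normalized to 0.0 before hashing.
```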

### Why are the changes needed?

Fix the query result

### Does this PR introduce _any_ user-facing change?

Yes, the result of HyperLogLog++ becomes correct now.

### How was this patch tested?

A new test case, plus a few more test cases that already passed before this PR, to improve test coverage.

Closes #30673 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-08 11:41:35 -08:00
Terry Kim c05ee06f5b [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
### What changes were proposed in this pull request?

This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier.

This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #30636 from imback82/drop_view_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 14:07:58 +00:00
Max Gekk 2b30dde249 [SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework
### What changes were proposed in this pull request?
1. Remove old statement `ShowTableStatement`
2. Introduce new command `ShowTableExtended` for  `SHOW TABLE EXTENDED`.

This PR is the first step of new V2 implementation of `SHOW TABLE EXTENDED`, see SPARK-33393.

### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.

### Does this PR introduce _any_ user-facing change?
The changes should not affect V1 tables. For V2, Spark outputs the error:
```
SHOW TABLE EXTENDED is not supported for v2 tables.
```

### How was this patch tested?
By running `SHOW TABLE EXTENDED` tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30645 from MaxGekk/show-table-extended-statement.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 12:08:22 +00:00
luluorta 99613cd581 [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
### What changes were proposed in this pull request?
`LikeSimplification` rule does not work correctly for many cases that have patterns containing escape characters, for example:

`SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
`SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`

For simplicity, this PR just skips this rule if `pattern` contains any `escapeChar`.

### Why are the changes needed?
Incorrect query results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Unit test.

Closes #30625 from luluorta/SPARK-33677.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-08 20:45:25 +09:00
Dongjoon Hyun 031c5ef280 [SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default
### What changes were proposed in this pull request?

This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark **3.2.0**.

### Why are the changes needed?

By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature in a wider variety of situations.

### Does this PR introduce _any_ user-facing change?

Yes, but this is an improvement and it's supposed to have no bugs.
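
Users who prefer the previous behavior can still turn the feature off, e.g.:

```scala
// Disable adaptive query execution for the current session:
spark.conf.set("spark.sql.adaptive.enabled", "false")
// or equivalently in SQL:
spark.sql("SET spark.sql.adaptive.enabled=false")
```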

### How was this patch tested?

Pass the CIs.

Closes #30628 from dongjoon-hyun/SPARK-33679.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 23:10:35 -08:00
Terry Kim 5aefc49b0f [SPARK-33664][SQL] Migrate ALTER TABLE ... RENAME TO to use UnresolvedTableOrView to resolve identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER [TABLE|VIEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #30610 from imback82/rename_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 03:54:16 +00:00
Wenchen Fan c0874ba9f1 [SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of char/varchar table insertion length check, to not expose user data.

### Why are the changes needed?

It is risky to expose user data in the error message, especially string data, as it may contain sensitive information.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #30653 from cloud-fan/minor2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 13:35:37 -08:00
Wenchen Fan 6aff215077 [SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one completely covers the other.

### Does this PR introduce _any_ user-facing change?

No, the config is being deprecated, not removed.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 10:50:31 -08:00
Josh Soref c62b84a043 [MINOR] Spelling sql not core
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `sql/catalyst`
* `sql/hive-thriftserver`
* `sql/hive`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30532 from jsoref/spelling-sql-not-core.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-07 08:40:29 -06:00
Kent Yao da72b87374 [SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results
### What changes were proposed in this pull request?

In this PR, we propose to narrow the use cases of the char/varchar data types that are invalid now or will be later.

### Why are the changes needed?
1. udf
```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))

scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
  at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
  ... 47 elided
```

2. spark.createDataFrame

```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
|                   c|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
|                    |
|<https://spark.ap...|
|                    |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
|                    |
|                    |
```

3. reader.schema

```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
|                   a|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```
4. etc

### Does this PR introduce _any_ user-facing change?

No, we intend to avoid a potential breaking change.

### How was this patch tested?

new tests

Closes #30586 from yaooqinn/SPARK-33641.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 13:40:15 +00:00
Yuming Wang 1e0c006748 [SPARK-33617][SQL] Add default parallelism configuration for Spark SQL queries
### What changes were proposed in this pull request?

This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`.

### Why are the changes needed?

Avoid generating small files for INSERT INTO TABLE from VALUES, for example:
```sql
CREATE TABLE t1(id int) USING parquet;
INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8);
```

Before this PR:
```
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:54 _SUCCESS
```

After this PR, with `spark.sql.files.minPartitionNum` set to 1:
```
-rw-r--r-- 1 root root 452 Dec  1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:59 _SUCCESS
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30559 from wangyum/SPARK-33617.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 21:36:52 +09:00
Max Gekk 26c0493318 [SPARK-33676][SQL] Require exact matching of partition spec to the schema in V2 ALTER TABLE .. ADD/DROP PARTITION
### What changes were proposed in this pull request?
Check that partition specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match the partition schema (all partition fields from the schema are specified in the partition specs).

### Why are the changes needed?
1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error:
```sql
spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b);
spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9');
Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`';
```
2. To prevent future errors caused by not fully specified partition specs.

### Does this PR introduce _any_ user-facing change?
Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` outputs the same error as the V1 commands.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes #30624 from MaxGekk/add-partition-full-spec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 08:14:36 +00:00
Chao Sun e857e06452 [SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache
### What changes were proposed in this pull request?

This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.

### Why are the changes needed?

Currently, DSv2 delete from table doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later.
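
A hedged sketch of the stale-cache scenario (the catalog and table names are assumed):

```scala
spark.table("testcat.ns.t").cache()                  // cache a DSv2 table
spark.sql("DELETE FROM testcat.ns.t WHERE id = 1")   // delete some rows
// Before this change the cached data was not refreshed, so the read below could
// still return the deleted row; now the relevant cache entries are refreshed.
spark.sql("SELECT * FROM testcat.ns.t").show()
```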

### Does this PR introduce _any_ user-facing change?

Yes. Now delete from table in v2 also refreshes cache.

### How was this patch tested?

Added a test case.

Closes #30597 from sunchao/SPARK-33652.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:14:22 -08:00
Wenchen Fan 1b4e35d1a8 [SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data source tables
### What changes were proposed in this pull request?

This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it.

### Why are the changes needed?

It's an unnecessary behavior difference that a Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while a data source table doesn't allow `CREATE EXTERNAL TABLE` at all.

### Does this PR introduce _any_ user-facing change?

Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed.
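
For example, the following is now accepted for a data source table (the path is arbitrary):

```scala
spark.sql("""
  CREATE EXTERNAL TABLE ext_t (id INT, name STRING)
  USING parquet
  LOCATION '/tmp/ext_t'
""")
```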

### How was this patch tested?

new tests

Closes #30595 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 16:48:31 -08:00
Dongjoon Hyun de9818f043 [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
Wenchen Fan acc211d2cf [SPARK-33141][SQL][FOLLOW-UP] Store the max nested view depth in AnalysisContext
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30289. It removes the hack in `View.effectiveSQLConf`, by putting the max nested view depth in `AnalysisContext`. Then we don't get the max nested view depth from the active SQLConf, which keeps changing during nested view resolution.

### Why are the changes needed?

remove hacks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

If I just remove the hack, `SimpleSQLViewSuite.restrict the nested level of a view` fails. With this fix, it passes again.

Closes #30575 from cloud-fan/view.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 14:01:15 +00:00
Jungtaek Lim (HeartSaVioR) 233a8494c8 [SPARK-27237][SS] Introduce State schema validation among query restart
## What changes were proposed in this pull request?

Please refer to the description of [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) for the rationale of this patch.

This patch proposes to introduce state schema validation by storing the key schema and value schema in a `schema` file (on the first run) and verifying that new key and value schemas for state are compatible with the existing ones. To be clear about the definition of "compatible": a state schema is "compatible" when the number of fields is the same and the data type of each field is the same - Spark has always allowed renaming a field.

This patch prevents running a query that has an incompatible state schema, which reduces the chance of nondeterministic behavior (renaming a field is arguably also a smell of semantic incompatibility, but end users could just modify the name, so we can't treat it as such), as well as providing a more informative error message.

## How was this patch tested?

Added UTs.

Closes #24173 from HeartSaVioR/SPARK-27237.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 19:33:11 +09:00
Max Gekk 94c144bdd0 [SPARK-33571][SQL][DOCS] Add a ref to INT96 config from the doc for spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read
### What changes were proposed in this pull request?
For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
1. Explicitly document which Parquet types those configs influence
2. Refer to corresponding configs for `INT96`

### Why are the changes needed?
To avoid user confusion like that reported in SPARK-33571, and to make the config descriptions more precise.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30596 from MaxGekk/clarify-rebase-docs.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:26:07 +09:00
Gengliang Wang e8380665c7 [SPARK-33658][SQL] Suggest using Datetime conversion functions for invalid ANSI casting
### What changes were proposed in this pull request?

Suggest that users use the datetime conversion functions in the error message of invalid ANSI explicit casting.

### Why are the changes needed?

In ANSI mode, explicit cast between DateTime types and Numeric types is not allowed.
As of now, we have introduced new functions `UNIX_SECONDS`/`UNIX_MILLIS`/`UNIX_MICROS`/`UNIX_DATE`/`DATE_FROM_UNIX_DATE`, we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode.
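
A rough illustration of the behavior (the exact message text may differ):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Explicit casts between datetime and numeric types are rejected under ANSI mode;
// the error now points at functions such as UNIX_SECONDS/UNIX_MICROS instead.
spark.sql("SELECT CAST(TIMESTAMP'2020-12-04 00:00:00' AS BIGINT)")
```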

### Does this PR introduce _any_ user-facing change?

Yes, better error messages

### How was this patch tested?

Unit test

Closes #30603 from gengliangwang/improveErrorMsgOfExplicitCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:24:41 +09:00
Linhong Liu e02324f2dd [SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view
### What changes were proposed in this pull request?
Currently, in Spark, a temp view is saved as its analyzed logical plan, while a permanent view
is kept in the HMS with its original SQL text. As a result, permanent and temporary views have
different behaviors in some cases. In this PR we store the SQL text for temporary views in order
to unify the behavior between permanent and temporary views.

### Why are the changes needed?
to unify the behavior between permanent and temporary views

### Does this PR introduce _any_ user-facing change?
Yes, with this PR the temporary view will be re-analyzed when it's referenced. So if the
underlying data source changes, the view will be updated as well.
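
A hedged illustration of the behavior change (the table and view names are hypothetical):

```scala
spark.sql("CREATE TABLE src (id INT) USING parquet")
spark.sql("CREATE TEMPORARY VIEW v AS SELECT * FROM src")
spark.sql("ALTER TABLE src ADD COLUMNS (name STRING)")
// Before: v kept the logical plan analyzed at creation time and did not see `name`.
// After:  the stored SQL text is re-analyzed when v is referenced, so `name` shows up.
spark.sql("SELECT * FROM v").show()
```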

### How was this patch tested?
existing and newly added test cases

Closes #30567 from linhongliu-db/SPARK-33142.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 06:48:49 +00:00
Gengliang Wang 29e415deac [SPARK-33649][SQL][DOC] Improve the doc of spark.sql.ansi.enabled
### What changes were proposed in this pull request?

Improve the documentation of SQL configuration `spark.sql.ansi.enabled`

### Why are the changes needed?

As there are more and more new features under the SQL configuration `spark.sql.ansi.enabled`, we should make it more clear about:
1. what exactly it is
2. where can users find all the features of the ANSI mode
3. whether all the features are exactly from the SQL standard

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

It's just doc change.

Closes #30593 from gengliangwang/reviseAnsiDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-12-04 10:58:41 +08:00
Max Gekk 85949588b7 [SPARK-33650][SQL] Fix the error from ALTER TABLE .. ADD/DROP PARTITION for non-supported partition management table
### What changes were proposed in this pull request?
In the PR, I propose to change the order of post-analysis checks for the `ALTER TABLE .. ADD/DROP PARTITION` command, and perform the general check (does the table support partition management at all) before specific checks.

### Why are the changes needed?
The error message for the table which doesn't support partition management can mislead users:
```java
PartitionSpecs are not resolved;;
'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
+- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable5d3ff859
```
because it says nothing about the root cause of the issue.

### Does this PR introduce _any_ user-facing change?
Yes. After the change, the error message will be:
```
Table ns1.ns2.tbl can not alter partitions
```

### How was this patch tested?
By running the affected test suite `AlterTablePartitionV2SQLSuite`.

Closes #30594 from MaxGekk/check-order-AlterTablePartition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 16:43:15 -08:00
Anton Okolnychyi aa13e207c9 [SPARK-33623][SQL] Add canDeleteWhere to SupportsDelete
### What changes were proposed in this pull request?

This PR provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time.

### Why are the changes needed?

The only way to support delete statements right now is to implement ``SupportsDelete``. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. like deleting a complete partition in a Hive table).

This PR actually provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time instead of just getting an exception during execution. In the future, we can use this functionality to decide whether Spark should rewrite this delete and execute a distributed query or it can just pass a set of filters.

Consider an example of a partitioned Hive table. If we have a delete predicate like `part_col = '2020'`, we can just drop the matching partition to satisfy this delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this as a delete without significant effort. At the same time, if we have a delete predicate like `id = 10`, Hive tables would not be able to execute this delete using a metadata only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere` and we should use a more sophisticated row-level API to find out which records should be removed (the API is yet to be discussed, but we need this PR as a basis).

If we decide to support subqueries and all delete use cases by simply extending the existing API, this will mean all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that Spark will execute a plan to find which records must be deleted for data sources that return `false` from `canDeleteWhere`.
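
A rough sketch of how a source might implement the new hook, assuming the `SupportsDelete` shape described above (a `Table` mix-in whose `canDeleteWhere`/`deleteWhere` take `Filter[]`); the table itself is hypothetical:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsDelete, TableCapability}
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.types.StructType

class PartitionedDemoTable extends SupportsDelete {
  override def name(): String = "demo"
  override def schema(): StructType =
    new StructType().add("part_col", "string").add("id", "int")
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.noneOf(classOf[TableCapability])

  // New in this PR: reject, at planning time, deletes we cannot satisfy cheaply.
  override def canDeleteWhere(filters: Array[Filter]): Boolean =
    filters.forall {
      case EqualTo("part_col", _) => true   // whole-partition delete: metadata-only
      case _                      => false  // row-level delete: needs a rewrite
    }

  override def deleteWhere(filters: Array[Filter]): Unit = {
    // drop the matching partitions (omitted in this sketch)
  }
}
```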

### Does this PR introduce _any_ user-facing change?

Yes but it is backward compatible.

### How was this patch tested?

This PR comes with a new test.

Closes #30562 from aokolnychyi/spark-33623.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 09:12:30 -08:00
Wenchen Fan 0706e64c49 [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command
### What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false.

This is a retry after we unify the CREATE TABLE syntax. It partially reverts d2bec5e265

This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which is an unnecessary behavior difference from Hive tables.
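
A hedged sketch of the effect, assuming a session started with `spark.sql.legacy.createHiveTableByDefault=false`:

```scala
spark.sql("CREATE TABLE t1 (id INT)")                     // now a native Parquet table
spark.sql("CREATE TABLE t2 (id INT) STORED AS textfile")  // explicit STORED AS: still Hive
spark.sql("CREATE TABLE t3 (id INT) USING orc")           // explicit USING is unaffected
```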

### Why are the changes needed?

Changing from Hive text table to native Parquet table has many benefits:
1. be consistent with `DataFrameWriter.saveAsTable`.
2. better performance
3. better support for nested types (Hive text tables don't work well with nested types; e.g. `insert into t values struct(null)` actually inserts a null value, not `struct(null)`, if `t` is a Hive text table, which leads to wrong results)
4. better interoperability as Parquet is a more popular open file format.

### Does this PR introduce _any_ user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that work for Hive tables also work for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE | SERDEPROPERTIES]` and `LOAD DATA`.

char/varchar behavior has been taken care of by https://github.com/apache/spark/pull/30412, and there is no behavior difference between data source and Hive tables.

One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough.

Another potential issue is users may use Spark to create the table and then use Hive to add partitions with different serde. This is not allowed for Spark native tables.

### How was this patch tested?

Re-enable the tests

Closes #30554 from cloud-fan/create-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 15:24:44 +00:00
luluorta 512fb32b38 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case of codegen when casting float to Integer
### What changes were proposed in this pull request?
This is a followup of [#27151](https://github.com/apache/spark/pull/27151). It fixes the same issue for the codegen path.

### Why are the changes needed?
Incorrect query results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Unit test.

Closes #30585 from luluorta/SPARK-26218.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:58:56 +00:00
Gengliang Wang ff13f574e6 [SPARK-20044][SQL] Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
### What changes were proposed in this pull request?

Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between Date type and Numeric types.
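
A quick example of the two new functions:

```scala
spark.sql("SELECT DATE_FROM_UNIX_DATE(1), UNIX_DATE(DATE'1970-01-02')").show()
// DATE_FROM_UNIX_DATE(1)      -> 1970-01-02  (days since epoch to date)
// UNIX_DATE(DATE'1970-01-02') -> 1           (date to days since epoch)
```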

### Why are the changes needed?

1. Explicit conversion between Date type and Numeric types is disallowed in ANSI mode. We need to provide new functions for users to complete the conversion.

2. We have introduced new functions from BigQuery for conversion between Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense to add functions for conversion between Date type and Numeric types as well.

### Does this PR introduce _any_ user-facing change?

Yes, two new datetime functions are added.

### How was this patch tested?

Unit tests

Closes #30588 from gengliangwang/dateToNumber.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:04:08 +00:00
Gengliang Wang b76c6b759c [SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS
### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/28534 adds functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamp, this PR is to add functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to numbers.
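
A quick example of the three new functions:

```scala
spark.sql("SELECT UNIX_SECONDS(TIMESTAMP'1970-01-01 00:00:01Z')").show() // 1
spark.sql("SELECT UNIX_MILLIS(TIMESTAMP'1970-01-01 00:00:01Z')").show()  // 1000
spark.sql("SELECT UNIX_MICROS(TIMESTAMP'1970-01-01 00:00:01Z')").show()  // 1000000
```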

### Why are the changes needed?

1. Symmetry of the conversion functions
2. Casting timestamp type to numeric types is disallowed in ANSI mode, we should provide functions for users to complete the conversion.

### Does this PR introduce _any_ user-facing change?

3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type.

### How was this patch tested?

Unit tests.

Closes #30566 from gengliangwang/timestampLong.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-02 12:44:39 -08:00
yi.wu a082f4600b [SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin
### What changes were proposed in this pull request?

Currently, `join()` uses `withPlan(logicalPlan)` for convenience when calling some Dataset functions. But this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` will create a new Dataset with a new id and reset the `dataset_id` to the new id of the `logicalPlan`). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` but use the `logicalPlan` directly so its `dataset_id` doesn't change.

Besides, this PR also removes related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` tries to construct its own metadata, because the `Alias` is no longer a reference column after converting to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.

### Why are the changes needed?

For the query below, it returns a wrong result, while it should throw an ambiguous self-join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

### Does this PR introduce _any_ user-facing change?

Yes, users hit the exception instead of the wrong result after this PR.

### How was this patch tested?

Added a new unit test.

Closes #30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 17:51:22 +00:00
xuewei.linxuewei 58583f7c3f [SPARK-33619][SQL] Fix GetMapValueUtil code generation error
### What changes were proposed in this pull request?

Codegen bug fix for an issue introduced by SPARK-33460:

```
GetMapValueUtil

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

SHOULD BE

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""
```

The reason SPARK-33460 failed to detect this bug via UT is that `checkExceptionInExpression` did not work as expected: unlike `checkEvaluation`, it did not try evaluating the expression in BOTH `CODEGEN_ONLY` and `NO_CODEGEN` modes. This PR also fixes that test bug.

### Why are the changes needed?

Bug Fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new UT and ran the existing UTs.

Closes #30560 from leanken/leanken-SPARK-33619.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 16:10:45 +00:00
HyukjinKwon df8d3f1bf7 [SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify the documentation more
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30504. It proposes:

- Rename `NoSideEffect` to `NoThrow`, and use `Expression.deterministic` together where it is used.
- Clarify, in the docs in the expressions, that it means they don't throw exceptions

### Why are the changes needed?

`NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic.
It's best to be explicit, so `NoThrow` was proposed - I looked for a similar name to represent this concept and borrowed the name from [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow).
For determinism, we already have a way to note it under `Expression.deterministic`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the existing unit tests.

Closes #30570 from HyukjinKwon/SPARK-33544.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 16:03:08 +00:00
Dongjoon Hyun 290aa02179 [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work
### What changes were proposed in this pull request?

This reverts commit SPARK-33212 (cb3fa6c936) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.

### Why are the changes needed?

According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.

**1. Spark distribution with `-Phadoop-cloud`**

```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```

**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 18:23:48 +09:00
Cheng Su 51ebcd95a5 [SPARK-32863][SS] Full outer stream-stream join
### What changes were proposed in this pull request?

This PR adds full outer stream-stream join (see the sketch after this list); the implementation of full outer join is:
* For left side input row, check if there's a match on right side state store.
  * if there's a match, output the joined row, o.w. output nothing. Put the row in left side state store.
* For right side input row, check if there's a match on left side state store.
  * if there's a match, output the joined row, o.w. output nothing. Put the row in right side state store.
* State store eviction: evict rows from left/right side state store below watermark, and output rows never matched before (a combination of left outer and right outer join).
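
A rough sketch of the newly supported join type, using the built-in rate source (the column aliases and join window are ours):

```scala
import org.apache.spark.sql.functions._

val left = spark.readStream.format("rate").load()
  .select(col("timestamp").as("leftTime"), col("value").as("leftId"))
  .withWatermark("leftTime", "10 minutes")
val right = spark.readStream.format("rate").load()
  .select(col("timestamp").as("rightTime"), col("value").as("rightId"))
  .withWatermark("rightTime", "10 minutes")

// "full_outer" is now accepted for stream-stream joins (with watermarks and a
// time-range condition so state can be evicted).
val joined = left.join(
  right,
  expr("leftId = rightId AND rightTime BETWEEN leftTime AND leftTime + interval 1 minute"),
  "full_outer")
```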

### Why are the changes needed?

Enable more use cases for spark stream-stream join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30395 from c21/stream-foj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-02 10:17:00 +09:00
Thomas Graves f71f34572d [SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size of its children
### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not-null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is over a created array and not using Literals (Literals are already handled). When this happens you know that the values aren't null and the array has a size. The empty-array case is already handled.

The not-null check is already optimized out because CreateArray and CreateMap are not nullable; that leaves the size > 0 check. To handle that, this PR makes the size > 0 check get optimized in ConstantFolding to the number of children of the array or map. That makes it a literal, which is then ultimately optimized out.
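
A hedged illustration (the table and columns are hypothetical):

```scala
// With an inline-created array, ConstantFolding can now turn size(array(a, b))
// into the literal 2, so the injected "size(...) > 0" filter becomes "2 > 0"
// and is subsequently optimized away.
spark.sql("SELECT explode(array(a, b)) AS v FROM t")
```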

### Why are the changes needed?
remove unneeded filter

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Unit tests added and manually tested various cases

Closes #30504 from tgravescs/SPARK-33544.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 09:50:02 +09:00
Anton Okolnychyi c24f2b2d6a [SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer
### What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

### Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic, but it would require changing quite a few places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

Closes #30558 from aokolnychyi/spark-33612.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 09:27:46 -08:00
Anton Okolnychyi 478fb7f528 [SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates
### What changes were proposed in this pull request?

This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`.

### Why are the changes needed?

Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The PR adds 3 new test cases.

Closes #30555 from aokolnychyi/spark-33608.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 14:11:01 +00:00
Prakhar Jain cf4ad212b1 [SPARK-33503][SQL] Refactor SortOrder class to allow multiple children
### What changes were proposed in this pull request?
This is a followup of #30302. As part of this PR, the sameOrderExpressions set is made part of the children of the SortOrder node, so that they don't need the special handling done in #30302.

### Why are the changes needed?
sameOrderExpressions should get the same treatment as children. Making them part of children makes it easy to transform them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-01 21:13:27 +09:00
gengjiaan 9273d4250d [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue
### What changes were proposed in this pull request?
Spark already supports the `LIKE ANY` syntax, but it throws a `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for `LIKE ANY` to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to connect each `Like(e, p)`, which makes the call depth of the thread too large, causing the `StackOverflowError`.

Why does the fix in this PR avoid the error?
This PR adds a built-in function for `LIKE ANY`, which avoids this issue.
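
For reference, `LIKE ANY` is true when the value matches at least one of the given patterns (the table and column are hypothetical):

```scala
spark.sql("SELECT * FROM t WHERE s LIKE ANY ('%spark%', 'sql%', '%error')")
```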

### Why are the changes needed?
1. Fix the `StackOverflowError` issue.
2. Support the built-in function `like_any`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test.

Closes #30465 from beliefer/SPARK-33045-like_any-bak.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:48:30 +00:00
Gabor Somogyi e5bb2937f6 [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API
### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause an infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to use the newly added functionality. The Structured Streaming migration guide contains more information about what migration considerations must be made. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.

The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach  (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed the Kafka auto topic creation feature, but only in the `AdminClient` based approach (please see the doc for the rationale). In short, it's very hidden, it's unclear whether anybody ever used it in production, and it's error prone.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`

### Why are the changes needed?
Driver may hang forever.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
Cluster test with simple Kafka topic to another topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #29729 from gaborgsomogyi/SPARK-32032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 20:34:00 +09:00
zky.zhoukeyong 1034815519 [SPARK-33572][SQL] Datetime building should fail if the year, month, ..., second combination is invalid
### What changes were proposed in this pull request?
Datetime building should fail if the year, month, ..., second combination is invalid when ANSI mode is enabled. This patch updates MakeDate, MakeTimestamp and MakeInterval accordingly.
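
A hedged illustration of the ANSI behavior (the exact error text may differ):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// An invalid month now fails the query under ANSI mode instead of yielding NULL:
spark.sql("SELECT make_date(2020, 13, 1)").show()
```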

### Why are the changes needed?
For ANSI mode.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT and Existing UT.

Closes #30516 from waitinfuture/SPARK-33498.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: waitinfuture <waitinfuture@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:07:16 +00:00
Max Gekk 6fd148fea8 [SPARK-33569][SQL] Remove getting partitions by an identifier prefix
### What changes were proposed in this pull request?
1. Remove the method `listPartitionIdentifiers()` from the `SupportsPartitionManagement` interface. The method lists partitions by ident prefix.
2. Rename `listPartitionByNames()` to `listPartitionIdentifiers()`.
3. Re-implement the default method `partitionExists()` using the new method.

### Why are the changes needed?
Getting partitions by ident prefix only is not used, and it can be removed to improve code maintenance. Also this makes the `SupportsPartitionManagement` interface cleaner.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly org.apache.spark.sql.connector.catalog.*"
```

Closes #30514 from MaxGekk/remove-listPartitionIdentifiers.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 14:05:49 +00:00
Max Gekk 0a612b6a40 [SPARK-33452][SQL] Support v2 SHOW PARTITIONS
### What changes were proposed in this pull request?
1. Remove the V2 logical node `ShowPartitionsStatement`, and replace it with the V2 `ShowPartitions`.
2. Implement V2 execution node `ShowPartitionsExec` similar to V1 `ShowPartitionsCommand`.

### Why are the changes needed?
To have feature parity with Datasource V1.

### Does this PR introduce _any_ user-facing change?
Yes.

Before the change, `SHOW PARTITIONS` fails in V2 table catalogs with the exception:
```
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is only supported with v1 tables.
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$parseV1Table(ResolveSessionCatalog.scala:628)
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:466)
```
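After the change, the same statement is expected to work against a v2 catalog. A hedged sketch (the catalog, namespace, table and provider names are placeholders):
```scala
// Assumes a v2 catalog named "testcat" supporting partition management is
// registered via spark.sql.catalog.testcat in the session configuration.
spark.sql("CREATE TABLE testcat.ns.tbl (id BIGINT, dt STRING) USING foo PARTITIONED BY (dt)")
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-11-30')")
spark.sql("SHOW PARTITIONS testcat.ns.tbl").show()
```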

### How was this patch tested?
By running the following test suites:
1. Modified `ShowPartitionsParserSuite` where `ShowPartitionsStatement` is replaced by V2 `ShowPartitions`.
2. `v2.ShowPartitionsSuite`

Closes #30398 from MaxGekk/show-partitions-exec-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 13:45:53 +00:00
Wenchen Fan 5cfbdddefe [SPARK-33480][SQL] Support char/varchar type
### What changes were proposed in this pull request?

This PR adds the char/varchar type which is kind of a variant of string type:
1. Char type is a fixed-length string. When comparing char type values, we need to pad the shorter one to the longer length.
2. Varchar type is a string with a length limitation.

To implement the char/varchar semantic, this PR:
1. Do string length check when writing to char/varchar type columns.
2. Do string padding when reading char type columns. We don't do it at the writing side to save storage space.
3. Do string padding when comparing char type column with string literal or another char type column. (string literal is fixed length so should be treated as char type as well)

To simplify the implementation, this PR doesn't propagate char/varchar type info through functions/operators(e.g. `substring`). That said, a column can only be char/varchar type if it's a table column, not a derived column like `SELECT substring(col)`.

To be safe, this PR doesn't add char/varchar type to the query engine(expression input check, internal row framework, codegen framework, etc.). We will replace char/varchar type by string type with metadata (`Attribute.metadata` or `StructField.metadata`) that includes the original type string before it goes into the query engine. That said, the existing code will not see char/varchar type but only string type.

char/varchar type may come from several places:
1. v1 table from hive catalog.
2. v2 table from v2 catalog.
3. user-specified schema in `spark.read.schema` and `spark.readStream.schema`
4. `Column.cast`
5. schema string in places like `from_json`, pandas UDF, etc. These places use SQL parser which replaces char/varchar with string already, even before this PR.

This PR covers all the above cases, implements the length check and padding feature by looking at string type with special metadata.
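A hedged illustration of the resulting semantics (the table name and values are made up for this example):
```scala
spark.sql("CREATE TABLE people (name CHAR(5), code VARCHAR(3)) USING parquet")

// Length check on write: inserting 'abcd' into VARCHAR(3) now fails.
// spark.sql("INSERT INTO people VALUES ('ab', 'abcd')")   // error

spark.sql("INSERT INTO people VALUES ('ab', 'abc')")

// Padding on read and comparison: the CHAR(5) value is padded to length 5,
// and the string literal is padded as well, so the comparison yields true.
spark.sql("SELECT name = 'ab' AS eq FROM people").show()
```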

### Why are the changes needed?

char and varchar are standard SQL types. varchar is widely used in other databases instead of string type.

### Does this PR introduce _any_ user-facing change?

For Hive tables: now the table insertion fails if the value exceeds the char/varchar length. Previously we truncated the value silently.

For other tables:
1. now char type is allowed.
2. now we have a length check when inserting into varchar columns. Previously we wrote the value as it is.

### How was this patch tested?

new tests

Closes #30412 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 09:23:05 +00:00
gengjiaan b665d58819 [SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream databases
### What changes were proposed in this pull request?
Currently, Spark allows parameterless calls to `count`, even though `count` is not a parameterless aggregate function. For example, the following query actually works:
`SELECT count() FROM tenk1;`
On the other hand, mainstream databases will throw an error.
**Oracle**
`> ORA-00909: invalid number of arguments`
**PgSQL**
`ERROR:  count(*) must be used to call a parameterless aggregate function`
**MySQL**
`> 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')`

### Why are the changes needed?
Fix a bug so that Spark is consistent with mainstream databases.
Here is an example query with its output before and after this fix.
`SELECT count() FROM testData;`
The output before this fix:
`0`
The output after this fix:
```
org.apache.spark.sql.AnalysisException
cannot resolve 'count()' due to data type mismatch: count requires at least one argument.; line 1 pos 7
```

### Does this PR introduce _any_ user-facing change?
Yes.
If no parameter is specified for `count`, an error will be thrown.

### How was this patch tested?
Jenkins test.

Closes #30541 from beliefer/SPARK-28646.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 17:04:38 +09:00
xuewei.linxuewei 225c2e2815 [SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using checkCastWithParseError
### What changes were proposed in this pull request?

Duplicated code removed in SPARK-33498 as a follow-up.

### Why are the changes needed?

Nit.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30540 from leanken/leanken-SPARK-33498.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 15:36:26 +09:00
Terry Kim 0fd9f57dd4 [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
### What changes were proposed in this pull request?

This PR proposes to support `CACHE/UNCACHE TABLE` commands for v2 tables.

In addition, this PR proposes to migrate `CACHE/UNCACHE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
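A short hedged sketch of the commands now working on a v2 table (the catalog and table names are placeholders):
```scala
// Assumes a v2 catalog named "testcat" with an existing table ns.tbl.
spark.sql("CACHE TABLE testcat.ns.tbl")
spark.sql("SELECT * FROM testcat.ns.tbl").show()   // served from the cached relation
spark.sql("UNCACHE TABLE testcat.ns.tbl")
```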

### Why are the changes needed?

To support `CACHE/UNCACHE TABLE` commands for v2 tables.

Note that `CACHE/UNCACHE TABLE` for v1 tables/views go through `SparkSession.table` to resolve identifier, which resolves temp views first, so there is no change in the behavior by moving to the new framework.

### Does this PR introduce _any_ user-facing change?

Yes. Now the user can run `CACHE/UNCACHE TABLE` commands on v2 tables.

### How was this patch tested?

Added/updated existing tests.

Closes #30403 from imback82/cache_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 05:37:10 +00:00
Kent Yao 2da72593c1 [SPARK-32976][SQL] Support column list in INSERT statement
### What changes were proposed in this pull request?

#### JIRA expectations
```
   INSERT currently does not support named column lists.

   INSERT INTO <table> (col1, col2,…) VALUES( 'val1', 'val2', … )
   Note, we assume the column list contains all the column names. Issue an exception if the list is not complete. The column order could be different from the column order defined in the table definition.
```
#### Implementation
In this PR, we add a column list as an optional part to the `INSERT OVERWRITE/INTO` statements:
```
  /**
   * {{{
   *   INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? [identifierList] ...
   *   INSERT INTO [TABLE] tableIdentifier [partitionSpec]  [identifierList] ...
   * }}}
   */
```
The column list represents all expected columns, with an explicit order, that you want to insert into the target table. **Particularly**, we assume the column list contains all the column names in the current implementation; it will fail when the list is incomplete.

In **Analyzer**, we add a code path to resolve the column list in the `ResolveOutputRelation` rule before it is transformed to a v1 or v2 command. It will fail here if the list has any field that does not belong to the target table.

Then, for v2 commands, e.g. `AppendData`, we use the resolved column list and the output of the target table to resolve the output of the source query in the `ResolveOutputRelation` rule. If the list has duplicated columns, we fail. If the list is not empty but its size does not match the target table, we fail. If no other exceptions occur, we use the column list to map the output of the source query to the output of the target table. The column list will be set to Nil and it will not hit the rule again after it is resolved.

For v1 commands, all of this happens in the `PreprocessTableInsertion` rule.
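A short sketch of the new syntax (the table and values are illustrative only):
```scala
spark.sql("CREATE TABLE target (a INT, b STRING) USING parquet")

// The column list must name all columns, but may reorder them;
// values are mapped to columns by the list rather than by schema position.
spark.sql("INSERT INTO target (b, a) VALUES ('x', 1)")

// Fails: the column list is incomplete (column 'b' is missing).
// spark.sql("INSERT INTO target (a) VALUES (2)")   // error
```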

### Why are the changes needed?
New feature support.

### Does this PR introduce _any_ user-facing change?

Yes, `INSERT INTO/OVERWRITE TABLE` now supports specifying a column list.
### How was this patch tested?

new tests

Closes #29893 from yaooqinn/SPARK-32976.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 05:23:23 +00:00
luluorta 35ded12fc6 [SPARK-33141][SQL] Capture SQL configs when creating permanent views
### What changes were proposed in this pull request?
This PR makes CreateViewCommand/AlterViewAsCommand capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. Users can set `spark.sql.legacy.useCurrentConfigsForView` to `true` to restore the previous behavior.
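A hedged illustration, following the division-by-zero example from the description below (the view name is made up):
```scala
// The configs in effect at creation time are captured as view properties ...
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("CREATE VIEW safe_div AS SELECT 1 / 0 AS v")

// ... and applied when the view is parsed and analyzed later, so the query
// below keeps the non-ANSI behavior even though the session has ANSI mode on.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT * FROM safe_div").show()   // NULL instead of an error
```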

### Why are the changes needed?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138) that proposes to unify temp view and permanent view behaviors. This PR makes permanent views mimic the temp view behavior that "fixes" view semantics by directly storing the resolved LogicalPlan. For example, if a user uses Spark 2.4 to create a view that contains null values from division-by-zero expressions, she may not want other users' queries which reference her view to throw exceptions when running on Spark 3.x with ANSI mode on.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
added UT + existing UTs (improved)

Closes #30289 from luluorta/SPARK-33141.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-27 13:32:25 +00:00
xuewei.linxuewei b9f2f78de5 [SPARK-33498][SQL] Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
### What changes were proposed in this pull request?

Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid, when ANSI mode is enabled. This patch updates GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast.
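A hedged sketch of the intended ANSI behavior (the input strings are illustrative, and the exact exception types are omitted):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

// Unparsable input now fails instead of returning NULL.
spark.sql("SELECT to_unix_timestamp('not-a-timestamp', 'yyyy-MM-dd HH:mm:ss')").show()   // throws
spark.sql("SELECT CAST('not-a-timestamp' AS TIMESTAMP)").show()                          // throws
```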

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and Existing UT.

Closes #30442 from leanken/leanken-SPARK-33498.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-27 13:24:11 +00:00
gengjiaan e43255051c [SPARK-28645][SQL] ParseException is thrown when the window is redefined
### What changes were proposed in this pull request?
Currently in Spark one could redefine a window. For instance:

`select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w AS (ORDER BY unique1);`
The window `w` is defined twice. In PgSQL, on the other hand, an error will be thrown:

`ERROR:  window "w" is already defined`

### Why are the changes needed?
The current implementation gives the later window definition a higher priority, but this was not Spark's intention and users cannot learn this from any Spark documentation.
This PR fixes the bug.

### Does this PR introduce _any_ user-facing change?
Yes.
Here is an example query with its output before and after this fix.
```
SELECT
    employee_name,
    salary,
    first_value(employee_name) OVER w highest_salary,
    nth_value(employee_name, 2) OVER w second_highest_salary
FROM
    basic_pays
WINDOW
    w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING),
    w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 FOLLOWING)
ORDER BY salary DESC
```
The output before this fix:
```
Larry Bott	11798	Larry Bott	Gerard Bondur
Gerard Bondur	11472	Larry Bott	Gerard Bondur
Pamela Castillo	11303	Larry Bott	Gerard Bondur
Barry Jones	10586	Larry Bott	Gerard Bondur
George Vanauf	10563	Larry Bott	Gerard Bondur
Loui Bondur	10449	Larry Bott	Gerard Bondur
Mary Patterson	9998	Larry Bott	Gerard Bondur
Steve Patterson	9441	Larry Bott	Gerard Bondur
Julie Firrelli	9181	Larry Bott	Gerard Bondur
Jeff Firrelli	8992	Larry Bott	Gerard Bondur
William Patterson	8870	Larry Bott	Gerard Bondur
Diane Murphy	8435	Larry Bott	Gerard Bondur
Leslie Jennings	8113	Larry Bott	Gerard Bondur
Gerard Hernandez	6949	Larry Bott	Gerard Bondur
Foon Yue Tseng	6660	Larry Bott	Gerard Bondur
Anthony Bow	6627	Larry Bott	Gerard Bondur
Leslie Thompson	5186	Larry Bott	Gerard Bondur
```
The output after this fix:
```
struct<>
-- !query output
org.apache.spark.sql.catalyst.parser.ParseException

The definition of window 'w' is repetitive(line 8, pos 0)
```

### How was this patch tested?
Jenkins test.

Closes #30512 from beliefer/SPARK-28645.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-27 10:27:08 +00:00
Terry Kim 2c41d9d8fa [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
### What changes were proposed in this pull request?

This PR proposes to improve the exception messages while `UnresolvedTableOrView` is handled based on this suggestion: https://github.com/apache/spark/pull/30321#discussion_r521127001.

Currently, when an identifier is resolved to a temp view when a table/permanent view is expected, the following exception message is displayed (e.g., for `SHOW CREATE TABLE`):
```
t is a temp view not table or permanent view.
```
After this PR, the message will be:
```
t is a temp view. 'SHOW CREATE TABLE' expects a table or permanent view.
```

Also, if an identifier is not resolved, the following exception message is currently used:
```
Table or view not found: t
```
After this PR, the message will be:
```
Table or permanent view not found for 'SHOW CREATE TABLE': t
```
or
```
Table or view not found for 'ANALYZE TABLE ... FOR COLUMNS ...': t
```

### Why are the changes needed?

To improve the exception message.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed as described above.

### How was this patch tested?

Updated existing tests.

Closes #30475 from imback82/unresolved_table_or_view.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-27 10:16:56 +00:00
yangjie01 433ae9064f [SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV
### What changes were proposed in this pull request?
There are some differences between Spark CSV, opencsv and commons-csv; a typical case is described in SPARK-33566: when there are both unescaped quotes and an unescaped qualifier in a value, the parsing results are different.

The reason for the difference is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` to build `CsvParser`, and it is not configurable.

On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default.

So this PR makes the `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv.
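A hedged sketch of using the new option (the file path is a placeholder; the option value is one of the `UnescapedQuoteHandling` modes mentioned above):
```scala
val df = spark.read
  .option("header", "true")
  // Mimic opencsv / commons-csv behavior instead of Spark's default STOP_AT_DELIMITER.
  .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
  .csv("/path/to/data.csv")   // placeholder path
```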

### Why are the changes needed?
Make the unescapedQuoteHandling option configurable when reading CSV to make parsing more flexible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Add a new case similar to that described in SPARK-33566

Closes #30518 from LuciferYang/SPARK-33566.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-27 15:47:39 +09:00
Maryann Xue dfa3978d91 [SPARK-33551][SQL] Do not use custom shuffle reader for repartition
### What changes were proposed in this pull request?

This PR fixes an AQE issue where local shuffle reader, partition coalescing, or skew join optimization can be mistakenly applied to a shuffle introduced by repartition or a regular shuffle that logically replaces a repartition shuffle.
The proposed solution checks for the presence of any repartition shuffle and filters out not applicable optimization rules for the final stage in an AQE plan.

### Why are the changes needed?

Without the change, the output of a repartition query may not be correct.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #30494 from maryannxue/csr-repartition.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-11-25 19:32:22 -08:00
Dongjoon Hyun 7cf6a6f996 [SPARK-31257][SPARK-33561][SQL][FOLLOWUP] Fix Scala 2.13 compilation
### What changes were proposed in this pull request?

This PR is a follow-up to fix Scala 2.13 compilation.

### Why are the changes needed?

To support Scala 2.13 in Apache Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action Scala 2.13 compilation job.

Closes #30502 from dongjoon-hyun/SPARK-31257.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-25 09:57:46 -08:00
Liang-Chi Hsieh 9643eab53e [SPARK-33540][SQL] Subexpression elimination for interpreted predicate
### What changes were proposed in this pull request?

This patch proposes to support subexpression elimination for interpreted predicate.

### Why are the changes needed?

Similar to interpreted projection, there are use cases when the codegen predicate is not able to work, e.g. a too complex schema, non-codegen expressions, etc. When there are frequently occurring expressions (subexpressions) among the predicate expressions, the performance is quite bad as we need to re-compute the same expressions. We should be able to support subexpression elimination for interpreted predicates just like interpreted projection.

### Does this PR introduce _any_ user-facing change?

No, this doesn't change user behavior.

### How was this patch tested?

Unit test and benchmark.

Closes #30497 from viirya/SPARK-33540.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-25 08:55:39 -08:00
Gengliang Wang d691d85701 [SPARK-33496][SQL] Improve error message of ANSI explicit cast
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/30260, there are some type conversions disallowed under ANSI mode.
We should tell users what they can do if they have to use the disallowed casting.

### Why are the changes needed?

Make it more user-friendly.

### Does this PR introduce _any_ user-facing change?

Yes, the error message is improved on casting failure when ANSI mode is enabled
### How was this patch tested?

Unit tests.

Closes #30440 from gengliangwang/improveAnsiCastErrorMSG.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-11-25 23:15:52 +08:00
Ryan Blue 6f68ccf532 [SPARK-31257][SPARK-33561][SQL] Unify create table syntax
### What changes were proposed in this pull request?

* Unify the create table syntax in the parser by merging Hive and DataSource clauses
* Add `SerdeInfo` and `external` boolean to statement plans and update AstBuilder to produce them
* Add conversion from create statement plan to v1 create plans in ResolveSessionCatalog
* Support new statement clauses in ResolveCatalogs conversion to v2 create plans
* Remove SparkSqlParser rules for Hive syntax
* Add "option." namespace to distinguish SERDEPROPERTIES and OPTIONS in table properties

### Why are the changes needed?

* Current behavior is confusing.
* A way to pass the Hive create options to DSv2 is needed for a Hive source.

### Does this PR introduce any user-facing change?

Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses.
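For illustration, both flavors below now flow through the same unified parser rules (the Hive flavor assumes a Hive-enabled session; table names and location are placeholders):
```scala
// DataSource flavor
spark.sql("CREATE TABLE ds_tbl (id BIGINT, data STRING) USING parquet")

// Hive flavor with STORED AS, previously handled by separate SparkSqlParser rules
spark.sql("CREATE EXTERNAL TABLE hive_tbl (id BIGINT, data STRING) STORED AS PARQUET LOCATION '/tmp/hive_tbl'")
```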

### How was this patch tested?

Existing tests validate there are no behavior changes.

Update unit tests for using a statement plan for Hive create syntax:
* Move create tests from spark-sql DDLParserSuite into PlanResolutionSuite
* Add parser tests to spark-catalyst DDLParserSuite

Closes #28026 from rdblue/unify-create-table.

Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-25 15:09:02 +00:00
Max Gekk 2c5cc36e3f [SPARK-33509][SQL] List partition by names from a V2 table which supports partition management
### What changes were proposed in this pull request?
1. Add a new method `listPartitionByNames` to the `SupportsPartitionManagement` interface. It allows listing partitions by partition names and their values.
2. Implement new method in `InMemoryPartitionTable` which is used in DSv2 tests.

### Why are the changes needed?
Currently, the `SupportsPartitionManagement` interface exposes only `listPartitionIdentifiers`, which allows listing partitions by partition values, and it requires specifying values for all partition schema fields in the prefix. This restriction does not allow listing partitions by only some of the partition names (not all of them).

For example, the table `tableA` is partitioned by the two columns `year` and `month`
```
CREATE TABLE tableA (price int, year int, month int)
USING _
partitioned by (year, month)
```
and has the following partitions:
```
PARTITION(year = 2015, month = 1)
PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)
PARTITION(year = 2016, month = 3)
```
If we want to list all partitions with `month = 2`, we have to specify `year` for **listPartitionIdentifiers()**, which is not always possible as we don't know all `year` values in advance. The new method **listPartitionByNames()** allows specifying partition values only for `month`, and getting two partitions:
```
PARTITION(year = 2015, month = 2)
PARTITION(year = 2016, month = 2)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite `SupportsPartitionManagementSuite`.

Closes #30452 from MaxGekk/column-names-listPartitionIdentifiers.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-25 12:41:53 +00:00
Gengliang Wang 19f3b89d62 [SPARK-33549][SQL] Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
### What changes were proposed in this pull request?

Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp

### Why are the changes needed?

In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true.

After https://github.com/apache/spark/pull/30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior.

As the configuration has not been in any release yet, we should remove it to make things simpler.

### Does this PR introduce _any_ user-facing change?

No, since the configuration is not released yet.

### How was this patch tested?

Existing test cases

Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-25 08:59:31 +00:00
Terry Kim b7f034d8dc [SPARK-33543][SQL] Migrate SHOW COLUMNS command to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `SHOW COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `SHOW COLUMNS` is not yet supported for v2 tables.

### Why are the changes needed?

To use `UnresolvedTableOrView` for table/view resolution. Note that `ShowColumnsCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #30490 from imback82/show_columns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-25 03:04:04 +00:00
Gabor Somogyi 95b6dabc33 [SPARK-33287][SS][UI] Expose state custom metrics information on SS UI
### What changes were proposed in this pull request?
The Structured Streaming UI does not contain state custom metrics information. In this PR I've added it.

### Why are the changes needed?
Missing state custom metrics information.

### Does this PR introduce _any_ user-facing change?
Additional UI elements appear.

### How was this patch tested?
Existing unit tests + manual test.
```
#Compile Spark
echo "spark.sql.streaming.ui.enabledCustomMetricList stateOnCurrentVersionSizeBytes" >> conf/spark-defaults.conf
sbin/start-master.sh
sbin/start-worker.sh spark://gsomogyi-MBP16:7077
./bin/spark-submit --master spark://gsomogyi-MBP16:7077 --deploy-mode client --class com.spark.Main ../spark-test/target/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar
```
<img width="1119" alt="Screenshot 2020-11-18 at 12 45 36" src="https://user-images.githubusercontent.com/18561820/99527506-2f979680-299d-11eb-9187-4ae7fbd2596a.png">

Closes #30336 from gaborgsomogyi/SPARK-33287.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-25 07:38:45 +09:00
Terry Kim fdd6c73b3c [SPARK-33514][SQL] Migrate TRUNCATE TABLE command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `TRUNCATE TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `TRUNCATE TABLE` works only with v1 tables, and not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t using csv AS SELECT 1")
sql("USE db")
sql("TRUNCATE TABLE t") // Succeeds
```
With this PR, `TRUNCATE TABLE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$42(Analyzer.scala:866)

```
, which is expected since temporary view is resolved first and `TRUNCATE TABLE` doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `TRUNCATE TABLE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

### How was this patch tested?

Updated existing tests.

Closes #30457 from imback82/truncate_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-24 11:06:39 +00:00
Max Gekk a6555ee596 [SPARK-33521][SQL] Universal type conversion in resolving V2 partition specs
### What changes were proposed in this pull request?
In the PR, I propose to change the resolver of partition specs used in V2 `ALTER TABLE .. ADD/DROP PARTITION` (at the moment), and to re-use `CAST` to convert partition values to the desired types according to the partition schema.
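A hedged example of a spec that now resolves (the catalog, table and column names are placeholders; `dt` is assumed to be a DATE partition column):
```scala
// The string partition value is cast to DATE according to the partition schema.
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-11-24')")
```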

### Why are the changes needed?
Currently, the resolver of V2 partition specs supports just a few types: 23e9920b39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (L72), and fails on other types like date/timestamp.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running `AlterTablePartitionV2SQLSuite`

Closes #30474 from MaxGekk/dsv2-partition-value-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-24 08:04:21 +00:00
Dongjoon Hyun 8380e00419
[SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.hashCode for BucketTransform
### What changes were proposed in this pull request?

This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`.

### Why are the changes needed?

SPARK-32168 made `InMemoryTable` handle `BucketTransform` as a hash of `Tuple`, which is dependent on the Scala version.
- https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159

**Scala 2.12.10**
```scala
$ bin/scala
Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
res0: Int = -2074071657
```

**Scala 2.13.3**
```scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
val res0: Int = -1669302457
```

### Does this PR introduce _any_ user-facing change?

Yes. This is a correctness issue.

### How was this patch tested?

Pass the UT with both Scala 2.12/2.13.

Closes #30477 from dongjoon-hyun/SPARK-33524.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-23 19:35:58 -08:00
Dongjoon Hyun 3ce4ab545b
[SPARK-33513][BUILD] Upgrade to Scala 2.13.4 to improve exhaustivity
### What changes were proposed in this pull request?

This PR aims at the following.
1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1
2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.)
3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job.

### Why are the changes needed?

Scala 2.13.4 is a maintenance release for 2.13 line and improves JDK 15 support.
- https://github.com/scala/scala/releases/tag/v2.13.4

Also, it improves exhaustivity check.
- https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors)
- https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components)

### Does this PR introduce _any_ user-facing change?

Yep. Although it's a maintenance version change, it's a Scala version change.

### How was this patch tested?

Pass the CIs and do the manual testing.
- Scala 2.12 CI jobs(GitHub Action/Jenkins UT/Jenkins K8s IT) to check the validity of code change.
- Scala 2.13 Compilation job to check the compilation

Closes #30455 from dongjoon-hyun/SCALA_3.13.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-23 16:28:43 -08:00
gengjiaan f83fcb1254 [SPARK-33278][SQL][FOLLOWUP] Improve OptimizeWindowFunctions to avoid transfer first to nth_value
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/30178 provided `OptimizeWindowFunctions`, which is used to transform `first` into `nth_value`.
If the window frame is `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, `nth_value` has better performance than `first`.
But `OptimizeWindowFunctions` needs to exclude other window frames.

### Why are the changes needed?
Improve `OptimizeWindowFunctions` to avoid transforming `first` into `nth_value` if the specified window frame isn't `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #30419 from beliefer/SPARK-33278_followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-23 14:54:44 +00:00
Max Gekk 23e9920b39 [SPARK-33511][SQL] Respect case sensitivity while resolving V2 partition specs
### What changes were proposed in this pull request?
1. Pre-process partition specs in `ResolvePartitionSpec`, and convert partition names according to the partition schema and the SQL config `spark.sql.caseSensitive`. In the PR, I propose to invoke `normalizePartitionSpec` for that. The function is used in DSv1 commands, so, the behavior will be similar to DSv1.
2. Move `normalizePartitionSpec()` from `sql/core/.../datasources/PartitioningUtils` to `sql/catalyst/.../util/PartitioningUtils` to use it in Catalyst's rule `ResolvePartitionSpec`

### Why are the changes needed?
DSv1 commands like `ALTER TABLE .. ADD PARTITION` and `ALTER TABLE .. DROP PARTITION` respect the SQL config `spark.sql.caseSensitive` while resolving partition specs. For example:
```sql
spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id);
spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
spark-sql> SHOW PARTITIONS tbl1;
id=1
```
The same command fails on V2 Table catalog with error:
```
AnalysisException: Partition key ID not exists
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, partition spec resolution works as for DSv1 (without the exception shown above).

### How was this patch tested?
By running `AlterTablePartitionV2SQLSuite`.

Closes #30454 from MaxGekk/partition-spec-case-sensitivity.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-23 09:00:41 +00:00
Terry Kim 60f3a730e4 [SPARK-33515][SQL] Improve exception messages while handling UnresolvedTable
### What changes were proposed in this pull request?

This PR proposes to improve the exception messages while `UnresolvedTable` is handled based on this suggestion: https://github.com/apache/spark/pull/30321#discussion_r521127001.

Currently, when an identifier is resolved to a view when a table is expected, the following exception message is displayed (e.g., for `COMMENT ON TABLE`):
```
v is a temp view not table.
```
After this PR, the message will be:
```
v is a temp view. 'COMMENT ON TABLE' expects a table.
```

Also, if an identifier is not resolved, the following exception message is currently used:
```
Table not found: t
```
After this PR, the message will be:
```
Table not found for 'COMMENT ON TABLE': t
```

### Why are the changes needed?

To improve the exception message.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed as described above.

### How was this patch tested?

Updated existing tests.

Closes #30461 from imback82/unresolved_table_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-23 08:54:00 +00:00
Xiao Li c891e025b8 Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash"
### What changes were proposed in this pull request?

This reverts commit 065f17386d, which is not part of any released version. That is, this is an unreleased feature.

### Why are the changes needed?

I like the concept of Trash, but I think this PR might just resolve a very specific issue by introducing a mechanism without a proper design doc. This could make the usage more complex.

I think we need to consider the big picture. Trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete the data, instead of Truncate only. We also need to consider what is the current behavior if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about the performance when users are using the object store instead of HDFS? Will it impact the GDPR compliance?

In sum, I think we should not merge the PR https://github.com/apache/spark/pull/29552 without the design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1

### Does this PR introduce _any_ user-facing change?
Reverted the original commit

### How was this patch tested?
The existing tests.

Closes #30463 from gatorsmile/revertSpark-32481.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-23 17:43:58 +09:00
Liang-Chi Hsieh aa78c05edc [SPARK-33427][SQL][FOLLOWUP] Put key and value into IdentityHashMap sequentially
### What changes were proposed in this pull request?

This follow-up fixes an issue when inserting key/value pairs into `IdentityHashMap` in `SubExprEvaluationRuntime`.

### Why are the changes needed?

The last commits to #30341 follow a review comment to use `IdentityHashMap`. Because we leverage `IdentityHashMap` to compare keys by reference, we should not convert expression pairs to a Scala map before inserting. A Scala map compares keys by equality, so we would lose keys with different references.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run benchmark to verify.

Closes #30459 from viirya/SPARK-33427-map.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-23 10:42:28 +09:00
ulysses 6d625ccd5b
[SPARK-33469][SQL] Add current_timezone function
### What changes were proposed in this pull request?

Add a `CurrentTimeZone` function and replace the value on the `Optimizer` side.

### Why are the changes needed?

Let users get the current timezone easily. Then a user can call
```
SELECT current_timezone()
```

Presto: https://prestodb.io/docs/current/functions/datetime.html
SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/functions/current-timezone-transact-sql?view=sql-server-ver15

### Does this PR introduce _any_ user-facing change?

Yes, a new function.

### How was this patch tested?

Add test.

Closes #30400 from ulysses-you/SPARK-33469.

Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses-you <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-22 15:36:44 -08:00
Max Gekk 530c0a8e28
[SPARK-33505][SQL][TESTS] Fix adding new partitions by INSERT INTO InMemoryPartitionTable
### What changes were proposed in this pull request?
1. Add a hook method `addPartitionKey()` to `InMemoryTable`, which is called for every row.
2. Override `addPartitionKey()` in `InMemoryPartitionTable`, and add a partition key every time a new row is inserted into the table.

### Why are the changes needed?
To be able to write unified tests for datasources V1 and V2. Currently, INSERT INTO a V1 table creates partitions but the same doesn't work for the custom catalog `InMemoryPartitionTableCatalog` used in DSv2 tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite `DataSourceV2SQLSuite`.

Closes #30449 from MaxGekk/insert-into-InMemoryPartitionTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-20 18:41:25 -08:00
anchovYu de0f50abf4 [SPARK-32670][SQL] Group exception messages in Catalyst Analyzer in one file
### What changes were proposed in this pull request?

Group all messages of `AnalysisException`s created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file.
* Create a new object: `org.apache.spark.sql.CatalystErrors` with many exception-creating functions.
* When the `Analyzer` wants to create and throw a new `AnalysisException`, call functions of `CatalystErrors`

### Why are the changes needed?

This is a sample PR that groups exception messages together in several files. It will largely help with standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?

No. Error messages remain unchanged.

### How was this patch tested?

No new tests - pass all original tests to make sure it doesn't break any existing behavior.

### Naming of exception functions

All function names ended with `Error`.
* For specific errors like `groupingIDMismatch` and `groupingColInvalid`, directly use them as name, just like `groupingIDMismatchError` and `groupingColInvalidError`.
* For generic errors like `dataTypeMismatch`,
  * if confident with the context, prefix and condition can be added, like `pivotValDataTypeMismatchError`
  * if not sure about the context, add a `For` suffix of the specific component that this exception is related to, like `dataTypeMismatchForDeserializerError`

Closes #29497 from anchovYu/32670.

Lead-authored-by: anchovYu <aureole@sjtu.edu.cn>
Co-authored-by: anchovYu <xyyu15@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-21 08:33:39 +09:00
ulysses 3384bda453 [SPARK-33468][SQL] ParseUrl in ANSI mode should fail if input string is not a valid url
### What changes were proposed in this pull request?

With `ParseUrl`, instead of returning null, we throw an exception if the input string is not a valid url.
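A short hedged sketch of the new behavior (the input is deliberately malformed):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Previously this returned NULL; under ANSI mode it now throws an exception.
spark.sql("SELECT parse_url('not a valid url', 'HOST')").show()
```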

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce _any_ user-facing change?

Yes, user will get exception if `set spark.sql.ansi.enabled=true`.

### How was this patch tested?

Add test.

Closes #30399 from ulysses-you/SPARK-33468.

Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses-you <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-20 13:23:08 +00:00
Chao Sun 6da8ade5f4
[SPARK-33045][SQL][FOLLOWUP] Fix build failure with Scala 2.13
### What changes were proposed in this pull request?

Explicitly convert `scala.collection.mutable.Buffer` to `Seq`. In Scala 2.13 `Seq` is an alias of `scala.collection.immutable.Seq` instead of `scala.collection.Seq`.

### Why are the changes needed?

Without the change build with Scala 2.13 fails with the following:
```
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1417:41: type mismatch;
[error]  found   : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error]  required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error]                 case null => LikeAll(e, patterns)
[error]                                         ^
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1418:41: type mismatch;
[error]  found   : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error]  required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error]                 case _ => NotLikeAll(e, patterns)
[error]                                         ^
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30431 from sunchao/SPARK-33045-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-19 12:42:33 -08:00
gengjiaan 3695e997d5 [SPARK-33045][SQL] Support build-in function like_all and fix StackOverflowError issue
### What changes were proposed in this pull request?
Spark already supports the `LIKE ALL` syntax, but it will throw a `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for LIKE ALL to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to connect each `Like(e, p)`; this makes the call depth of the thread too large, causing `StackOverflowError` problems.

Why does the fix in this PR avoid the error?
This PR adds a built-in function for `LIKE ALL` that avoids this issue.
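A minimal sketch of the semantics (the patterns are illustrative):
```scala
// LIKE ALL matches only if the value matches every pattern in the list.
spark.sql("SELECT 'foobar' LIKE ALL ('%foo%', '%bar%') AS r1").show()   // true
spark.sql("SELECT 'foobar' LIKE ALL ('%foo%', '%baz%') AS r2").show()   // false
```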

### Why are the changes needed?
1. Fix the `StackOverflowError` issue.
2. Support the built-in function `like_all`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #29999 from beliefer/SPARK-33045-like_all.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-19 16:56:21 +00:00
ulysses 21b13506cd [SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row
### What changes were proposed in this pull request?

Rename `CombineLimits` to `EliminateLimits` and add a check that eliminates the `Limit` if its child's maximum row count is <= the limit value.

### Why are the changes needed?

In ad-hoc scenarios, we always add a limit to the query if the user gives no specific limit value, but not every limit is necessary.

A general negative example is
```
select count(*) from t limit 100000;
```

It would be great if we could eliminate the limit on the Spark side.

Also, we make a benchmark for this case
```
runBenchmark("Sort and Limit") {
  val N = 100000
  val benchmark = new Benchmark("benchmark sort and limit", N)

  benchmark.addCase("TakeOrderedAndProject", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
  }

  benchmark.addCase("Sort And Limit", 3) { _ =>
    withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> "-1") {
      spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
    }
  }

  benchmark.addCase("Sort", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").collect()
  }
  benchmark.run()
}
```

and the result is
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.15.6
Intel(R) Core(TM) i5-5257U CPU  2.70GHz
benchmark sort and limit:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject                              1833           2259         382          0.1       18327.1       1.0X
Sort And Limit                                     1417           1658         285          0.1       14167.5       1.3X
Sort                                               1324           1484         225          0.1       13238.3       1.4X
```

It shows that it makes sense to replace `TakeOrderedAndProjectExec` with `Sort + Project`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30368 from ulysses-you/SPARK-33442.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-19 13:31:10 +00:00
allisonwang-db ef2638c3e3
[SPARK-33183][SQL][FOLLOW-UP] Update rule RemoveRedundantSorts config version
### What changes were proposed in this pull request?
This PR is a follow-up for #30093 to update the config `spark.sql.execution.removeRedundantSorts` version to 2.4.8.

### Why are the changes needed?
To update the rule version since it has been backported to 2.4 (#30194).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30420 from allisonwang-db/spark-33183-follow-up.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-19 00:12:22 -08:00
yangjie01 e3058ba17c [SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports
### What changes were proposed in this pull request?
This PR adds a new Scala compile arg to `pom.xml` to defend against new unused imports:

- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13

The other file changes remove all unused imports in the Spark code.

### Why are the changes needed?
Clean up code and add a guarantee to defend against new unused imports.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30351 from LuciferYang/remove-imports-core-module.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-19 14:20:39 +09:00
Ryan Blue 66a76378cf
[SPARK-31255][SQL][FOLLOWUP] Add missing license headers
### What changes were proposed in this pull request?

Add missing license headers for new files added in #28027.

### Why are the changes needed?

To fix licenses.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a purely non-functional change.

Closes #30415 from rdblue/license-headers.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-18 19:18:28 -08:00
Liang-Chi Hsieh e518008ca9
[SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections
### What changes were proposed in this pull request?

Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`.

### Why are the changes needed?

Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30406 from viirya/SPARK-33473.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-18 18:58:06 -08:00
Liang-Chi Hsieh 97d2cee4af [SPARK-33427][SQL][FOLLOWUP] Prevent test flakiness in SubExprEvaluationRuntimeSuite
### What changes were proposed in this pull request?

This followup is to prevent possible test flakiness of `SubExprEvaluationRuntimeSuite`.

### Why are the changes needed?

Because HashMap doesn't guarantee the order, in `proxyExpressions` the proxy expression id is not deterministic. So in `SubExprEvaluationRuntimeSuite` we should not test against it.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes #30414 from viirya/SPARK-33427-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-18 18:35:11 -08:00
Gengliang Wang 9a4c79073b [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode
### What changes were proposed in this pull request?

In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.
![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png)

Comparing the ANSI CAST syntax rules with the current default behavior of Spark:
![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png)

To make Spark's ANSI mode more ANSI SQL compatible, I propose to disallow the following castings in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```
The following castings are considered invalid in the ANSI SQL standard, but they are quite straightforward. Let's allow them for now:
```
Numeric <=> Boolean
String <=> Binary
```
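A hedged illustration of one newly-disallowed combination:
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Numeric <=> Timestamp explicit casts are no longer allowed in ANSI mode.
spark.sql("SELECT CAST(1 AS TIMESTAMP)")   // fails analysis under ANSI mode
```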
### Why are the changes needed?

Better ANSI SQL compliance

### Does this PR introduce _any_ user-facing change?

Yes, the following castings will not be allowed in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```

### How was this patch tested?

Unit test

The ANSI Compliance doc preview:
![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png)

Closes #30260 from gengliangwang/ansiCanCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-19 09:23:36 +09:00
Ryan Blue 1df69f7e32 [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2
### What changes were proposed in this pull request?

This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns.

The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used.

### Why are the changes needed?

This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan.

### Does this PR introduce any user-facing change?

Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns.

### How was this patch tested?

Will include new unit tests.

Closes #28027 from rdblue/add-dsv2-metadata-columns.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-11-18 14:07:51 -08:00
Bryan Cutler 8e2a0bdce7 [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow
### What changes were proposed in this pull request?

This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.

### Why are the changes needed?

MapType was previously unsupported with Arrow.

### Does this PR introduce _any_ user-facing change?

User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs.

### How was this patch tested?

Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.

Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh 928348408e [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
### What changes were proposed in this pull request?

This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen is not able to work, for example with a complex schema.

### Why are the changes needed?

Currently we only do subexpression elimination for codegen. For various reasons, we may need to run interpreted expression evaluation. For example, codegen fails to compile and falls back to interpreted mode, or expressions have complex input/output schemas. Complex schemas from expressions are commonly seen, possibly caused by the query optimizer too, e.g. SPARK-32945.

We should also support subexpression elimination for interpreted evaluation. That could reduce the performance difference when Spark falls back from codegen to interpreted expression evaluation, and improve Spark usability.

#### Benchmark

Update `SubExprEliminationBenchmark`:

Before:

```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903          0.0   247068775.9       1.0X
```

After:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87          0.0    23604320.7      11.2X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test. Benchmark manually.

Closes #30341 from viirya/SPARK-33427.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-17 14:29:37 +00:00
Yuming Wang 09bb9bedcd [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values
### What changes were proposed in this pull request?

We [rewrite](5197c5d2e7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (L722-L724)) the `In`/`InSet` predicate to `or` expressions when pruning Hive partitions. That can cause a Hive metastore stack overflow if there are a lot of values.

This PR rewrites the `InSet` predicate to `GreaterThanOrEqual` of the min value and `LessThanOrEqual` of the max value when pruning Hive partitions, to avoid a Hive metastore stack overflow.

From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000.
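
A sketch of the intended effect, assuming `logs` is a Hive table partitioned by an integer column `ds` (table and column names are assumptions):

```scala
// If the number of values in the IN list exceeds the threshold, the partition
// filter sent to the Hive metastore is relaxed to a range check
// (roughly ds >= min AND ds <= max) instead of a long OR chain.
spark.conf.set("spark.sql.hive.metastorePartitionPruningInSetThreshold", "1000")
spark.sql("SELECT count(*) FROM logs WHERE ds IN (20201101, 20201102, 20201130)").explain()
```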

### Why are the changes needed?

Avoid a Hive metastore stack overflow when the `InSet` predicate has many values.
This is especially relevant for DPP, which may generate many values.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #30325 from wangyum/SPARK-33416.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-17 13:47:01 +00:00
HyukjinKwon e2c7bfce40 [SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default)
### What changes were proposed in this pull request?

This PR proposes to simplify the exception messages from Python UDFs.

Currently, the exception message from Python UDFs is as below:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

Actually, in almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% of cases.

This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc.

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

The traceback will be shown from the point where any non-PySpark file is seen in the traceback.

### Why are the changes needed?

Without this configuration, such internal tracebacks are exposed to users directly, especially for shell or notebook users in PySpark. In 99% of cases people don't care about the internal Python worker, (de)serialization, and related tracebacks; they just make the exception more difficult to read. For example, the single statement `x/0` above shows a very long traceback, and most of it is unnecessary.

This configuration enables the ability to show simplified tracebacks which users will likely be most interested in.

### Does this PR introduce _any_ user-facing change?

By default, no. It adds one configuration that simplifies the exception message. See the example above.

### How was this patch tested?

Manually tested:

```bash
$ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
```
```python
from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

and unittests were also added.

Closes #30309 from HyukjinKwon/SPARK-33407.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-17 14:15:31 +09:00
Cheng Su 5af5aa146e [SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite
### What changes were proposed in this pull request?

This PR is a followup from https://github.com/apache/spark/pull/30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type.

### Why are the changes needed?

Help reduce duplicated code and make it easier for developers to read and add code in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring).

Closes #30347 from c21/stream-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-17 11:18:42 +09:00
xuewei.linxuewei b5eca18af0 [SPARK-33460][SQL] Accessing map values should fail if key is not found
### What changes were proposed in this pull request?

Instead of returning NULL, throw a runtime NoSuchElementException on invalid key access in map-related functions such as `element_at` and `GetMapValue` when ANSI mode is on.
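
A small illustration of the new ANSI-mode behavior (only applies when the config below is enabled):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Key 3 is absent, so this now throws NoSuchElementException at runtime
// instead of returning NULL.
spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 3)").show()
```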

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and Existing UT.

Closes #30386 from leanken/leanken-SPARK-33460.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 16:14:31 +00:00
Max Gekk 6883f29465 [SPARK-33453][SQL][TESTS] Unify v1 and v2 SHOW PARTITIONS tests
### What changes were proposed in this pull request?
1. Move `SHOW PARTITIONS` parsing tests to `ShowPartitionsParserSuite`
2. Place Hive tests for `SHOW PARTITIONS` from `HiveCommandSuite` to the base test suite `v1.ShowPartitionsSuiteBase`. This will allow to run the tests w/ and w/o Hive.

The changes follow the approach of https://github.com/apache/spark/pull/30287.

### Why are the changes needed?
- The unification will allow to run common `SHOW PARTITIONS` tests for both DSv1 and Hive DSv1, DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running:
- new test suites `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"`
- and old one `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCommandSuite"`

Closes #30377 from MaxGekk/unify-dsv1_v2-show-partitions-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 16:11:42 +00:00
luluorta dfa6fb46f4 [SPARK-33389][SQL] Make internal classes of SparkSession always using active SQLConf
### What changes were proposed in this pull request?

This PR makes internal classes of SparkSession always use the active SQLConf. We remove all `conf: SQLConf` constructor parameters from these classes (`Analyzer`, `SparkPlanner`, `SessionCatalog`, `CatalogManager`, `SparkSqlParser`, etc.) and use `SQLConf.get` instead.

### Why are the changes needed?

Code refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test

Closes #30299 from luluorta/SPARK-33389.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 15:27:18 +00:00
Max Gekk 71a29b2eca [MINOR][SQL][DOCS] Fix a reference to spark.sql.sources.useV1SourceList
### What changes were proposed in this pull request?
Replace `spark.sql.sources.write.useV1SourceList` by `spark.sql.sources.useV1SourceList` in the comment for `CatalogManager.v2SessionCatalog()`.

### Why are the changes needed?
To have correct comments.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30385 from MaxGekk/fix-comment-useV1SourceList.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 17:57:20 +09:00
Max Gekk 4e5d2e0695 [SPARK-33394][SQL][TESTS] Throw NoSuchNamespaceException for not existing namespace in InMemoryTableCatalog.listTables()
### What changes were proposed in this pull request?
Throw `NoSuchNamespaceException` in `listTables()` of the custom test catalog `InMemoryTableCatalog` if the passed namespace doesn't exist.

### Why are the changes needed?
1. To align behavior of V2 `InMemoryTableCatalog` to V1 session catalog.
2. To distinguish two situations:
    1. A namespace **does exist** but does not contain any tables. In that case, `listTables()` returns empty result.
    2. A namespace **does not exist**. `listTables()` throws `NoSuchNamespaceException` in this case.

### Does this PR introduce _any_ user-facing change?
Yes. For example, `SHOW TABLES` in a non-existing namespace returned an empty result before the changes; now it fails with `NoSuchNamespaceException`.

### How was this patch tested?
By running V1/V2 ShowTablesSuites.

Closes #30358 from MaxGekk/show-tables-in-not-existing-namespace.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 07:08:21 +00:00
luluorta 156704ba0d [SPARK-33432][SQL] SQL parser should use active SQLConf
### What changes were proposed in this pull request?

This PR makes SQL parser using active SQLConf instead of the one in ctor-parameters.

### Why are the changes needed?

In ANSI mode, schema string parsing should fail if the schema uses an ANSI reserved keyword as an attribute name:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat',  'dd/MM/yyyy'));""").show
```

output:

> Cannot parse the data type:
> no viable alternative at input 'time'(line 1, pos 0)
>
> == SQL ==
> time Timestamp
> ^^^

But this query may accidentally succeed in certain cases, because the DataType parser sticks to the configs of the first session created in the current thread:

```scala
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show
```

output:

> +--------------------------------+
> |from_json({"time":"26/10/2015"})|
> +--------------------------------+
> |                   {2015-10-26 00:00...|
> +--------------------------------+

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly and updated UTs

Closes #30357 from luluorta/SPARK-33432.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-14 13:37:12 -08:00
Liang-Chi Hsieh 0046222a75 [SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite
### What changes were proposed in this pull request?

This is a simple follow-up to prevent test flakiness in SubexpressionEliminationSuite. If `getAllEquivalentExprs` returns more than one sequence, we should use `contains` instead of assuming the order of results, since a HashMap does not guarantee ordering.

### Why are the changes needed?

Prevent test flakiness in SubexpressionEliminationSuite.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes #30371 from viirya/SPARK-33337-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-13 15:10:02 -08:00
xuewei.linxuewei 234711a328 Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession"
### What changes were proposed in this pull request?

In [SPARK-33139] we marked `setActiveSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion we concluded that the unified view feature works even without that change; the only risk is if users really abuse these two APIs. So reverting the PR is needed.

[SPARK-33139] has two commits, including a follow-up. Revert them both.

### Why are the changes needed?

Revert.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30367 from leanken/leanken-revert-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 13:35:45 +00:00
Kent Yao cdd8e51742 [SPARK-33419][SQL] Unexpected behavior when using SET commands before a query in SparkSession.sql
### What changes were proposed in this pull request?

SparkSession.sql converts a string value to a DataFrame, and the string value should be a single SQL statement, optionally ending with one or more semicolons, e.g.

```sql
scala> spark.sql(" select 2").show
+---+
|  2|
+---+
|  2|
+---+
scala> spark.sql(" select 2;").show
+---+
|  2|
+---+
|  2|
+---+

scala> spark.sql(" select 2;;;;").show
+---+
|  2|
+---+
|  2|
+---+
```
If we put 2 or more statements in, it fails in the parser as expected, e.g.

```sql
scala> spark.sql(" select 2; select 1;").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'select' expecting {<EOF>, ';'}(line 1, pos 11)

== SQL ==
 select 2; select 1;
-----------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

As a very generic user scenario, users may want to change some settings before they execute
the queries. They may pass a string value like `set spark.sql.abc=2; select 1;` into this API, which creates a confusing gap between the actual effect and the user's expectations.

The user may want the query to be executed with spark.sql.abc=2, but Spark actually treats the whole part of `2; select 1;` as the value of the property 'spark.sql.abc',
e.g.

```
scala> spark.sql("set spark.sql.abc=2; select 1;").show
+-------------+------------+
|          key|       value|
+-------------+------------+
|spark.sql.abc|2; select 1;|
+-------------+------------+
```

What's more, the SET symbol could digest everything behind it, which makes it unstable from version to version, e.g.

#### 3.1
```sql
scala> spark.sql("set;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided

scala> spark.sql("set a;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set a;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

#### 2.4

```sql
scala> spark.sql("set;").show
+---+-----------+
|key|      value|
+---+-----------+
|  ;|<undefined>|
+---+-----------+

scala> spark.sql("set a;").show
+---+-----------+
|key|      value|
+---+-----------+
| a;|<undefined>|
+---+-----------+
```

In this PR, we:
1. Make `set spark.sql.abc=2; select 1;` in `SparkSession.sql` fail directly; users should call `.sql` for each statement separately.
2. Make the semicolon the separator of statements; if users want to use it as part of a property value, they shall use quotes.

### Why are the changes needed?

1. disambiguation for  `SparkSession.sql`
2. make semicolon work same both w/ `SET` and other statements

### Does this PR introduce _any_ user-facing change?

Yes.
The semicolon now works as a separator of statements: it is trimmed if it appears at the end of a statement, and the statement fails if a semicolon appears in the middle. You need to use quotes if you want a semicolon to be part of a property value.

### How was this patch tested?

new tests

Closes #30332 from yaooqinn/SPARK-33419.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 06:58:16 +00:00
ulysses 82a21d2a3e [SPARK-33433][SQL] Change Aggregate max rows to 1 if grouping is empty
### What changes were proposed in this pull request?

Change `Aggregate` max rows to 1 if grouping is empty.

### Why are the changes needed?

If the `Aggregate` grouping is empty, the result is always one row.

Then we don't need to push down the limit in `LimitPushDown` for cases such as
```
select count(*) from t1
union
select count(*) from t2
limit 1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30356 from ulysses-you/SPARK-33433.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-13 15:57:07 +09:00
Liang-Chi Hsieh 2c64b731ae [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
### What changes were proposed in this pull request?

This patch proposes to disable the streaming query with possible correctness issue in chained stateful operators. The behavior can be controlled by a SQL config, so if users understand the risk and still want to run the query, they can disable the check.

### Why are the changes needed?

The possible correctness issue in chained stateful operators in a streaming query is not straightforward for users. From the users' perspective, it will be considered a Spark bug. In the worst case, users are not even aware of the correctness issue and consume wrong results.

A better approach is to disable such queries and let users choose to run them if they understand the risk, instead of implicitly running the query and letting users find out the correctness issue by themselves and report it as a known issue to the Spark community.

### Does this PR introduce _any_ user-facing change?

Yes. A streaming query with a possible correctness issue will be blocked from running, unless users explicitly disable the check via the SQL config.

### How was this patch tested?

Unit test.

Closes #30210 from viirya/SPARK-33259.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 15:31:57 -08:00
Linhong Liu 1baf0d5c9b [SPARK-33140][SQL][FOLLOW-UP] change val to def in object rule
### What changes were proposed in this pull request?
In #30097, many rules changed from case class to object, but if the rule
is stateful, there will be a problem. For example, if an object rule uses a
`val` to refer to a config, it stays unchanged after initialization even if
another Spark session uses a different config value.
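
A minimal sketch of the pitfall (the object and config access below are illustrative, not Spark's actual rules):

```scala
import org.apache.spark.sql.internal.SQLConf

object SomeRule {
  // Captured once when the object is initialized; a later session with a
  // different config value still sees the first value.
  val caseSensitiveVal: Boolean = SQLConf.get.caseSensitiveAnalysis

  // Re-evaluated on every access, so it follows the active session's conf.
  def caseSensitiveDef: Boolean = SQLConf.get.caseSensitiveAnalysis
}
```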

### Why are the changes needed?
Avoid potential bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #30354 from linhongliu-db/SPARK-33140-followup-2.

Lead-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Co-authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-13 01:10:28 +09:00
gengjiaan 2f07c56810 [SPARK-33278][SQL] Improve the performance for FIRST_VALUE
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29800 provides a performance improvement for `NTH_VALUE`.
`FIRST_VALUE` could also use `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame`.
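
An illustrative query shape that benefits, a `FIRST_VALUE` over an unbounded window frame:

```scala
spark.sql("""
  SELECT id,
         FIRST_VALUE(id) OVER (
           ORDER BY id
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_id
  FROM range(10)
""").show()
```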

### Why are the changes needed?
Improve the performance for `FIRST_VALUE`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test.

Closes #30178 from beliefer/SPARK-33278.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 14:59:22 +00:00
ulysses a3d2954662 [SPARK-33421][SQL] Support Greatest and Least in Expression Canonicalize
### What changes were proposed in this pull request?

Add `Greatest` and `Least` check in `Canonicalize`.

### Why are the changes needed?

The children of both `Greatest` and `Least` are order-irrelevant.

Let's say we have `greatest(1, 2)` and `greatest(2, 1)`. We can get the same canonicalized expression in this case.
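
A sketch of the effect at the Catalyst level (assumes a build that includes this change; these are internal expression classes):

```scala
import org.apache.spark.sql.catalyst.expressions.{Greatest, Literal}

val e1 = Greatest(Seq(Literal(1), Literal(2)))
val e2 = Greatest(Seq(Literal(2), Literal(1)))
// With Greatest/Least handled in Canonicalize, the two are semantically equal.
assert(e1.semanticEquals(e2))
```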

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30330 from ulysses-you/SPARK-33421.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 20:26:33 +09:00
xuewei.linxuewei 6d31daeb6a [SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound
### What changes were proposed in this pull request?

Instead of returning NULL, throw a runtime ArrayIndexOutOfBoundsException when ANSI mode is enabled for the `element_at`, `elt`, and `GetArrayItem` functions.

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce any user-facing change?

When `spark.sql.ansi.enabled` = true, Spark will throw `ArrayIndexOutOfBoundsException` for an out-of-range index when accessing array elements.

### How was this patch tested?

Added UT and existing UT.

Closes #30297 from leanken/leanken-SPARK-33386.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 08:50:32 +00:00
stczwd 1eb236b936 [SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2
### What changes were proposed in this pull request?
This patch is trying to add `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API, defined in #28617.
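
The DDL shape this enables against a v2 table, as a sketch (catalog, table, and partition column names are assumptions):

```scala
// `testcat.ns.tbl` is assumed to be a v2 table partitioned by `dt`.
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-11-11')")
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (dt = '2020-11-11')")
```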

### Does this PR introduce _any_ user-facing change?
Yes. Users can use `alter table add partition` or `alter table drop partition` to create/drop partitions in a v2 table.

### How was this patch tested?
Run suites and fix old tests.

Closes #29339 from stczwd/SPARK-32512-new.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 09:30:42 +00:00
Wenchen Fan 8760032f4f [SPARK-33412][SQL] OverwriteByExpression should resolve its delete condition based on the table relation not the input query
### What changes were proposed in this pull request?

Make a special case in `ResolveReferences`, which resolves `OverwriteByExpression`'s condition expression based on the table relation instead of the input query.

### Why are the changes needed?

The condition expression is passed to the table implementation at the end, so we should resolve it using the table schema. Previously this worked because we had a hack in `ResolveReferences` to delay the resolution if `outputResolved == false`. However, this hack doesn't work for tables accepting any schema, like https://github.com/delta-io/delta/pull/521 . We may wrongly resolve the delete condition using the input query's output columns, which don't match the table column names.
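
A user-level sketch of the affected path (table name and data are assumptions): the condition passed to `overwrite` is now resolved against the target table's schema rather than the input query's output.

```scala
import org.apache.spark.sql.functions.col

// `testcat.ns.events` is assumed to be a writable v2 table with a `dt` column.
val df = spark.range(3).selectExpr("id", "'2020-11-11' AS dt")
df.writeTo("testcat.ns.events")
  .overwrite(col("dt") === "2020-11-11")  // resolved using the table's schema
```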

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests and updated test in v2 write.

Closes #30318 from cloud-fan/v2-write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 16:13:21 +09:00
Terry Kim 6d5d030957 [SPARK-33414][SQL] Migrate SHOW CREATE TABLE command to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `SHOW CREATE TABLE` works only with a v1 table and a permanent view, and not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("SHOW CREATE TABLE t AS SERDE") // Succeeds
```
With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883)
  at scala.Option.map(Option.scala:230)
```
, which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail.

### Does this PR introduce _any_ user-facing change?

After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`.

### How was this patch tested?

Updated existing tests.

Closes #30321 from imback82/show_create_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:54:27 +00:00
Max Gekk 1e2eeda20e [SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests
### What changes were proposed in this pull request?
In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource specific tests to the `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`.

### Why are the changes needed?
- The unification will allow to run common `SHOW TABLES` tests for both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`

Closes #30287 from MaxGekk/unify-dsv1_v2-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:26:46 +00:00
ulysses 5197c5d2e7 [SPARK-33390][SQL] Make Literal support char array
### What changes were proposed in this pull request?

Make Literal support char array.

### Why are the changes needed?

We always use `Literal()` to create a foldable value, and `char[]` is a common data type. We can make it easy to create a String `Literal` from a `char[]`.
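
A sketch of the new capability (Catalyst-internal API; assumes a build that includes this change):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// A char[] is expected to produce a string literal ("abc") instead of being rejected.
val lit = Literal(Array('a', 'b', 'c'))
```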

### Does this PR introduce _any_ user-facing change?

Yes, users can call `Literal()` with a `char[]`.

### How was this patch tested?

Add test.

Closes #30295 from ulysses-you/SPARK-33390.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 11:39:11 +09:00
Utkarsh 46346943bb [SPARK-33404][SQL] Fix incorrect results in date_trunc expression
### What changes were proposed in this pull request?
The following query produces incorrect results:
```
SELECT date_trunc('minute', '1769-10-17 17:10:02')
```
Spark currently incorrectly returns
```
1769-10-17 17:10:02
```
against the expected return value of
```
1769-10-17 17:10:00
```
**Steps to repro**
Run the following commands in spark-shell:
```
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
```
This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.

This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.

Closes #30303 from utkarsh39/trunc-timestamp-fix.

Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 09:28:59 +09:00
Liang-Chi Hsieh 6fa80ed1dd [SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions
### What changes were proposed in this pull request?

Currently we skip subexpression elimination in branches of conditional expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions.

### Why are the changes needed?

We may miss subexpression elimination chances in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from the query optimizer. For example, project collapsing could embed expressions between two `Project`s and produce a conditional expression like:

```
CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END
```

If `jsonToStruct(json)` is a time-expensive expression, we currently don't eliminate the duplication and waste time running it repeatedly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30245 from viirya/SPARK-33337.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-10 16:17:00 -08:00
angerszhu 34f5e7ce77 [SPARK-33302][SQL] Push down filters through Expand
### What changes were proposed in this pull request?
Push down filters through `Expand`. For the case below:
```
create table t1(pid int, uid int, sid int, dt date, suid int) using parquet;
create table t2(pid int, vs int, uid int, csid int) using parquet;

SELECT
       years,
       appversion,
       SUM(uusers) AS users
FROM   (SELECT
               Date_trunc('year', dt)          AS years,
               CASE
                 WHEN h.pid = 3 THEN 'iOS'
                 WHEN h.pid = 4 THEN 'Android'
                 ELSE 'Other'
               END                             AS viewport,
               h.vs                            AS appversion,
               Count(DISTINCT u.uid)           AS uusers
               ,Count(DISTINCT u.suid)         AS srcusers
        FROM   t1 u
               join t2 h
                 ON h.uid = u.uid
        GROUP  BY 1,
                  2,
                  3) AS a
WHERE  viewport = 'iOS'
GROUP  BY 1,
          2
```

Plan before this PR:
```
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
   +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
      +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
         +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
            +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
               +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
                     +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                        +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
                           +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
                              +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
                                 +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
                                    :- *(2) Project [uid#7, dt#9, suid#10]
                                    :  +- *(2) Filter isnotnull(uid#7)
                                    :     +- *(2) ColumnarToRow
                                    :        +- FileScan parquet default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date,suid:int>
                                    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233]
                                       +- *(1) Project [pid#11, vs#12, uid#13]
                                          +- *(1) Filter isnotnull(uid#13)
                                             +- *(1) ColumnarToRow
                                                +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [isnotnull(uid#13)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>
```

Plan after this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[years#0, appversion#2], functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L])
   +- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71]
      +- HashAggregate(keys=[years#0, appversion#2], functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2, sum#22L])
         +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[count(distinct uid#7)], output=[years#0, appversion#2, uusers#3L])
            +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true, [id=#67]
               +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[partial_count(distinct uid#7)], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, count#27L])
                  +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                     +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7, 5), true, [id=#63]
                        +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                           +- Project [uid#7, dt#9, pid#11, vs#12]
                              +- BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight, false
                                 :- Filter isnotnull(uid#7)
                                 :  +- FileScan parquet default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date>
                                 +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, false] as bigint)),false), [id=#58]
                                    +- Filter ((CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND isnotnull(uid#13))
                                       +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS), isnotnull..., Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>

```

### Why are the changes needed?
Improve performance by filtering more data earlier.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30278 from AngersZhuuuu/SPARK-33302.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:40:24 +00:00
xuewei.linxuewei e3a768dd79 [SPARK-33391][SQL] element_at with CreateArray not respect one based index
### What changes were proposed in this pull request?

`element_at` with `CreateArray` does not respect the one-based index.

Repro steps:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
– element_at(array(3, 2, 1), 0): integer (nullable = false)

root
– element_at(array(3, 2, 1), 1): integer (nullable = false)

root
– element_at(array(3, 2, 1), 2): integer (nullable = false)

root
– element_at(array(3, 2, 1), 3): integer (nullable = true)

correct answer should be
0 true which is outOfBounds return default true.
1 false
2 false
3 false

```

For expression evaluation, it respects the one-based index, but when checking nullability it calculates with a zero-based index in `computeNullabilityFromArray`.

### Why are the changes needed?

Correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and existing UT.

Closes #30296 from leanken/leanken-SPARK-33391.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 07:23:47 +00:00
Terry Kim 90f6f39e42 [SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `LOAD DATA` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds
```
With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
    at scala.Option.foreach(Option.scala:407)
```
, which is expected since temporary view is resolved first and `LOAD DATA` doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

### How was this patch tested?

Updated existing tests.

Closes #30270 from imback82/load_data_cmd.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:28:06 +00:00
Gengliang Wang a1f84d8714 [SPARK-33369][SQL] DSV2: Skip schema inference in write if table provider supports external metadata
### What changes were proposed in this pull request?

When `TableProvider.supportsExternalMetadata()` is true, Spark will use the input DataFrame's schema in `DataFrameWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.
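
A minimal provider sketch showing the opt-in (the class name and behavior are illustrative; only the `TableProvider` methods shown are real API):

```scala
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class MyWriteOnlySource extends TableProvider {
  // Opt in: on the write path, Spark passes the input DataFrame's schema
  // instead of calling inferSchema.
  override def supportsExternalMetadata(): Boolean = true

  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException("schema must be provided externally")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: java.util.Map[String, String]): Table = {
    // Build a Table from the externally provided schema (left out of this sketch).
    ???
  }
}
```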

### Why are the changes needed?

For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataFrameWriter.save()`/`DataStreamWriter.start()`.
The inference of table schema/partitioning can be expensive. However, there is no such trait or flag for indicating that a V2 source can use the input DataFrame's schema on `DataFrameWriter.save()`/`DataStreamWriter.start()`. We can resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`.

### Does this PR introduce _any_ user-facing change?

Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true.

### How was this patch tested?

Unit test

Closes #30273 from gengliangwang/supportsExternalMetadata.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 04:43:32 +00:00
Wenchen Fan 98730b7ee2 [SPARK-33087][SQL] DataFrameWriterV2 should delegate table resolution to the analyzer
### What changes were proposed in this pull request?

This PR makes `DataFrameWriterV2` to create query plans with `UnresolvedRelation` and leave the table resolution work to the analyzer.

### Why are the changes needed?

Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes.

### Does this PR introduce _any_ user-facing change?

Yes.
1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT.
2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed.
3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed.
4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command)

### How was this patch tested?

new tests

Closes #29970 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:08:00 +00:00
yangjie01 02fd52cfbc [SPARK-33352][CORE][SQL][SS][MLLIB][AVRO][K8S] Fix procedure-like declaration compilation warnings in Scala 2.13
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```

This PR is the first part of resolving SPARK-33352, as sketched below:

- For constructor definitions, add `=` to convert them to function syntax

- For method definitions without a return type, add `: Unit =` to convert them to function syntax
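
A minimal sketch of both rewrites (the class and methods are only examples):

```scala
class Worker(name: String) {
  // Constructor: procedure syntax `def this() { ... }` becomes an `=` definition.
  def this() = { this("default") }

  // Method with no declared return type: `def run() { ... }` becomes `: Unit =`.
  def run(): Unit = {
    println(s"$name is running")
  }
}
```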

### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-08 12:51:48 -06:00
Hannah Amundson 1090b1b00a [SPARK-32860][DOCS][SQL] Updating documentation about map support in Encoders
### What changes were proposed in this pull request?

Javadocs updated for the encoder to include maps as a collection type

### Why are the changes needed?

The javadocs were not updated with the fix for SPARK-16706.

### Does this PR introduce _any_ user-facing change?

Yes, the javadocs are updated

### How was this patch tested?

sbt was run to ensure it meets scalastyle

Closes #30274 from hannahkamundson/SPARK-32860.

Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-08 20:29:24 +09:00
Stuart White 09fa7ecae1 [SPARK-33291][SQL] Improve DataFrame.show for nulls in arrays and structs
### What changes were proposed in this pull request?
The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit.

Here's sample code to illustrate the behavior I'd like to clean up:

```scala
val rows = Seq[String](null)
  .toDF("value")
  .withColumn("struct1", struct('value as "value1"))
  .withColumn("struct2", struct('value as "value1", 'value as "value2"))
  .withColumn("array1", array('value))
  .withColumn("array2", array('value, 'value))

// Show the DataFrame using the "first" codepath.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2      |array1|array2  |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[]    |[, null]|
+-----+-------+-------------+------+--------+

// Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)

+-----+-------+-------------+-------+-------------+
|value|struct1|struct2      |array1 |array2       |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```

Notice:

1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}").  I think it's preferable to print it without the leading space (e.g. "\{null\}").  This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space.  I think both code paths should be consistent and print it without a leading space (e.g. "[null]").

The desired result of this PR is to produce the following output via both code paths:

```
+-----+-------+------------+------+------------+
|value|struct1|struct2     |array1|array2      |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```

This contribution is my original work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs.

### Does this PR introduce _any_ user-facing change?

Yes.  This PR changes what is printed out by DataFrame.show().

### How was this patch tested?

I added new test cases in CastSuite.scala to cover the cases addressed by this PR.

Closes #30189 from stwhit/show_nulls.

Authored-by: Stuart White <stuart.white1@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-06 13:12:35 -08:00
Terry Kim 68c032c246 [SPARK-33364][SQL] Introduce the "purge" option in TableCatalog.dropTable for v2 catalog
### What changes were proposed in this pull request?

This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed.

Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110

### Why are the changes needed?

Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs.
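A hedged spark-shell sketch of the user-facing DDL that carries the purge flag (the catalog name `testcat` and table are hypothetical):

```scala
// Assumes a session with a v2 catalog registered as `testcat` (hypothetical).
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")

// The PURGE keyword is accepted by Spark DDL; with this change the flag can be
// passed down to the v2 catalog instead of being silently ignored.
spark.sql("DROP TABLE testcat.ns.t PURGE")
```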

### Does this PR introduce _any_ user-facing change?

This PR introduces a new API in `TableCatalog`.

### How was this patch tested?

Added a test.

Closes #30267 from imback82/purge_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 22:00:45 -08:00
Wenchen Fan d16311051d [SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments
### What changes were proposed in this pull request?

1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.

### Why are the changes needed?

code refinement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #30261 from cloud-fan/window.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:20:25 +00:00
Dongjoon Hyun 90f35c663e [MINOR][SQL] Fix incorrect JIRA ID comments in Analyzer
### What changes were proposed in this pull request?

This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by  SPARK-31670 (https://github.com/apache/spark/pull/28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```

### Why are the changes needed?

Fix the wrong information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a comment change. Manually review.

Closes #30269 from dongjoon-hyun/SPARK-31670-MINOR.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-06 12:46:26 +09:00
Wenchen Fan cd4e3d3b0c [SPARK-33360][SQL] Simplify DS v2 write resolution
### What changes were proposed in this pull request?

Remove duplicated code in `ResolveOutputRelation` by adding `V2WriteCommand.withNewQuery`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #30264 from cloud-fan/ds-minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 15:44:04 -08:00
Wenchen Fan 26ea417b14 [SPARK-33362][SQL] skipSchemaResolution should still require query to be resolved
### What changes were proposed in this pull request?

Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.

### Why are the changes needed?

To prevent potential bugs where we skip resolving the input query.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new test

Closes #30265 from cloud-fan/ds-minor-2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 09:23:41 -08:00
Dongjoon Hyun 42c0b175ce [SPARK-33338][SQL] GROUP BY using literal map should not fail
### What changes were proposed in this pull request?

This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions whose literal maps are backed by `ArrayBasedMapData` and `GenericArrayData`.

### Why are the changes needed?

This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```

Apache Spark 2.x through 3.0.1 raises a `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```

**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```

**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-04 08:35:10 -08:00
Terry Kim 0ad35ba5f8 [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ANALYZE TABLE` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior:
```scala
sql("create temporary view t as select 1")
sql("create database db")
sql("create table db.t using csv as select 1")
sql("use db")
sql("ANALYZE TABLE t compute statistics") // Succeeds
```
With this change, ANALYZE TABLE above fails with the following:
```
    org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856)
```
This is expected, since the temporary view is resolved first and `ANALYZE TABLE` doesn't support temporary views.

### Does this PR introduce _any_ user-facing change?

After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`.

### How was this patch tested?

Updated existing tests.

Closes #30229 from imback82/parse_v1table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-04 06:50:37 +00:00
Wenchen Fan 034070a23a Revert "[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size"
This reverts commit 0c943cd2fb.
2020-11-04 12:30:38 +08:00
Max Gekk eecebd0302 [SPARK-33306][SQL][FOLLOWUP] Group DateType and TimestampType together in needsTimeZone()
### What changes were proposed in this pull request?
In this PR, I propose to group `DateType` and `TimestampType` together when checking whether a time zone is needed in the `Cast.needsTimeZone()` method.
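A hypothetical sketch of what such grouping can look like in a pattern match (illustrative only, not the actual Spark source):

```scala
import org.apache.spark.sql.types._

// Illustrative only: handle DateType and TimestampType in a single match arm
// instead of spelling out each (from, to) pair separately.
def needsTimeZoneSketch(from: DataType, to: DataType): Boolean = (from, to) match {
  case (StringType, DateType | TimestampType) => true
  case (DateType | TimestampType, StringType) => true
  case _ => false // other cases elided in this sketch
}
```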

### Why are the changes needed?
To improve code maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By the existing test `"SPARK-33306: Timezone is needed when cast Date to String"`.

Closes #30223 from MaxGekk/WangGuangxin-SPARK-33306-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 10:07:18 -08:00
wangguangxin.cn 69c27f49ac [SPARK-33306][SQL] Timezone is needed when cast date to string
### What changes were proposed in this pull request?
When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a time zone is needed when casting a date to a string, as seen in 72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309).

However, the time zone may not be set, because `CastBase.needsTimeZone` returns false for this kind of casting.

A simple way to reproduce this is
```
spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true

```
when we execute the following sql,
```
select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2
```
it will throw
```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
```

### Why are the changes needed?
As described above, it's a bug here.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add more UT

Closes #30213 from WangGuangxin/SPARK-33306.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 15:14:46 -07:00
angerszhu 0c943cd2fb [SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size
### What changes were proposed in this pull request?
Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size.
Since we can't decide whether it's a bug, and some users need the behavior to be the same as Hive's, a legacy configuration is provided.

### Why are the changes needed?
Provides a choice between the historical Spark behavior and Hive-compatible behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30156 from AngersZhuuuu/SPARK-33284.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 14:11:25 +09:00
Max Gekk 343e0bb3ad [SPARK-33286][SQL] Improve the error message about schema parsing by from_json/from_csv
### What changes were proposed in this pull request?
In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers:
- DataType.fromJson (except CSV)
- CatalystSqlParser.parseDataType
- CatalystSqlParser.parseTableSchema

Before the changes, `from_json` does not show the error message from the first parser in the chain, which could mislead users.

### Why are the changes needed?
Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example:

```scala
    val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
    df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show()
```
The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out:
```
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '{' expecting {'ADD', 'AFTER',  ... }(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### Does this PR introduce _any_ user-facing change?
Yes, after the changes for the example above:
```
Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field.
Failed fallback parsing: Cannot parse the data type:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

Failed fallback parsing:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### How was this patch tested?
- By existing tests suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`.
- Add new test to `JsonFunctionsSuite`.
- Re-gen results for `json-functions.sql`.

Closes #30183 from MaxGekk/fromDDL-error-msg.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 11:18:47 +09:00
Dongjoon Hyun 838791bf0b [SPARK-33292][SQL] Make Literal ArrayBasedMapData string representation disambiguous
### What changes were proposed in this pull request?

This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`.

### Why are the changes needed?

A literal `ArrayBasedMapData` has an inconsistent string representation between the `LogicalPlan` and the `Optimized Logical Plan`/`Physical Plan`. Also, the representation in the `Optimized Logical Plan` and `Physical Plan` is ambiguous, e.g. `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`.

**BEFORE**
```scala
scala> spark.version
res0: String = 2.4.7

scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#0, 'map(key1, value1) AS b#1]
+- OneRowRelation

== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#0, map(key1, value1) AS b#1]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- Scan OneRowRelation[]
```

**AFTER**
```scala
scala> spark.version
res0: String = 3.1.0-SNAPSHOT

scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#4, 'map(key1, value1) AS b#5]
+- OneRowRelation

== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#4, map(key1, value1) AS b#5]
+- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- *(1) Scan OneRowRelation[]
```

### Does this PR introduce _any_ user-facing change?

Yes. This changes the query plan's string representation in `explain` command and UI. However, this is a bug fix.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #30190 from dongjoon-hyun/SPARK-33292.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-29 19:10:01 -07:00
luluorta cbd3fdea62 [SPARK-33008][SQL] Division by zero on divide-like operations returns incorrect result
### What changes were proposed in this pull request?
In ANSI mode, when a division by zero occurs while performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.
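A hedged spark-shell sketch of the intended post-fix behavior in ANSI mode (exact exception messages may differ):

```scala
// Enable ANSI mode so divide-like operations follow the SQL standard.
spark.conf.set("spark.sql.ansi.enabled", "true")

// With the fix, each of these should raise an arithmetic error instead of
// returning an incorrect value.
spark.sql("SELECT 1 / 0").show()       // Divide
spark.sql("SELECT 1 div 0").show()     // IntegralDivide
spark.sql("SELECT 1 % 0").show()       // Remainder
spark.sql("SELECT pmod(1, 0)").show()  // Pmod
```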

### Why are the changes needed?
The results are incorrect.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
added UT + existing UTs (improved)

Closes #29882 from luluorta/SPARK-33008.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-29 16:44:17 +00:00
Liang-Chi Hsieh 056b62264b [SPARK-33263][SS] Configurable StateStore compression codec
### What changes were proposed in this pull request?

This patch proposes to make StateStore compression codec configurable.

### Why are the changes needed?

Currently the compression codec of StateStore is not configurable and is hard-coded to lz4. It is better to follow other Spark modules and make the StateStore compression codec configurable. For example, we could choose the zstd codec, which supports different compression levels.

### Does this PR introduce _any_ user-facing change?

Yes, after this change users can configure a different codec for StateStore.

### How was this patch tested?

Unit test.

Closes #30162 from viirya/SPARK-33263.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-29 07:44:44 -07:00
Max Gekk b409025641 [SPARK-33281][SQL] Return SQL schema instead of Catalog string from the SchemaOfCsv expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.

### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.

### Does this PR introduce _any_ user-facing change?
Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the schema format shouldn't matter much.

Before:
```
> SELECT schema_of_csv('1,abc');
  struct<_c0:int,_c1:string>
```

After:
```
> SELECT schema_of_csv('1,abc');
  STRUCT<`_c0`: INT, `_c1`: STRING>
```

### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.

Closes #30180 from MaxGekk/schema_of_csv-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 21:02:10 +09:00
Max Gekk 9d5e48ea95 [SPARK-33270][SQL] Return SQL schema instead of Catalog string from the SchemaOfJson expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.

### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.

Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```

### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.

Before: `struct<col:bigint>`
After: `STRUCT<`col`: BIGINT>`

### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Closes #30172 from MaxGekk/schema_of_json-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:30:41 +09:00
Nathan Wreggit c592ae6ed8 [SQL][MINOR] Update from_unixtime doc
### What changes were proposed in this pull request?
This PR fixes the from_unixtime documentation to show that fmt is an optional parameter.
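A small hedged usage sketch (the printed values depend on the session time zone):

```scala
// fmt omitted: defaults to 'yyyy-MM-dd HH:mm:ss'.
spark.sql("SELECT from_unixtime(0)").show(false)

// fmt supplied explicitly.
spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd')").show(false)
```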

### Does this PR introduce _any_ user-facing change?
Yes, documentation update.
**Before change:**
![image](https://user-images.githubusercontent.com/4176173/97497659-18c6cc80-1928-11eb-93d8-453ef627ac7c.png)

**After change:**
![image](https://user-images.githubusercontent.com/4176173/97496153-c5537f00-1925-11eb-8102-457e85e019d5.png)

### How was this patch tested?
Style check using: ./dev/run-tests
Manual check and screenshotting with: ./sql/create-docs.sh
Manual verification of behavior with latest spark-sql binary.

Closes #30176 from Obbay2/from_unixtime_doc.

Authored-by: Nathan Wreggit <obbay2@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:28:50 +09:00
Wenchen Fan 2639ad43cb [SPARK-33272][SQL] prune the attributes mapping in QueryPlan.transformUpWithNewOutput
### What changes were proposed in this pull request?

For complex query plans, `QueryPlan.transformUpWithNewOutput` will keep accumulating the attributes mapping to be propagated, which may hurt performance. This PR prunes the attributes mapping before propagating.

### Why are the changes needed?

A simple perf improvement.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #30173 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-29 07:37:16 +09:00
Jungtaek Lim (HeartSaVioR) a744fea3be [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values contains null
### What changes were proposed in this pull request?

This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which relies on the key's hash code, hence it can trigger the NPE issue.

### Why are the changes needed?

This is an obvious bug as `In` filter doesn't care about null value when calculating hash code.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, a query having `null` in an "in" condition against a data source V2 table supporting filter pushdown failed with an NPE, whereas after the PR the query will not fail.

### How was this patch tested?

UT added. The new UT fails without the PR and passes with the PR.

Closes #30170 from HeartSaVioR/SPARK-33267.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-28 10:00:29 -07:00
Takeshi Yamamuro a6216e2446 [SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType
### What changes were proposed in this pull request?

This PR intends to fix bugs in casting data from/to PythonUserDefinedType. A sequence of queries to reproduce the issue is as follows:
```
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import *
>>> from pyspark.testing.sqlutils import *
>>>
>>> row = Row(point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>> df.select(col("point").cast(PythonOnlyUDT()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
: java.lang.NullPointerException
	at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
	at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
	at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
```
A root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a null pointer exception. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.

### Why are the changes needed?

Bug fixes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #30169 from maropu/FixPythonUDTCast.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-28 08:33:02 -07:00
gengjiaan 3c3ad5f7c0 [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
### What changes were proposed in this pull request?
Spark SQL supports window functions like `NTH_VALUE`.
If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations.
For example: if we execute the SQL show below:
```
SELECT NTH_VALUE(col, 2) OVER (
    ORDER BY rank
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
```
For row numbers greater than 1, the output is the fixed value; otherwise, it is null. So we just calculate the value once and check whether the row number is less than 2.
`UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` is simpler.

### Why are the changes needed?
Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`.

### Does this PR introduce _any_ user-facing change?
 'No'.

### How was this patch tested?
Jenkins test.

Closes #29800 from beliefer/optimize-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 06:40:23 +00:00
allisonwang-db 9fb45361fd [SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts
### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.
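A hedged DataFrame-level illustration of the scenario: the inner sort is per-partition (local) while the outer one is global, so the outer sort must be kept.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sort-demo").getOrCreate()
import spark.implicits._

val df = Seq(3, 1, 2).toDF("a")

// Sort(orders, global = false): orders rows within each partition only.
val locallySorted = df.sortWithinPartitions($"a")

// Sort(orders, global = true): must not be eliminated even though each
// partition is already ordered, because only it guarantees a total order.
val globallySorted = locallySorted.orderBy($"a")
globallySorted.explain()
```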

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #30093 from allisonwang-db/fix-sort.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 05:51:47 +00:00
Terry Kim 528160f001 [SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("DROP TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?

After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`, consistent with v1 behavior.

### How was this patch tested?

Added a new test

Closes #30079 from imback82/drop_table_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 05:44:55 +00:00
Jungtaek Lim (HeartSaVioR) fcf8aa59b5 [SPARK-33240][SQL] Fail fast when fails to instantiate configured v2 session catalog
### What changes were proposed in this pull request?

This patch proposes to change the behavior so that Spark fails fast when it cannot instantiate the configured v2 session catalog.

### Why are the changes needed?

The current Spark behavior goes against the intention of end users: if end users configure a session catalog that Spark fails to initialize, Spark swallows the error, only logs the error message, and silently uses the default catalog implementation.

This follows the voices on [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) in dev mailing list.

### Does this PR introduce _any_ user-facing change?

Yes. After the PR Spark will fail immediately if Spark fails to instantiate configured session catalog.

### How was this patch tested?

New UT added.

Closes #30147 from HeartSaVioR/SPARK-33240.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 03:31:11 +00:00
Ankur Dave 3f2a2b5fe6 [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream
### What changes were proposed in this pull request?

The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.

```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```

When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.

Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.

The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.
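A plain Scala 2.12 illustration of the underlying pitfall: side effects inside a `Stream`'s `map` run lazily, so mutations of shared state (like `ctx` above) happen later than expected.

```scala
// Collect side effects so we can observe when they actually run.
val sideEffects = scala.collection.mutable.ArrayBuffer.empty[Int]

// Stream.map evaluates the head eagerly but defers the tail.
val mapped = Stream(1, 2, 3).map { x => sideEffects += x; x * 2 }
println(sideEffects)   // ArrayBuffer(1): only the head has been evaluated so far

// Forcing the Stream runs the remaining deferred side effects.
mapped.foreach(_ => ())
println(sideEffects)   // ArrayBuffer(1, 2, 3)
```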

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.

Closes #30160 from ankurdave/SPARK-33260.

Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-27 13:20:22 -07:00
tanel.kiis@gmail.com 281f99c70b [SPARK-33225][SQL] Extract AliasHelper trait
### What changes were proposed in this pull request?

Extract methods related to handling Aliases to a trait.

### Why are the changes needed?

Avoid code duplication

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs cover this

Closes #30134 from tanelk/SPARK-33225_aliasHelper.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-27 22:53:05 +09:00
xuewei.linxuewei 537a49fc09 [SPARK-33140][SQL] remove SQLConf and SparkSession in all sub-class of Rule[QueryPlan]
### What changes were proposed in this pull request?

Since [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, SQLConf.get and SparkSession.active are more reliable. We are trying to refine the existing code that passes SQLConf and SparkSession into subclasses of Rule[QueryPlan].

In this PR.

* remove SQLConf from the constructor parameters of all subclasses of Rule[QueryPlan].
* use SQLConf.get to replace the original SQLConf instance.
* remove SparkSession from the constructor parameters of all subclasses of Rule[QueryPlan].
* use SparkSession.active to replace the original SparkSession instance.

### Why are the changes needed?

Code refine.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existing test

Closes #30097 from leanken/leanken-SPARK-33140.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-27 12:40:57 +00:00
Cheng Su 1042d49bf9 [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)
### What changes were proposed in this pull request?

This PR enables auto bucketed table scan by default, with the exception that it is disabled for cached queries (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregates.

### Why are the changes needed?

Enabling auto bucketed table scan by default is useful as it can optimize queries automatically under the hood, without user interaction.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.

Closes #30138 from c21/enable-auto-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-26 20:23:24 +09:00
Yuning Zhang a21945ce6c [SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations take effect at runtime
### What changes were proposed in this pull request?

Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.

### Why are the changes needed?

`spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
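A minimal usage sketch of the intended behavior after this patch:

```scala
// Changing the (non-static) conf at runtime now affects subsequent analysis.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
spark.sql("SELECT 1 AS one").collect()  // analyzed with the updated iteration limit
```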

### Does this PR introduce _any_ user-facing change?

Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.

### How was this patch tested?

modified unit test

Closes #30108 from yuningzh-db/dynamic-analyzer-max-iterations.

Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-26 16:19:06 +09:00
Cheng Su d87a0bb2ca [SPARK-32862][SS] Left semi stream-stream join
### What changes were proposed in this pull request?

This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For left side input row, check if there's a match on right side state store.
  * if there's a match, output the left side row, but do not put the row in left side state store (no need to put in state store).
  * if there's no match, output nothing, but put the row in left side state store (with "matched" field to set to false in state store).
* For right side input row, check if there's a match on left side state store.
  * For all matched left rows in state store, output the rows with "matched" field as false. Set all left rows with "matched" field to be true. Only output the left side rows matched for the first time to guarantee left semi join semantics.
* State store eviction: evict rows from left/right side state store below watermark, same as inner join.

Note a followup optimization can be to evict matched left side rows from state store earlier, even when the rows are still above watermark. However this needs more change in `SymmetricHashJoinStateManager`, so will leave this as a followup.
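A hedged usage sketch of the new join type (the rate source, column names and time bound are illustrative):

```scala
// Assumes a spark-shell session (`spark` in scope).
import org.apache.spark.sql.functions._

// Two illustrative streams standing in for impressions and clicks.
val impressions = spark.readStream.format("rate").load()
  .select(col("value").as("adId"), col("timestamp").as("impressionTime"))
  .withWatermark("impressionTime", "10 seconds")

val clicks = spark.readStream.format("rate").load()
  .select(col("value").as("clickAdId"), col("timestamp").as("clickTime"))
  .withWatermark("clickTime", "10 seconds")

// Emit each impression at most once, and only if a matching click arrives
// within the time bound -- left semi semantics.
val matched = impressions.join(
  clicks,
  expr("adId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 minute"),
  "left_semi")

val query = matched.writeStream.format("console").start()
```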

### Why are the changes needed?

Currently, stream-stream join supports inner, left outer and right outer joins (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166). Internally we see a lot of users using left semi stream-stream joins (outside Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks per ad (left semi semantics).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30076 from c21/stream-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-10-26 13:33:06 +09:00
Liang-Chi Hsieh 87b32f65ef [MINOR][DOCS][TESTS] Fix PLAN_CHANGE_LOG_LEVEL document
### What changes were proposed in this pull request?

`PLAN_CHANGE_LOG_LEVEL` config document is wrong. This is to fix it.

### Why are the changes needed?

Fix wrong doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #30136 from viirya/minor-sqlconf.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-23 13:35:46 +09:00
Ankit Srivastava 3819d39607 [SPARK-32998][BUILD] Add ability to override default remote repos with internal one
### What changes were proposed in this pull request?
- Building Spark internally in orgs where access to the outside internet is not allowed takes a long time because unsuccessful attempts are made to download artifacts from repositories which are not accessible. The unsuccessful attempts unnecessarily add a significant amount of time to the build; I have seen a difference of up to 1 hr for some runs.
- Adds one environment variable that, if present at the start of the build, overrides the default repos defined in the code and scripts.
envVariables:
      - DEFAULT_ARTIFACT_REPOSITORY=https://artifacts.internal.com/libs-release/

### Why are the changes needed?

To allow orgs to build spark internally without relying on external repositories for artifact downloads.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Multiple builds with and without env variables set.

Closes #29874 from ankits/SPARK-32998.

Authored-by: Ankit Srivastava <ankit_srivastava@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-22 16:35:55 -07:00
Xuedong Luan d9ee33cfb9 [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver
### What changes were proposed in this pull request?

Support automatically cancelling queries that run too long on the Thrift server.

This is a rework of #28991 and the credit should go to the original author, leoluan2009.

Closes #28991

### Why are the changes needed?

In some cases, we use the Thrift server as a long-running application.
Sometimes we want no query to run longer than a given time.
In these cases, we can enable auto cancellation for time-consuming queries, which lets us release resources for other queries to run.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #29933 from maropu/pr28991.

Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-22 17:23:10 +09:00
gengjiaan eb33bcb4b2 [SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE
### What changes were proposed in this pull request?
`REGEXP_REPLACE` replaces all substrings of a string that match a regexp with a replacement string.
But `REGEXP_REPLACE` lacks some flexibility, such as converting camel case strings to lower case words separated by underscores:
AddressLine1 -> address_line_1
If we support the position parameter, we can do it like this (e.g. in Oracle):

```
WITH strings as (
  SELECT 'AddressLine1' s FROM dual union all
  SELECT 'ZipCode' s FROM dual union all
  SELECT 'Country' s FROM dual
)
  SELECT s "STRING",
         lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
  FROM strings;
```
The output:
```
  STRING               MODIFIED_STRING
-------------------- --------------------
AddressLine1         address_line_1
ZipCode              zip_code
Country              country
```

Some mainstream databases support this syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html

### Why are the changes needed?
The position parameter for `REGEXP_REPLACE` is very useful.
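A hedged Spark SQL sketch of the same transformation, assuming the optional position argument added by this PR (note that Spark uses `$1` rather than Oracle's `\1` for backreferences):

```scala
// Start matching at position 2 so the leading capital is left untouched.
spark.sql(
  "SELECT lower(regexp_replace('AddressLine1', '([A-Z0-9])', '_$1', 2)) AS modified_string"
).show(false)
// Expected: address_line_1
```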

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test.

Closes #29891 from beliefer/add-position-for-regex_replace.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 07:59:49 +00:00
Chao Sun cb3fa6c936 [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. The latest Hadoop versions have upgraded to Guava 27+, and in order to adopt them in Spark, we'll need to resolve the Guava conflicts. This PR takes the approach of switching to the shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #29843 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-10-22 03:21:34 +00:00
Max Gekk ba13b94f6b [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
The current default value `LEGACY` may lead to shifted timestamps on read or write. We should leave the decision about rebasing to users.
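A hedged spark-shell sketch using the configs named above (with `EXCEPTION`, the new default, Spark raises an error on ambiguous INT96 timestamps instead of silently rebasing):

```scala
// Opt in to writing INT96 values without rebasing.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")

// An ancient timestamp whose calendar interpretation differs between the
// hybrid (Julian + Gregorian) and Proleptic Gregorian calendars.
spark.sql("SELECT TIMESTAMP'1001-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/int96_rebase_demo")
```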

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 03:04:29 +00:00
Max Gekk a44e008de3 [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamps rebasing in saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` isn't set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` to control loading INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about parquet writer - either INT96 was written by Proleptic Gregorian system or some Julian one.
4. Modified Vectorized and Parquet-mr Readers to support loading/saving INT96 timestamps w/o rebasing depending on SQL config and the metadata tag:
    - **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
    - **No rebasing** if parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
    - **With rebasing** if parquet files saved by Spark (any versions) without the metadata tag `org.apache.spark.int96NoRebase`.
    - Rebasing depend on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.

New SQL configs are added instead of re-using existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` because of:
- To allow users have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.

### Why are the changes needed?
1. Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader ( or a writer) is based on the Julian calendar. So, rebasing from Proleptic Gregorian to Julian calendar can be not needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.

### Does this PR introduce _any_ user-facing change?
It can when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set non-default value `LEGACY`.

### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`

Closes #30056 from MaxGekk/parquet-rebase-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 14:58:59 +09:00
Liang-Chi Hsieh 66c5e01322 [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase
### What changes were proposed in this pull request?

This patch proposes to further optimize `UpdateFields` expression chains and to optimize `UpdateFields` early in the analysis phase.

### Why are the changes needed?

`UpdateFields` can manipulate complex nested data, but using `UpdateFields` can easily create an inefficient expression chain. We should optimize it further.

Because the `UpdateFields` expression tree can become too complex to analyze when manipulating deeply nested schemas, this change optimizes `UpdateFields` early in the analysis phase.
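A hedged illustration of how such a chain arises from the user-facing API (`Column.withField`); each call adds another `UpdateFields` node that can now be collapsed early:

```scala
import org.apache.spark.sql.functions._

// A single-row DataFrame with a nested struct column.
val df = spark.sql("SELECT named_struct('a', 1, 'b', 2) AS s")

// Chained withField calls build an UpdateFields expression chain.
val updated = df.withColumn("s", col("s").withField("a", lit(10)).withField("c", lit(3)))
updated.show(false)  // expected roughly: {10, 2, 3}
```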

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29812 from viirya/SPARK-32941.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-19 10:35:34 -07:00
tanel.kiis@gmail.com ce498943d2 [SPARK-33177][SQL] CollectList and CollectSet should not be nullable
### What changes were proposed in this pull request?

Mark `CollectList` and `CollectSet` as non-nullable.

### Why are the changes needed?

The `CollectList` and `CollectSet` SQL expressions never return a null value. Marking them as non-nullable can have some performance benefits, because some optimizer rules apply only to non-nullable expressions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Did not find any existing tests on the nullability of aggregate functions.

Closes #30087 from tanelk/SPARK-33177_collect.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 09:50:59 +09:00
Liang-Chi Hsieh 3010e9044e [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in FileFormatWriter
### What changes were proposed in this pull request?

This patch proposes to add a config that controls the fast-fail behavior in FileFormatWriter, set to false by default.

### Why are the changes needed?

In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.

Based on the latest discussion, it is important to be able to keep the original behavior, i.e. to retry tasks even if `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` can be recoverable in some cases.

We are going to add a config to control this behavior, with fast-fail disabled by default.

### Does this PR introduce _any_ user-facing change?

Yes. By default, the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown; this is the behavior before Spark 3.0. Users can get the fast-fail behavior by enabling the config.

### How was this patch tested?

Unit test.

Closes #30073 from viirya/SPARK-33170.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-17 21:02:25 -07:00
Liang-Chi Hsieh 2c4599db4b [MINOR][SS][DOCS] Update Structured Streaming guide doc and update code typo
### What changes were proposed in this pull request?

This is a minor change to update the structured-streaming-programming-guide and fix typos in code.

### Why are the changes needed?

Keep the user-facing document correct and updated.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #30074 from viirya/ss-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 22:18:12 -07:00
Liang-Chi Hsieh e574fcd230 [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns
### What changes were proposed in this pull request?

SPARK-29358 added support for `unionByName` to work when the two datasets didn't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds the support to work with struct columns.

The behavior before this PR:

```scala
scala> val df1 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2, 'a', id + 3) c1")
scala> val df2 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2) c1")
scala> df1.unionByName(df2, true).printSchema
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c:bigint,b:bigint> <> struct<c:bigint,b:bigint,a:bigint> at the second column of the second table;;
'Union false, false
:- Project [id#0L AS c0#2L, named_struct(c, (id#0L + cast(1 as bigint)), b, (id#0L + cast(2 as bigint)), a, (id#0L + cast(3 as bigint))) AS c1#3]
:  +- Range (0, 1, step=1, splits=Some(12))
+- Project [c0#8L, c1#9]
   +- Project [id#6L AS c0#8L, named_struct(c, (id#6L + cast(1 as bigint)), b, (id#6L + cast(2 as bigint))) AS c1#9]
      +- Range (0, 1, step=1, splits=Some(12))
```

The behavior after this PR:

```scala
scala> df1.unionByName(df2, true).printSchema
root
 |-- c0: long (nullable = false)
 |-- c1: struct (nullable = false)
 |    |-- a: long (nullable = true)
 |    |-- b: long (nullable = false)
 |    |-- c: long (nullable = false)
scala> df1.unionByName(df2, true).show()
+---+-------------+
| c0|           c1|
+---+-------------+
|  0|    {3, 2, 1}|
|  0|{ null, 2, 1}|
+---+-------------+
```

### Why are the changes needed?

The `allowMissingColumns` option of `unionByName` allows merging two datasets with different schemas when unioning them together. Nested column support makes the feature more general and flexible to use.

### Does this PR introduce _any_ user-facing change?

Yes, after this change users can union two datasets with different schema with different structs.

### How was this patch tested?

Unit tests.

Closes #29587 from viirya/SPARK-32376.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-16 14:48:14 -07:00
ulysses 3ae1520185 [SPARK-33131][SQL] Fix grouping sets with having clause can not resolve qualified col name
### What changes were proposed in this pull request?

Correct the resolution of having clause.

### Why are the changes needed?

Grouping sets construct a new aggregate and lose the qualified name of the grouping expression. Here is an example:
```
-- Works resolved by `ResolveReferences`
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1

-- Works because of the extra expression c1
select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1

-- Failed
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
```

It works with a plain `Aggregate` (without grouping sets) through `ResolveReferences`, but grouping sets do not work since the exprId has been changed.

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Added a test.

Closes #30029 from ulysses-you/SPARK-33131.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 11:26:27 +00:00
gengjiaan b69e0651fe [SPARK-33126][SQL] Simplify offset window function(Remove direction field)
### What changes were proposed in this pull request?
The current `Lead`/`Lag` extends `OffsetWindowFunction`. `OffsetWindowFunction` contains a `direction` field and uses it to calculate the `boundary`.

We can use a single literal expression to unify the two properties.
For example:
3 means `direction` is Asc and `boundary` is 3.
-3 means `direction` is Desc and `boundary` is -3.
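
As a point of reference, a minimal usage sketch of the user-facing `lead`/`lag` functions (the DataFrame `df` and its columns `ts`/`v` are placeholders, not from this PR); per the description above, the offset literal now also carries the direction via its sign internally:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

val w = Window.orderBy(col("ts"))
// lead looks 3 rows forward, lag looks 3 rows back
df.select(lead(col("v"), 3).over(w), lag(col("v"), 3).over(w))
```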

### Why are the changes needed?
Improve the current implementation of `Lead`/`Lag`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test.

Closes #30023 from beliefer/SPARK-33126.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 11:11:57 +00:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] protect setActiveSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.

Change of the PR:

* add a legacy config `spark.sql.legacy.allowModifyActiveSession` to fall back to the old behavior if users do need to call these two APIs (see the sketch below)
* by default, calling these two APIs will throw an exception
* add two extra internal, private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs except for SQLContext.setActive and SQLContext.clearActive
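
A minimal sketch of the opt-out path, assuming a Spark session built locally; only the config name comes from this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Opt back into the old behavior; by default these two calls would throw after this PR.
  .config("spark.sql.legacy.allowModifyActiveSession", "true")
  .getOrCreate()

SparkSession.setActiveSession(spark)
SparkSession.clearActiveSession()
```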

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Denis Pyshev ba69d68d91 [SPARK-33080][BUILD] Replace fatal warnings snippet
### What changes were proposed in this pull request?

The current solution in the build file, which fails the build on compilation warnings while excluding deprecation warnings, is not portable beyond SBT 1.3.13 (the build import fails with a compilation error on SBT 1.4). It can be replaced with a more robust and maintainable approach, especially since Scala 2.13.2 ships similar built-in functionality.

Additionally, warnings were fixed to keep the build passing, with as few changes as possible:
warnings in the 2.12 compilation were fixed in code,
warnings in the 2.13 compilation are covered by configuration and will be addressed separately

### Why are the changes needed?

Unblocks the SBT upgrade beyond 1.3.13.
Enhances build file maintainability.
Allows fine-tuning of the warnings configuration in the scope of Scala 2.13 compilation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`build/sbt`'s `compile` and `Test/compile` for both Scala 2.12 and 2.13 profiles.

Closes #29995 from gemelen/feature/warnings-reporter.

Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-15 14:49:43 -05:00
Liang-Chi Hsieh 9e3746469c [SPARK-33078][SQL] Add config for json expression optimization
### What changes were proposed in this pull request?

This proposes to add a config for json expression optimization.

### Why are the changes needed?

For the new JSON expression optimization rules, it is safer if we can disable them using a SQL config.
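
A minimal sketch of toggling the rules off; the exact config key below is an assumption (it is not named in this description), the point is only that the optimization is guarded by a SQL conf:

```scala
// Config key assumed for illustration; not stated in this PR description.
spark.conf.set("spark.sql.optimizer.enableJsonExpressionOptimization", "false")
```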

### Does this PR introduce _any_ user-facing change?

Yes, users can disable the JSON expression optimization rules.

### How was this patch tested?

Unit test

Closes #30047 from viirya/SPARK-33078.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-10-15 12:38:10 -07:00
Jungtaek Lim (HeartSaVioR) 8e5cb1d276 [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved
### What changes were proposed in this pull request?

This PR proposes to fix a bug where `DataType.equalsIgnoreCompatibleNullability` was called with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The parameter order of `DataType.equalsIgnoreCompatibleNullability` is `from`, then `to`, which means the right order of the matched variables is `inAttr`, then `outAttr`.
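
An illustrative-only sketch of the `from`/`to` semantics (this is not Spark's internal implementation): writing non-nullable data into a nullable column is fine, the reverse is not, so swapping the arguments rejects valid writes.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField}

// Mirrors "ignore compatible nullability": the read side (to) may be nullable
// even when the write side (from) is not.
def nullabilityCompatible(from: StructField, to: StructField): Boolean =
  from.dataType == to.dataType && (to.nullable || !from.nullable)

val inAttr  = StructField("c", IntegerType, nullable = false) // incoming query output
val outAttr = StructField("c", IntegerType, nullable = true)  // target table attribute
nullabilityCompatible(from = inAttr, to = outAttr) // true: the intended argument order
nullabilityCompatible(from = outAttr, to = inAttr) // false: the swapped order rejected this valid write
```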

### Why are the changes needed?

Spark throws an AnalysisException due to an unresolved operator in v2 writes, and the operator is unresolved because of a bug where the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.

### Does this PR introduce _any_ user-facing change?

Yes, end users no longer hit the unresolved-operator error in v2 writes when they try to write a dataframe containing non-nullable complex types into a table whose matching complex types are nullable.

### How was this patch tested?

New UT added.

Closes #30033 from HeartSaVioR/SPARK-33136.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-14 08:30:03 -07:00
Max Gekk 05a62dcada [SPARK-33134][SQL] Return partial results only for root JSON objects
### What changes were proposed in this pull request?
In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.

### Why are the changes needed?
1. To not raise exceptions to users in PERMISSIVE mode
2. To fix a regression and to have the same behavior as Spark 2.4.x
3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for bad nested complex JSON fields.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the code below:
```scala
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._  // for toDF and the $"..." column syntax
    val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
    val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
    val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
    pokerhand_events.show
```
throws the exception even in the default **PERMISSIVE** mode:
```java
java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
```

After the changes:
```
+-----+
|event|
+-----+
| null|
+-----+
```

### How was this patch tested?
Added a test to `JsonFunctionsSuite`.

Closes #30031 from MaxGekk/json-skip-row-wrong-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-14 12:13:54 +09:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer expects null, whereas SparkSQL returns Double.NaN, which causes the wrong result.

Add an extra legacy config to fall back to the NaN logic, and return null by default to align with the TPCDS standard.

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce any user-facing change?
Yes. See sql-migration-guide

In Spark 3.1, statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, and `corr`, will return `NULL` instead of `Double.NaN` when a divide-by-zero occurs during expression evaluation, for example, when `stddev_samp` is applied to a single-element set. In Spark version 3.0 and earlier, they return `Double.NaN` in such cases. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
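
A minimal sketch of the behavior change, assuming a Spark 3.1 session `spark` and that the legacy flag can be toggled at runtime:

```scala
spark.range(1).selectExpr("stddev_samp(id)").show() // Spark 3.1 default: NULL for a single-element set
spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
spark.range(1).selectExpr("stddev_samp(id)").show() // legacy behavior restored: NaN
```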

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both the default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R test cases to follow the new default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
gengjiaan 2b7239edfb [SPARK-33125][SQL] Improve the error when Lead and Lag are not allowed to specify window frame
### What changes were proposed in this pull request?
Except for PostgreSQL, other data sources (for example: Vertica, Oracle, Redshift, MySQL, Presto) do not allow specifying a window frame for the Lead and Lag functions.

But the current error message is not clear enough.
`Window Frame $f must match the required frame`
This PR will use the following error message.
`Cannot specify window frame for lead function`
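
A hypothetical query (inline table and column names made up) that hits the improved message: specifying an explicit frame for `lead` is rejected.

```scala
spark.sql("""
  SELECT lead(v, 1) OVER (ORDER BY v ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
  FROM VALUES (1), (2), (3) AS t(v)
""").show()
// AnalysisException: Cannot specify window frame for lead function
```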

### Why are the changes needed?
Make the error message clearer.

### Does this PR introduce _any_ user-facing change?
Yes
Users will see the clearer error message.

### How was this patch tested?
Jenkins test.

Closes #30021 from beliefer/SPARK-33125.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:12:17 +00:00
Huaxin Gao af3e2f7d58 [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
### What changes were proposed in this pull request?
- Override the default SQL strings in the DB2 Dialect for:

  * ALTER TABLE UPDATE COLUMN TYPE
  * ALTER TABLE UPDATE COLUMN NULLABILITY

- Add new docker integration test suite jdbc/v2/DB2IntegrationSuite.scala

### Why are the changes needed?
In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it doesn't support some ALTER TABLE operations at the moment. This PR supports DB2-specific ALTER TABLE.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new integration test suite:

$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"

Closes #29972 from huaxingao/db2_docker.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 12:57:54 +00:00
Chao Sun feee8da14b [SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types
### What changes were proposed in this pull request?

In SPARK-24994 we implemented unwrapping cast for **integral types**. This extends it to support **numeric types** such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.

Unlike the cases of integral types, conversions between numeric types can result in rounding up or down. Consider the following case:

```sql
cast(e as double) < 1.9
```

Assume the type of `e` is short. Since 1.9 is not representable in that type, the cast will either truncate or round. Now suppose the literal is truncated; we cannot convert the expression to:

```sql
e < cast(1.9 as short)
```

as in the previous implementation, since if `e` is 1, the original expression evaluates to true, but the converted expression would evaluate to false.

To resolve the above, this PR first finds out whether casting from the wider type to the narrower type will result in truncation or rounding, by comparing a _roundtrip value_ derived from **converting the literal first to the narrower type, and then to the wider type**, versus the original literal value. For instance, in the above, we'll first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.

<img width="1153" alt="Screen Shot 2020-09-28 at 3 30 27 PM" src="https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png">

Now in the case of truncate, we'd convert the original expression to:
```sql
e <= cast(1.9 as short)
```
instead, so that the conversion also is valid when `e` is 1.

For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto which offers a very good explanation on how it works.
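
A standalone sketch of the roundtrip check described above (plain Scala, not Spark's code):

```scala
val lit = 1.9d
val narrowed: Short = lit.toShort          // (double) 1.9 -> (short) 1
val roundtrip: Double = narrowed.toDouble  // (short) 1 -> (double) 1.0
// roundtrip < lit, so the cast truncated; rewrite `cast(e as double) < 1.9`
// as `e <= cast(1.9 as short)` instead of `e < cast(1.9 as short)`.
```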

### Why are the changes needed?

For queries such as:
```sql
SELECT * FROM tbl WHERE short_col < 100.5
```
The predicate `short_col < 100.5` can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29792 from sunchao/SPARK-32858.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 12:44:20 +00:00
tanel.kiis@gmail.com 17eebd7209 [SPARK-32295][SQL] Add not null and size > 0 filters before inner explode/inline to benefit from predicate pushdown
### What changes were proposed in this pull request?

Add `And(IsNotNull(e), GreaterThan(Size(e), Literal(0)))` filter before Explode, PosExplode and Inline, when `outer = false`.
Removed unused `InferFiltersFromConstraints` from `operatorOptimizationRuleSet` to avoid confusion that happened during the review process.
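
A minimal sketch of the effect, assuming an active session `spark` (the data and column name are made up):

```scala
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

val df = Seq(Seq(1, 2), Seq.empty[Int], null).toDF("arr")
// With an inner (non-outer) explode, null or empty arrays cannot produce output rows,
// so the optimizer can now prepend a pushdown-friendly filter along the lines of
// Filter (isnotnull(arr) AND (size(arr) > 0)).
df.select(explode(col("arr"))).explain(true)
```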

### Why are the changes needed?

Predicate pushdown will be able to move this new filter down through joins and into data sources for performance improvement.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29092 from tanelk/SPARK-32295.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-13 20:11:04 +09:00
Takeshi Yamamuro a0e324460e [SPARK-32704][SQL][FOLLOWUP] Corrects version values of plan logging configs in SQLConf
### What changes were proposed in this pull request?

This PR intends to correct version values (`3.0.0` -> `3.1.0`) of three configs below in `SQLConf`:
 - spark.sql.planChangeLog.level
 - spark.sql.planChangeLog.rules
 - spark.sql.planChangeLog.batches

This PR comes from https://github.com/apache/spark/pull/29544#discussion_r503049350.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #30015 from maropu/pr29544-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-12 22:54:31 +09:00
Liang-Chi Hsieh 78c0967bbe [SPARK-33092][SQL] Support subexpression elimination in ProjectExec
### What changes were proposed in this pull request?

This patch proposes to add subexpression elimination support into `ProjectExec`. It can be controlled by `spark.sql.subexpressionElimination.enabled` config.

Before this change:

```scala
import org.apache.spark.sql.functions.{col, expr}
import org.apache.spark.sql.execution.debug._  // provides debugCodegen
val df = spark.read.option("header", true).csv("/tmp/test.csv")
df.withColumn("my_map", expr("str_to_map(foo, '&', '=')")).select(col("my_map")("foo"), col("my_map")("bar"), col("my_map")("baz")).debugCodegen
```

L27-40: first `str_to_map`.
L68-81: second `str_to_map`.
L109-122: third `str_to_map`.

```
/* 024 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 025 */     boolean project_isNull_0 = true;
/* 026 */     UTF8String project_value_0 = null;
/* 027 */     boolean project_isNull_1 = true;
/* 028 */     MapData project_value_1 = null;
/* 029 */
/* 030 */     if (!project_exprIsNull_0_0) {
/* 031 */       project_isNull_1 = false; // resultCode could change nullability.
/* 032 */
/* 033 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 034 */       for(UTF8String kvEntry: project_kvs_0) {
/* 035 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 036 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 037 */       }
/* 038 */       project_value_1 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 039 */
/* 040 */     }
/* 041 */     if (!project_isNull_1) {
/* 042 */       project_isNull_0 = false; // resultCode could change nullability.
/* 043 */
/* 044 */       final int project_length_0 = project_value_1.numElements();
/* 045 */       final ArrayData project_keys_0 = project_value_1.keyArray();
/* 046 */       final ArrayData project_values_0 = project_value_1.valueArray();
/* 047 */
/* 048 */       int project_index_0 = 0;
/* 049 */       boolean project_found_0 = false;
/* 050 */       while (project_index_0 < project_length_0 && !project_found_0) {
/* 051 */         final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0);
/* 052 */         if (project_key_0.equals(((UTF8String) references[3] /* literal */))) {
/* 053 */           project_found_0 = true;
/* 054 */         } else {
/* 055 */           project_index_0++;
/* 056 */         }
/* 057 */       }
/* 058 */
/* 059 */       if (!project_found_0 || project_values_0.isNullAt(project_index_0)) {
/* 060 */         project_isNull_0 = true;
/* 061 */       } else {
/* 062 */         project_value_0 = project_values_0.getUTF8String(project_index_0);
/* 063 */       }
/* 064 */
/* 065 */     }
/* 066 */     boolean project_isNull_6 = true;
/* 067 */     UTF8String project_value_6 = null;
/* 068 */     boolean project_isNull_7 = true;
/* 069 */     MapData project_value_7 = null;
/* 070 */
/* 071 */     if (!project_exprIsNull_0_0) {
/* 072 */       project_isNull_7 = false; // resultCode could change nullability.
/* 073 */
/* 074 */       UTF8String[] project_kvs_1 = project_expr_0_0.split(((UTF8String) references[5] /* literal */), -1);
/* 075 */       for(UTF8String kvEntry: project_kvs_1) {
/* 076 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[6] /* literal */), 2);
/* 077 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 078 */       }
/* 079 */       project_value_7 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).build();
/* 080 */
/* 081 */     }
/* 082 */     if (!project_isNull_7) {
/* 083 */       project_isNull_6 = false; // resultCode could change nullability.
/* 084 */
/* 085 */       final int project_length_1 = project_value_7.numElements();
/* 086 */       final ArrayData project_keys_1 = project_value_7.keyArray();
/* 087 */       final ArrayData project_values_1 = project_value_7.valueArray();
/* 088 */
/* 089 */       int project_index_1 = 0;
/* 090 */       boolean project_found_1 = false;
/* 091 */       while (project_index_1 < project_length_1 && !project_found_1) {
/* 092 */         final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1);
/* 093 */         if (project_key_1.equals(((UTF8String) references[7] /* literal */))) {
/* 094 */           project_found_1 = true;
/* 095 */         } else {
/* 096 */           project_index_1++;
/* 097 */         }
/* 098 */       }
/* 099 */
/* 100 */       if (!project_found_1 || project_values_1.isNullAt(project_index_1)) {
/* 101 */         project_isNull_6 = true;
/* 102 */       } else {
/* 103 */         project_value_6 = project_values_1.getUTF8String(project_index_1);
/* 104 */       }
/* 105 */
/* 106 */     }
/* 107 */     boolean project_isNull_12 = true;
/* 108 */     UTF8String project_value_12 = null;
/* 109 */     boolean project_isNull_13 = true;
/* 110 */     MapData project_value_13 = null;
/* 111 */
/* 112 */     if (!project_exprIsNull_0_0) {
/* 113 */       project_isNull_13 = false; // resultCode could change nullability.
/* 114 */
/* 115 */       UTF8String[] project_kvs_2 = project_expr_0_0.split(((UTF8String) references[9] /* literal */), -1);
/* 116 */       for(UTF8String kvEntry: project_kvs_2) {
/* 117 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[10] /* literal */), 2);
/* 118 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 119 */       }
/* 120 */       project_value_13 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).build();
/* 121 */
/* 122 */     }
...
```
After this change:

L27-40 evaluates the common map variable.

```
/* 024 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 025 */     // common sub-expressions
/* 026 */
/* 027 */     boolean project_isNull_0 = true;
/* 028 */     MapData project_value_0 = null;
/* 029 */
/* 030 */     if (!project_exprIsNull_0_0) {
/* 031 */       project_isNull_0 = false; // resultCode could change nullability.
/* 032 */
/* 033 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 034 */       for(UTF8String kvEntry: project_kvs_0) {
/* 035 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 036 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 037 */       }
/* 038 */       project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 039 */
/* 040 */     }
/* 041 */
/* 042 */     boolean project_isNull_4 = true;
/* 043 */     UTF8String project_value_4 = null;
/* 044 */
/* 045 */     if (!project_isNull_0) {
/* 046 */       project_isNull_4 = false; // resultCode could change nullability.
/* 047 */
/* 048 */       final int project_length_0 = project_value_0.numElements();
/* 049 */       final ArrayData project_keys_0 = project_value_0.keyArray();
/* 050 */       final ArrayData project_values_0 = project_value_0.valueArray();
/* 051 */
/* 052 */       int project_index_0 = 0;
/* 053 */       boolean project_found_0 = false;
/* 054 */       while (project_index_0 < project_length_0 && !project_found_0) {
/* 055 */         final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0);
/* 056 */         if (project_key_0.equals(((UTF8String) references[3] /* literal */))) {
/* 057 */           project_found_0 = true;
/* 058 */         } else {
/* 059 */           project_index_0++;
/* 060 */         }
/* 061 */       }
/* 062 */
/* 063 */       if (!project_found_0 || project_values_0.isNullAt(project_index_0)) {
/* 064 */         project_isNull_4 = true;
/* 065 */       } else {
/* 066 */         project_value_4 = project_values_0.getUTF8String(project_index_0);
/* 067 */       }
/* 068 */
/* 069 */     }
/* 070 */     boolean project_isNull_6 = true;
/* 071 */     UTF8String project_value_6 = null;
/* 072 */
/* 073 */     if (!project_isNull_0) {
/* 074 */       project_isNull_6 = false; // resultCode could change nullability.
/* 075 */
/* 076 */       final int project_length_1 = project_value_0.numElements();
/* 077 */       final ArrayData project_keys_1 = project_value_0.keyArray();
/* 078 */       final ArrayData project_values_1 = project_value_0.valueArray();
/* 079 */
/* 080 */       int project_index_1 = 0;
/* 081 */       boolean project_found_1 = false;
/* 082 */       while (project_index_1 < project_length_1 && !project_found_1) {
/* 083 */         final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1);
/* 084 */         if (project_key_1.equals(((UTF8String) references[4] /* literal */))) {
/* 085 */           project_found_1 = true;
/* 086 */         } else {
/* 087 */           project_index_1++;
/* 088 */         }
/* 089 */       }
/* 090 */
/* 091 */       if (!project_found_1 || project_values_1.isNullAt(project_index_1)) {
/* 092 */         project_isNull_6 = true;
/* 093 */       } else {
/* 094 */         project_value_6 = project_values_1.getUTF8String(project_index_1);
/* 095 */       }
/* 096 */
/* 097 */     }
/* 098 */     boolean project_isNull_8 = true;
/* 099 */     UTF8String project_value_8 = null;
/* 100 */
...
```

When the code is split into separated method:

```
/* 026 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 027 */     // common sub-expressions
/* 028 */
/* 029 */     MapData project_subExprValue_0 = project_subExpr_0(project_exprIsNull_0_0, project_expr_0_0);
/* 030 */
...
/* 140 */   private MapData project_subExpr_0(boolean project_exprIsNull_0_0, org.apache.spark.unsafe.types.UTF8String project_expr_0_0) {
/* 141 */     boolean project_isNull_0 = true;
/* 142 */     MapData project_value_0 = null;
/* 143 */
/* 144 */     if (!project_exprIsNull_0_0) {
/* 145 */       project_isNull_0 = false; // resultCode could change nullability.
/* 146 */
/* 147 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 148 */       for(UTF8String kvEntry: project_kvs_0) {
/* 149 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 150 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 151 */       }
/* 152 */       project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 153 */
/* 154 */     }
/* 155 */     project_subExprIsNull_0 = project_isNull_0;
/* 156 */     return project_value_0;
/* 157 */   }
```

### Why are the changes needed?

Users occasionally write repeated expressions in a projection. It is also possible that the query optimizer rewrites a query so that the same expression is evaluated many times in a Project. Currently, ProjectExec does not support subexpression elimination in whole-stage codegen. We can support it to reduce redundant evaluation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`spark.sql.subexpressionElimination.enabled` is enabled by default, so all existing tests should pass with this change.

Closes #29975 from viirya/SPARK-33092.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-12 16:54:21 +09:00
Gabor Somogyi 4af1ac9384 [SPARK-32047][SQL] Add JDBC connection provider disable possibility
### What changes were proposed in this pull request?
At the moment there is no way to turn off JDBC authentication providers that exist on the classpath. This can be problematic because service providers are loaded with the service loader. In this PR I've added the `spark.sql.sources.disabledJdbcConnProviderList` configuration (default: empty).
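
A minimal sketch; the provider names below are hypothetical placeholders, only the config key comes from this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Disable specific JDBC connection providers found on the classpath.
  .config("spark.sql.sources.disabledJdbcConnProviderList", "db2,mssql")
  .getOrCreate()
```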

### Why are the changes needed?
There is currently no way to turn off JDBC authentication providers.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new configuration option.

### How was this patch tested?
* Existing + newly added unit tests.
* Existing integration tests.

Closes #29964 from gaborgsomogyi/SPARK-32047.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-12 12:24:54 +09:00
Jungtaek Lim (HeartSaVioR) edb140eb5c [SPARK-32896][SS] Add DataStreamWriter.table API
### What changes were proposed in this pull request?

This PR proposes to add `DataStreamWriter.table` to specify the output "table" to write from the streaming query.

### Why are the changes needed?

For now, there's no way to write to a table (especially a catalog table) even if the table is capable of handling streaming writes. So even with Spark 3, writing to a catalog table via Structured Streaming has to go through `DataStreamWriter.format(provider)` and hope that the provider handles it the same way a catalog table would.

With the new API, we can directly point to a catalog table which supports streaming writes. Some usages are covered with tests; simply put, end users can do the following:

```scala
// assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog
spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo")

val query = inputDF
      .writeStream
      .table("testcat.ns.table1")
      .option(...)
      .start()
```

### Does this PR introduce _any_ user-facing change?

Yes, as this adds a new public API in DataStreamWriter. This doesn't introduce a backward-incompatible change.

### How was this patch tested?

New unit tests.

Closes #29767 from HeartSaVioR/SPARK-32896.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-09 03:01:54 -07:00
ulysses a9077299d7 [SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString
### What changes were proposed in this pull request?

Add distinct info at `UnresolvedFunction.toString`.

### Why are the changes needed?

Make `UnresolvedFunction` info complete.

```
create table test (c1 int, c2 int);
explain extended select sum(distinct c1) from test;

-- before this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum('c1), None)]
+- 'UnresolvedRelation [test]

-- after this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum(distinct 'c1), None)]
+- 'UnresolvedRelation [test]
```

### Does this PR introduce _any_ user-facing change?

Yes, the distinct info is now shown in the parsed logical plan.

### How was this patch tested?

manual test.

Closes #29586 from ulysses-you/SPARK-32743.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-09 09:25:22 +09:00
Max Gekk 7d6e3fb998 [SPARK-33074][SQL] Classify dialect exceptions in JDBC v2 Table Catalog
### What changes were proposed in this pull request?
1. Add a new method to the `JdbcDialect` class - `classifyException()`. It converts dialect-specific exceptions to Spark's `AnalysisException` or its sub-classes.
2. Replace the H2 exception `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` with `AnalysisException`.
3. Add `H2Dialect`

### Why are the changes needed?
Currently, the JDBC v2 Table Catalog implementation throws dialect-specific exceptions and ignores the exceptions defined in the `TableCatalog` interface. This PR adds a new method for converting dialect-specific exceptions, and assumes that follow-up PRs will implement `classifyException()`.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`.

Closes #29952 from MaxGekk/jdbcv2-classify-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-08 05:28:33 +00:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
### What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
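
A minimal sketch of the Scala API surface described above; `df` and the column name are placeholders:

```scala
import org.apache.spark.sql.functions.{assert_true, col, lit, raise_error}

// Fails the query with the custom message when the condition is false.
df.select(assert_true(col("amount") >= 0, lit("amount must be non-negative")))
// Unconditionally raises an error with the given message.
df.select(raise_error(lit("unexpected state")))
```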

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
Takeshi Yamamuro 94d648dff5 [SPARK-33036][SQL] Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner
### What changes were proposed in this pull request?

This PR intends to refactor code in `RewriteCorrelatedScalarSubquery` for replacing `ExprId`s in a bottom-up manner instead of doing in a top-down one.

This PR comes from the talk with cloud-fan in https://github.com/apache/spark/pull/29585#discussion_r490371252.

### Why are the changes needed?

To improve code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29913 from maropu/RefactorRewriteCorrelatedScalarSubquery.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-07 20:16:40 +09:00
Terry Kim 7e99fcd64e [SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE testcat.ns")
sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t

Describing columns is not supported for v2 tables.;
org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.;
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE spark_catalog.test")
sql("DESCRIBE t i").show // 't' is resolved to a temp view

+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|         i|
|data_type|       int|
|  comment|      NULL|
+---------+----------+
```

### Does this PR introduce _any_ user-facing change?

After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29880 from imback82/describe_column_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 06:33:20 +00:00
Liang-Chi Hsieh 57ed5a829b [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain
### What changes were proposed in this pull request?

This proposes to simplify named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`.

### Why are the changes needed?

Simplify complex expression trees that could be produced by query optimization or by users.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29942 from viirya/SPARK-33007.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-06 16:59:23 -07:00
Kousuke Saruta 3b2a38d735 [SPARK-32511][SQL][FOLLOWUP] Fix the broken build for Scala 2.13 with Maven
### What changes were proposed in this pull request?

This PR fixes the broken build for Scala 2.13 with Maven.
https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966

#29795 was merged even though it doesn't successfully finish the build for Scala 2.13.

### Why are the changes needed?

To fix the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package`

Closes #29954 from sarutak/hotfix-seq.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-06 09:40:16 -07:00
angerszhu ddc7012b3d [SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error
### What changes were proposed in this pull request?
When we create a UDAF using a class that extends `UserDefinedAggregateFunction` and then call the function in Hive support mode, `HiveSessionCatalog` will call super.makeFunctionExpression,

but it will catch errors such as the function needing 2 parameters while we only give 1, and the thrown exception only shows
```
No handler for UDF/UDAF/UDTF xxxxxxxx
```
This is confusing for developers; we should show the error thrown by the super method too.

For this PR's UT:
Before change, throw Exception like
```
No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
After this pr, throw exception
```
Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1;
Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```

### Why are the changes needed?
Show a more detailed error message when defining a UDAF.

### Does this PR introduce _any_ user-facing change?
People will see a more detailed error message when using Spark SQL's UDAF in Hive support mode.

### How was this patch tested?
Added UT

Closes #29054 from AngersZhuuuu/SPARK-32243.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 09:09:19 +00:00
fqaiser94@gmail.com 2793347972 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`).
2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

The documentation for the `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Columns directly.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:53:30 +00:00
Takeshi Yamamuro 4adc2822a3 [SPARK-33035][SQL] Updates the obsoleted entries of attribute mapping in QueryPlan#transformUpWithNewOutput
### What changes were proposed in this pull request?

This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this;
```
case class TestRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
    case p @ Project(projList, _) =>
      val newPlan = p.copy(projectList = projList.map { _.transform {
        // Assigns a new `ExprId` for references
        case a: AttributeReference => Alias(a, a.name)()
      }}.asInstanceOf[Seq[NamedExpression]])

      val attrMapping = p.output.zip(newPlan.output)
      newPlan -> attrMapping
  }
}
```
Then, this rule is applied into a plan below;
```
(3) Project [a#5, b#6]
+- (2) Project [a#5, b#6]
   +- (1) Project [a#5, b#6]
      +- LocalRelation <empty>, [a#5, b#6]
```
In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project`  first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule.

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `QueryPlanSuite`.

Closes #29911 from maropu/QueryPlanBug.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:32:55 +00:00
Yuming Wang 023eb482b2 [SPARK-32914][SQL] Avoid constructing dataType multiple times
### What changes were proposed in this pull request?

Some expressions' data types are not static values; a new object needs to be constructed each time the `dataType` method is called. E.g.: `CaseWhen`.
We should avoid constructing dataType multiple times because it may be used many times. E.g.: [`HyperLogLogPlusPlus.update`](10edeafc69/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (L122)).

### Why are the changes needed?

Improve query performance. for example:
```scala
spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").show
```

Profiling result:
```
-- Execution profile ---
Total samples       : 18365

Frame buffer usage  : 2.6688%

--- 58443254327 ns (31.82%), 5844 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int*, StarTask&)
  [ 1] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 2] GCTaskThread::run()
  [ 3] java_start(Thread*)
  [ 4] start_thread

--- 6140668667 ns (3.34%), 614 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::peek()
  [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  [ 2] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 3] GCTaskThread::run()
  [ 4] java_start(Thread*)
  [ 5] start_thread

--- 5679994036 ns (3.09%), 568 samples
  [ 0] scala.collection.generic.Growable.$plus$plus$eq
  [ 1] scala.collection.generic.Growable.$plus$plus$eq$
  [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1
  [ 5] scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply
  [ 6] scala.collection.immutable.List.foreach
  [ 7] scala.collection.generic.GenericTraversableTemplate.flatten
  [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$
  [ 9] scala.collection.AbstractTraversable.flatten
  [10] org.apache.spark.internal.config.ConfigEntry.readString
  [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom
  [12] org.apache.spark.sql.internal.SQLConf.getConf
  [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis
  [14] org.apache.spark.sql.types.DataType.sameType
  [15] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1
  [16] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted
  [17] org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply
  [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl
  [19] scala.collection.IndexedSeqOptimized.forall
  [20] scala.collection.IndexedSeqOptimized.forall$
  [21] scala.collection.mutable.ArrayBuffer.forall
  [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType
  [23] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck
  [24] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$
  [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck
  [26] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType
  [27] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$
  [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType
  [29] org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update
  [30] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2
  [31] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted
  [32] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply
  [33] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7
  [34] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted
  [35] org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test and benchmark test:

Benchmark code | Before this PR(Milliseconds) | After this PR(Milliseconds)
--- | --- | ---
spark.range(100000000L).selectExpr("approx_count_distinct(case   when id % 400 > 20 then id else 0 end)").collect() | 56462 | 3794

Closes #29790 from wangyum/SPARK-32914.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 22:00:42 +09:00
Yuning Zhang 0fb2574d4e [SPARK-33042][SQL][TEST] Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime
### What changes were proposed in this pull request?

Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime.

### Why are the changes needed?

Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156

However, this test case only checks that the value of the conf can be changed at runtime. It does not check that the updated value is actually used by the Optimizer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

unit test

Closes #29919 from yuningzh-db/add_optimizer_test.

Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 20:25:57 +09:00
Liang-Chi Hsieh 37c806af2b [SPARK-32958][SQL] Prune unnecessary columns from JsonToStructs
### What changes were proposed in this pull request?

This patch proposes to do column pruning for `JsonToStructs` expression if we only require some fields from it.

### Why are the changes needed?

`JsonToStructs` takes a schema parameter that tells `JacksonParser` which fields need to be parsed. If `JsonToStructs` is followed by `GetStructField`, we can prune the schema to parse only the required field.
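
A minimal sketch of the pattern this rule targets; `df`, the column, and the schema are placeholders. Only field `a` is accessed, so the parsing schema can be pruned to just `a`.

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("a", IntegerType).add("b", StringType)
// Only `a` is extracted from the parsed struct, so `b` never needs to be parsed.
df.select(from_json(col("json"), schema).getField("a"))
```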

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29900 from viirya/SPARK-32958.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-03 14:55:02 -07:00
Takeshi Yamamuro 82721ce00b [SPARK-32741][SQL][FOLLOWUP] Run plan integrity check only for effective plan changes
### What changes were proposed in this pull request?

(This is a followup PR of #29585) The PR modified `RuleExecutor#isPlanIntegral` code for checking if a plan has globally-unique attribute IDs, but this check made Jenkins maven test jobs much longer (See [the Dongjoon comment](https://github.com/apache/spark/pull/29585#issuecomment-702461314) and thanks, dongjoon-hyun !). To recover running time for the Jenkins tests, this PR intends to update the code to run plan integrity check only for effective plans.

### Why are the changes needed?

To recover running time for Jenkins tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29928 from maropu/PR29585-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 22:16:19 +09:00
Cheng Su d6f3138352 [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically
### What changes were proposed in this pull request?

This PR adds support for deciding bucketed table scans dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan we will use a bucketed table scan (all input files of a bucket are read by the same task). This has the drawback that if the bucketed table scan is not beneficial at all (no join/groupby/etc. in the query), we don't need to use it, as it would restrict the number of tasks to the number of buckets and might hurt parallelism.

The feature is to add a physical plan rule right after `EnsureRequirements`:

The rule goes through plan nodes. For all operators which has "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for operator has `Exchange` and bucketed table scan (and only allow certain operators in plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc`.), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, disabling bucketed table scan if there's operator with interesting partition along the sub-plan.

Why the algorithm works is that if there's a shuffle between the bucketed table scan and operator with interesting partition, then bucketed table scan partitioning will be destroyed by the shuffle operator in the middle, and we don't need bucketed table scan for sure.

The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .

### Why are the changes needed?

To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).

### Does this PR introduce _any_ user-facing change?

A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, which is set to false by default (the rule is disabled by default as it can regress cached bucketed table queries, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in or out by enabling or disabling the config; as we found in prod, some users rely on the assumption that the number of tasks equals the number of buckets when reading a bucketed table, to precisely control the number of tasks. This is a bad assumption but it does happen on our side, so we leave a config here to allow them to opt out of the feature.
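
A minimal sketch of opting in, assuming an active session `spark` and that the flag is a runtime SQL conf:

```scala
// Enable the new rule (it is disabled by default, per the note above).
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
```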

### How was this patch tested?

Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`

Closes #29804 from c21/bucket-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 09:01:15 +09:00
ulysses e62d24717e [SPARK-32585][SQL] Support scala enumeration in ScalaReflection
### What changes were proposed in this pull request?

Add code in `ScalaReflection` to support Scala enumerations and map the enumeration type to string type in Spark.

### Why are the changes needed?

We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.

Here is an example.

```
package test

object TestEnum extends Enumeration {
  type TestEnum = Value
  val E1, E2, E3 = Value
}
import TestEnum._
case class TestClass(i: Int, e: TestEnum)

import test._
Seq(TestClass(1, TestEnum.E1)).toDS
```

Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```

After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`

### Does this PR introduce _any_ user-facing change?

Yes, users can create a Dataset from a case class that includes a Scala enumeration field.

### How was this patch tested?

Add test.

Closes #29403 from ulysses-you/SPARK-32585.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-10-01 15:58:01 -04:00
yangjie01 0963fcd848 [SPARK-33024][SQL] Fix CodeGen fallback issue of UDFSuite in Scala 2.13
### What changes were proposed in this pull request?
After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` that `SharedSparkSessionBase` uses to construct the test `SparkSession`, the test `SPARK-32459: UDF should not fail on WrappedArray` in `sql.UDFSuite` exposed a codegen fallback issue in Scala 2.13, as follows:

```
- SPARK-32459: UDF should not fail on WrappedArray *** FAILED ***
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)"
```

The root cause is that `WrappedArray` is an alias for `mutable.ArraySeq` in Scala 2.13, whose `newBuilder` method has a different signature.

The main change of this PR is to add a Scala 2.13-only code path to handle the `WrappedArray` match case.
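
A rough sketch of the signature difference at play (plain Scala 2.13, for illustration only; the actual fix lives in the version-specific source tree):

```scala
// In Scala 2.13, WrappedArray is an alias for mutable.ArraySeq, and the
// companion's newBuilder requires a ClassTag, which is why generated code
// calling a zero-argument newBuilder fails to compile.
import scala.collection.mutable
import scala.reflect.ClassTag

def arraySeqBuilder[T: ClassTag]: mutable.Builder[T, mutable.ArraySeq[T]] =
  mutable.ArraySeq.newBuilder[T]
```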

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0
*** 1 TEST FAILED ***

```

**After**

```
Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```

Closes #29903 from LuciferYang/fix-udfsuite.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-01 08:37:07 -05:00
Takeshi Yamamuro 3a299aa648 [SPARK-32741][SQL] Check if the same ExprId refers to the unique attribute in logical plans
### What changes were proposed in this pull request?

Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume that the same `ExprId` refers to a unique attribute, but `RuleExecutor` does not check this integrity between logical plan transformations. So, this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.

This PR comes from the discussion with cloud-fan and viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278

### Why are the changes needed?

For better logical plan integrity checking.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29585 from maropu/PlanIntegrityTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-30 21:37:29 +09:00
Yuming Wang 711d8dd28a [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
### What changes were proposed in this pull request?

This PR fixes an estimated-statistics issue when a child has 0 bytes.

### Why are the changes needed?
The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This can generate an incorrect BroadcastJoin, resulting in driver OOM. For example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
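
A minimal sketch of the configuration combination under which the issue can appear (illustrative only; `spark` is assumed to be an active session):

```scala
// AQE and CBO enabled together, per the description above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.planStats.enabled", "true")
```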

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #29894 from wangyum/SPARK-33018.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 16:46:04 +00:00
Liang-Chi Hsieh 202115e7cd [SPARK-32948][SQL] Optimize to_json and from_json expression chain
### What changes were proposed in this pull request?

This patch proposes to optimize the from_json + to_json expression chain.
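
For illustration, a minimal sketch of the kind of chain this targets (spark-shell style; the names and the simplification condition are illustrative, see the rule for the exact cases it handles):

```scala
// A to_json/from_json round trip over a struct column; after this optimization
// such a chain can be simplified when the schemas match.
import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType)
val df = Seq(1).toDF("a").select(struct(col("a")).as("s"))
val chained = df.select(from_json(to_json(col("s")), schema).as("s"))
```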

### Why are the changes needed?

To optimize JSON expression chains that could be written manually or generated automatically during query optimization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29828 from viirya/SPARK-32948.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 22:22:47 -07:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
Max Gekk 68cd5677ae [SPARK-33015][SQL] Compute the current date only once
### What changes were proposed in this pull request?
Compute the current date in the specified time zone using the timestamp taken at the start of query evaluation.

### Why are the changes needed?
According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation, but currently it can be computed multiple times. As a consequence, the function can return different values if the query is executed across a date boundary.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.

Closes #29889 from MaxGekk/fix-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:13:01 +00:00
gengjiaan a53fc9b7ae [SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.

### Why are the changes needed?
Make the name of nth_value correct and show the ignoreNulls parameter correctly.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #29886 from beliefer/improve-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-29 09:54:43 +09:00
tanel.kiis@gmail.com f41ba2a2f3 [SPARK-32927][SQL] Bitwise OR, AND and XOR should have similar canonicalization rules to boolean OR and AND
### What changes were proposed in this pull request?

Add canonicalization rules for commutative bitwise operations.

### Why are the changes needed?

Canonical form is used in many other optimization rules. This reduces the number of cases where plans with identical results are considered distinct.
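
A small sketch of the intended effect (spark-shell style; the assertion is an illustration of what the new canonicalization enables, not a test from this PR):

```scala
// After this PR, commutative bitwise expressions that differ only in operand
// order should canonicalize to the same expression.
import org.apache.spark.sql.functions.col

val e1 = col("a").bitwiseOR(col("b")).expr.canonicalized
val e2 = col("b").bitwiseOR(col("a")).expr.canonicalized
assert(e1 == e2)  // semantic equality via the canonicalized forms
```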

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #29794 from tanelk/SPARK-32927.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-28 12:22:15 +09:00
Kris Mok 9a155d42a3 [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode
### What changes were proposed in this pull request?

Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.

### Why are the changes needed?

On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw a `java.lang.InternalError: Malformed class name` error.

Similar to https://github.com/apache/spark/pull/29050, we should use  Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.

### Does this PR introduce _any_ user-facing change?

Fixes a bug that threw an error when invoking `TreeNode.nodeName`; otherwise no changes.

### How was this patch tested?

Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.

Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
- JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.

Closes #29875 from rednaxelafx/spark-32999-getsimplename.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-26 16:03:59 -07:00
gatorsmile e887c639a7 [SPARK-32931][SQL] Unevaluable Expressions are not Foldable
### What changes were proposed in this pull request?
Unevaluable expressions are not foldable because we don't have an eval implementation for them. This PR cleans up the code and enforces this.

### Why are the changes needed?
Ensure that we will not hit the weird cases that trigger ConstantFolding.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The existing tests.

Closes #29798 from gatorsmile/refactorUneval.

Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 07:27:29 +00:00
Yuanjian Li 9e6882feca [SPARK-32885][SS] Add DataStreamReader.table API
### What changes were proposed in this pull request?
This PR aims to add a new `table` API to DataStreamReader, similar to the table API in DataFrameReader.

### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:

```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```

Application 2 for appending new data:

```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```

Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```

### Does this PR introduce _any_ user-facing change?
Yes, a new API is added.

### How was this patch tested?
New UT added and integration testing.

Closes #29756 from xuanyuanking/SPARK-32885.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 06:50:24 +00:00
Terry Kim e9c98c910a [SPARK-32990][SQL] Migrate REFRESH TABLE to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?

After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29866 from imback82/refresh_table_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 04:29:09 +00:00
Michael Munday 383bb4af00 [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms
MurmurHash3 and xxHash64 interpret sequences of bytes as integers
encoded in little-endian byte order. This requires a byte reversal
on big endian platforms.

I've left the hashInt and hashLong functions as-is for now. My
interpretation of these functions is that they perform the hash on
the integer value as if it were serialized in little-endian byte
order. Therefore no byte reversal is necessary.
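
A small sketch of the general idea (plain JVM code; the actual change works on Spark's internal memory utilities, so this is illustrative only):

```scala
// Reading 4 bytes as a little-endian int yields the same value regardless of
// the platform's native byte order, because the order is stated explicitly.
import java.nio.{ByteBuffer, ByteOrder}

def readLittleEndianInt(bytes: Array[Byte], offset: Int): Int =
  ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
```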

### What changes were proposed in this pull request?
Modify hash functions to produce correct results on big-endian platforms.

### Why are the changes needed?
Hash functions produce incorrect results on big-endian platforms which, amongst other potential issues, causes test failures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests run on the IBM Z (s390x) platform which uses a big-endian byte order.

Closes #29762 from mundaym/fix-hashes.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-23 12:36:46 -05:00
tanel.kiis@gmail.com acfee3c8b1 [SPARK-32870][DOCS][SQL] Make sure that all expressions have their ExpressionDescription filled
### What changes were proposed in this pull request?

Made sure that all the expressions in the `FunctionRegistry` have the fields `usage`, `examples` and `since` filled in their `ExpressionDescription`. Added a UT to `ExpressionInfoSuite` to make sure that all new expressions will also fill those fields.

### Why are the changes needed?

Documentation improvement

### Does this PR introduce _any_ user-facing change?

Better generated SQL built in functions documentation

### How was this patch tested?

Checked the fix versions in the following JIRAs:
SPARK-1251 - UnaryMinus, Add, Subtract, Multiply, Divide, Remainder, Explode, Not, In, And, Or, Equals, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual, If, Cast
SPARK-2053 - CaseWhen
SPARK-2665 - EqualNullSafe
SPARK-3176 - Abs
SPARK-6542 - CreateStruct
SPARK-7135 - MonotonicallyIncreasingID
SPARK-7152 - SparkPartitionID
SPARK-7295 - bitwiseAND, bitwiseOR, bitwiseXOR, bitwiseNOT
SPARK-8005 - InputFileName
SPARK-8203 - Greatest
SPARK-8204 - Least
SPARK-8220 - UnaryPositive
SPARK-8221 - Pmod
SPARK-8230 - Size
SPARK-8231 - ArrayContains
SPARK-8232 - SortArray
SPARK-8234 - md5
SPARK-8235 - sha1
SPARK-8236 - crc32
SPARK-8237 - sha2
SPARK-8240 - Concat
SPARK-8246 - GetJsonObject
SPARK-8407 - CreateNamedStruct
SPARK-9617 - JsonTuple
SPARK-10810 - CurrentDatabase
SPARK-12480 - Murmur3Hash
SPARK-14061 - CreateMap
SPARK-14160 - TimeWindow
SPARK-14580 - AssertTrue
SPARK-16274 - XPathBoolean
SPARK-16278 - MapKeys
SPARK-16279 - MapValues
SPARK-16284 - CallMethodViaReflection
SPARK-16286 - Stack
SPARK-16288 - Inline
SPARK-16289 - PosExplode
SPARK-16318 - XPathShort, XPathInt, XPathLong, XPathFloat, XPathDouble, XPathString, XPathList
SPARK-16730 - Cast aliases
SPARK-17495 - HiveHash
SPARK-18702 - InputFileBlockStart, InputFileBlockLength
SPARK-20910 - UUID

Closes #29743 from tanelk/SPARK-32870.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-23 10:18:38 +09:00
Max Gekk b53da23a28 [MINOR][SQL] Improve examples for percentile_approx()
### What changes were proposed in this pull request?
In the PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, with an example that has **multiple values** in the input column.
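
For reference, a sketch of a multi-value query in the spirit of this change (not necessarily the exact example added to the expression description):

```scala
// With several input rows, the approximate median returned is one of the
// actual input values, which a single-value example cannot demonstrate.
spark.sql(
  "SELECT percentile_approx(col, 0.5) FROM VALUES (0), (1), (2), (10) AS tab(col)"
).show()
```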

### Why are the changes needed?
The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`

Closes #29841 from MaxGekk/example-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:41:38 +09:00
Max Gekk 7c14f177eb [SPARK-32306][SQL][DOCS] Clarify the result of percentile_approx()
### What changes were proposed in this pull request?
A more precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.

### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #29835 from MaxGekk/doc-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-22 12:45:19 -07:00
Peter Toth f03c03576a [SPARK-32951][SQL] Foldable propagation from Aggregate
### What changes were proposed in this pull request?
This PR adds foldable propagation from `Aggregate` as per: https://github.com/apache/spark/pull/29771#discussion_r490412031

### Why are the changes needed?
This is an improvement as `Aggregate`'s `aggregateExpressions` can contain foldables that can be propagated up.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT.

Closes #29816 from peter-toth/SPARK-32951-foldable-propagation-from-aggregate.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-21 21:43:17 -07:00
angerszhu c336ddfdb8 [SPARK-32867][SQL] When explain, HiveTableRelation show limited message
### What changes were proposed in this pull request?
Currently, when explaining a SQL plan with a HiveTableRelation, so much information about the HiveTableRelation's pruned partitions is shown that the plan becomes hard to read; this PR makes the information simpler.

Before:
![image](https://user-images.githubusercontent.com/46485123/93012078-aeeca080-f5cf-11ea-9286-f5c15eadbee3.png)

For UT
```
 test("Make HiveTableScanExec message simple") {
  withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
      withTable("df") {
        spark.range(30)
          .select(col("id"), col("id").as("k"))
          .write
          .partitionBy("k")
          .format("hive")
          .mode("overwrite")
          .saveAsTable("df")

        val df = sql("SELECT df.id, df.k FROM df WHERE df.k < 2")
        df.explain(true)
      }
    }
  }
```

After this PR, the explain output will show:
```
== Parsed Logical Plan ==
'Project ['df.id, 'df.k]
+- 'Filter ('df.k < 2)
   +- 'UnresolvedRelation [df], []

== Analyzed Logical Plan ==
id: bigint, k: bigint
Project [id#11L, k#12L]
+- Filter (k#12L < cast(2 as bigint))
   +- SubqueryAlias spark_catalog.default.df
      +- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L]]

== Optimized Logical Plan ==
Filter (isnotnull(k#12L) AND (k#12L < 2))
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]]

== Physical Plan ==
Scan hive default.df [id#11L, k#12L], HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]], [isnotnull(k#12L), (k#12L < 2)]

```

In this PR, I customize `HiveTableRelation`'s `simpleString` method to avoid showing too much unnecessary info in the explain plan. Compared to what we had before, I drop the detailed metadata of each partition and only retain the partition spec to show which partitions were pruned. For detailed information, we never look at the plan anyway but use the DESC EXTENDED statement.

### Why are the changes needed?
Make plans that contain a HiveTableRelation more readable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #29739 from AngersZhuuuu/HiveTableScan-meta-location-info.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-21 09:15:12 +00:00
Peter Toth 3309a2be07 [SPARK-32635][SQL][FOLLOW-UP] Add a new test case in catalyst module
### What changes were proposed in this pull request?
This is a follow-up PR to https://github.com/apache/spark/pull/29771 and just adds a new test case.

### Why are the changes needed?
To have better test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT.

Closes #29802 from peter-toth/SPARK-32635-fix-foldable-propagation-followup.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-18 13:56:19 -07:00
yangjie01 2128c4f14b [SPARK-32808][SQL] Pass all test of sql/core module in Scala 2.13
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/29660 and https://github.com/apache/spark/pull/29689 there are 13 remaining failed cases in the sql/core module with Scala 2.13.

The reason for the remaining failed cases is that the optimization result of `CostBasedJoinReorder` may differ for the same input between Scala 2.12 and Scala 2.13 if there is more than one candidate plan with the same cost.

This PR gives a way to make the optimization result as deterministic as possible, to pass all remaining failed cases of the `sql/core` module in Scala 2.13. The main changes of this PR are as follows:

- Change to use `LinkedHashMap` instead of `Map` to store `foundPlans` in the `JoinReorderDP.search` method, to ensure the same iteration order for the same insertion order, because the iteration order of `Map` behaves differently under Scala 2.12 and 2.13 (see the sketch after this list)

- Fixed `StarJoinCostBasedReorderSuite` affected by the above change

- Regenerate golden files affected by the above change.
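
A minimal sketch of why `LinkedHashMap` helps here (plain Scala, independent of Spark; the map contents are made up):

```scala
// LinkedHashMap iterates in insertion order on both Scala 2.12 and 2.13,
// whereas the iteration order of a generic Map is an implementation detail
// that differs between the two versions.
import scala.collection.mutable

val foundPlans = mutable.LinkedHashMap("planB" -> 2, "planA" -> 1, "planC" -> 3)
assert(foundPlans.keys.toSeq == Seq("planB", "planA", "planC"))
```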

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8485, failed 13, canceled 1, ignored 52, pending 0
*** 13 TESTS FAILED ***

```

**After**

```
Tests: succeeded 8498, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```

Closes #29711 from LuciferYang/SPARK-32808-3.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-18 10:38:30 -05:00
gengjiaan 8b09536cdf [SPARK-27951][SQL] Support ANSI SQL NTH_VALUE window function
### What changes were proposed in this pull request?
The `NTH_VALUE` function is an ANSI SQL window function.
For example:
```
CREATE TEMPORARY TABLE empsalary (
    depname varchar,
    empno bigint,
    salary int,
    enroll_date date
);

INSERT INTO empsalary VALUES
('develop', 10, 5200, '2007-08-01'),
('sales', 1, 5000, '2006-10-01'),
('personnel', 5, 3500, '2007-12-10'),
('sales', 4, 4800, '2007-08-08'),
('personnel', 2, 3900, '2006-12-23'),
('develop', 7, 4200, '2008-01-01'),
('develop', 9, 4500, '2008-01-01'),
('sales', 3, 4800, '2007-08-01'),
('develop', 8, 6000, '2006-10-01'),
('develop', 11, 5200, '2007-08-15');

select first_value(salary) over(order by salary range between 1000 preceding and 1000 following),
	lead(salary) over(order by salary range between 1000 preceding and 1000 following),
	nth_value(salary, 1) over(order by salary range between 1000 preceding and 1000 following),
	salary from empsalary;
 first_value | lead | nth_value | salary
-------------+------+-----------+--------
        3500 | 3900 |      3500 |   3500
        3500 | 4200 |      3500 |   3900
        3500 | 4500 |      3500 |   4200
        3500 | 4800 |      3500 |   4500
        3900 | 4800 |      3900 |   4800
        3900 | 5000 |      3900 |   4800
        4200 | 5200 |      4200 |   5000
        4200 | 5200 |      4200 |   5200
        4200 | 6000 |      4200 |   5200
        5000 |      |      5000 |   6000
(10 rows)
```

Several mainstream databases support this syntax:

**PostgreSQL:**
https://www.postgresql.org/docs/8.4/functions-window.html

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/NTH_VALUEAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_____23

**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**MySQL**
https://www.mysqltutorial.org/mysql-window-functions/mysql-nth_value-function/
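
For reference, a minimal sketch of using the function from Spark once this change lands (reusing the `empsalary` table from the example above; the query itself is only illustrative):

```scala
// nth_value as a window function in Spark SQL.
spark.sql("""
  SELECT depname, salary,
         nth_value(salary, 2) OVER (PARTITION BY depname ORDER BY salary) AS second_lowest
  FROM empsalary
""").show()
```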

### Why are the changes needed?
The `NTH_VALUE` function is part of ANSI SQL and is very useful.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and new UTs.

Closes #29604 from beliefer/support-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-18 07:06:38 +00:00
Takeshi Yamamuro b49aaa33e1 [SPARK-32906][SQL] Struct field names should not change after normalizing floats
### What changes were proposed in this pull request?

This PR intends to fix a minor bug when normalizing floats for struct types;
```
scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
scala> val agg = df.distinct()
scala> agg.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[k#40], functions=[])
+- Exchange hashpartitioning(k#40, 200), true, [id=#62]
   +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
      +- *(1) LocalTableScan [k#40]

scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head }
scala> aggOutput.foreach { attr => println(attr.prettyJson) }
### Final Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "_1",
                ^^^
      "type" : "double",
      "nullable" : false,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]

### Partial Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "col1",
                ^^^^
      "type" : "double",
      "nullable" : true,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]
```

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #29780 from maropu/FixBugInNormalizedFloatingNumbers.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-17 22:07:47 -07:00
Max Gekk 75dd86400c [SPARK-32908][SQL] Fix target error calculation in percentile_approx()
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error is `e = max(g_i, delta_i) / 2` (see page 59). There is also a clear explanation at [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes #29784 from MaxGekk/fix-percentile_approx-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-18 10:47:06 +09:00
Peter Toth 4ced58862c [SPARK-32635][SQL] Fix foldable propagation
### What changes were proposed in this pull request?
This PR rewrites `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.

Before this PR, in the case of this example (with `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation` set):
```scala
val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
val aub = a.union(b)
val c = aub.filter($"col1" === "2").cache()
val d = Seq("2").toDF( "col4")
val r = d.join(aub, $"col2" === $"col4").select("col4")
val l = c.select("col2")
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
df.show()
```
foldable propagation happens incorrectly:
```
 Join LeftOuter, (col2#6 = col4#34)                                                              Join LeftOuter, (col2#6 = col4#34)
!:- Project [col2#6]                                                                             :- Project [1 AS col2#6]
 :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)   :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
 :        +- Union                                                                               :        +- Union
 :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]                                    :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
 :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))                            :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
 :           :     +- *(1) LocalTableScan [value#1]                                              :           :     +- *(1) LocalTableScan [value#1]
 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]                                 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
 :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))                          :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
 :                 +- *(2) LocalTableScan [value#10]                                             :                 +- *(2) LocalTableScan [value#10]
 +- Project [col4#34]                                                                            +- Project [col4#34]
    +- Join Inner, (col2#6 = col4#34)                                                               +- Join Inner, (col2#6 = col4#34)
       :- Project [value#31 AS col4#34]                                                                :- Project [value#31 AS col4#34]
       :  +- LocalRelation [value#31]                                                                  :  +- LocalRelation [value#31]
       +- Project [col2#6]                                                                             +- Project [col2#6]
          +- Union false, false                                                                           +- Union false, false
             :- Project [1 AS col2#6]                                                                        :- Project [1 AS col2#6]
             :  +- LocalRelation [value#1]                                                                   :  +- LocalRelation [value#1]
             +- Project [2 AS col2#15]                                                                       +- Project [2 AS col2#15]
                +- LocalRelation [value#10]                                                                     +- LocalRelation [value#10]

```
and so the result is wrong:
```
+----+----+
|col2|col4|
+----+----+
|   1|null|
+----+----+
```

After this PR foldable propagation will not happen incorrectly and the result is correct:
```
+----+----+
|col2|col4|
+----+----+
|   2|   2|
+----+----+
```

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Existing and new UTs.

Closes #29771 from peter-toth/SPARK-32635-fix-foldable-propagation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-18 08:17:23 +09:00
Chao Sun 482a79a5e3 [SPARK-24994][SQL][FOLLOW-UP] Handle foldable, timezone and cleanup
### What changes were proposed in this pull request?

This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed by [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646749)
- skip optimization when `fromExp` is foldable (per [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646973)), as there could be a more efficient rule to apply for this case.
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements

Originally I plan to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858) but now think it's better to isolate these changes from that.

### Why are the changes needed?

To fix a few left over issues in the above PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test for the foldable case. Otherwise relying on existing tests.

Closes #29775 from sunchao/SPARK-24994-followup.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 07:50:39 -07:00
Liang-Chi Hsieh bd38e0be83 [SPARK-32903][SQL] GeneratePredicate should be able to eliminate common sub-expressions
### What changes were proposed in this pull request?

This patch proposes to make GeneratePredicate eliminate common sub-expressions.

### Why are the changes needed?

Codegen objects such as GenerateMutableProjection and GenerateUnsafeProjection can eliminate common sub-expressions, but GeneratePredicate currently doesn't.

We encountered a customer issue where a Filter pushed down through a Project caused a performance problem compared with the non-pushed-down case. The issue is that one expression used in the Filter predicates is evaluated many times. Due to the complex schema, the query nodes are not whole-stage codegen'd, so Filter.doExecute runs and calls GeneratePredicate. The common expression was evaluated many times and became the performance bottleneck. GeneratePredicate should be able to eliminate common sub-expressions in such cases.
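
A hypothetical sketch of the problematic shape (UDF and column names are made up):

```scala
// The same expensive expression appears twice in a single Filter predicate;
// without subexpression elimination in GeneratePredicate (the interpreted,
// non-whole-stage-codegen path), it is evaluated twice per row.
import org.apache.spark.sql.functions.{col, udf}

val expensiveUdf = udf((s: String) => s.length)  // stand-in for a costly function
val predicate = expensiveUdf(col("payload")) > 10 && expensiveUdf(col("payload")) < 100
// df.filter(predicate)
```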

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29776 from viirya/filter-pushdown.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-17 05:39:40 +00:00
Linhong Liu 40ef5c91ad [SPARK-32816][SQL] Fix analyzer bug when aggregating multiple distinct DECIMAL columns
### What changes were proposed in this pull request?
This PR fixes a conflict between `RewriteDistinctAggregates` and `DecimalAggregates`.
In some cases, `DecimalAggregates` will wrap the decimal column in `UnscaledValue` using
different rules for different aggregates.

This means that the same distinct column with different aggregates can turn into different distinct columns
after `DecimalAggregates`. For example:
`avg(distinct decimal_col), sum(distinct decimal_col)` may change to
`avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)`

We assume that after `RewriteDistinctAggregates` there will be at most one distinct column in aggregates,
but `DecimalAggregates` breaks this assumption. To fix this, we have to switch the order of these two
rules.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
added test cases

Closes #29673 from linhongliu-db/SPARK-32816.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-16 16:53:25 +00:00
Yuming Wang 3bc13e6412 [SPARK-32706][SQL] Improve cast string to decimal type
### What changes were proposed in this pull request?

This PR makes casting string type to decimal type fail fast if the precision is larger than 38.

### Why are the changes needed?

It is very slow if the precision is very large.

Benchmark and benchmark result:
```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.types.Decimal
val bd1 = new java.math.BigDecimal("6.0790316E+25569151")
val bd2 = new java.math.BigDecimal("6.0790316E+25");

val benchmark = new Benchmark("Benchmark string to decimal", 1, minNumIters = 2)
benchmark.addCase(bd1.toString) { _ =>
  println(Decimal(bd1).precision)
}
benchmark.addCase(bd2.toString) { _ =>
  println(Decimal(bd2).precision)
}
benchmark.run()
```
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.6
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Benchmark string to decimal:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
6.0790316E+25569151                                9340           9381          57          0.0  9340094625.0       1.0X
6.0790316E+25                                         0              0           0          0.5        2150.0 4344230.1X
```
Stacktrace:
![image](https://user-images.githubusercontent.com/5399861/92941705-4c868980-f483-11ea-8a15-b93acde8c0f4.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test:
Dataset | Before this pr (Seconds) | After this pr (Seconds)
-- | -- | --
https://issues.apache.org/jira/secure/attachment/13011406/part-00000.parquet | 2640 | 2

Closes #29731 from wangyum/SPARK-32706.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-16 14:08:59 +00:00
Tanel Kiis 6051755bfe [SPARK-32688][SQL][TEST] Add special values to LiteralGenerator for float and double
### What changes were proposed in this pull request?

The `LiteralGenerator` for float and double datatypes was supposed to yield special values (NaN, +-inf) among others, but the `Gen.chooseNum` method does not yield values that are outside the defined range. Also, `Gen.chooseNum` for a wide range of floats and doubles does not yield values in the "everyday" range, as stated in https://github.com/typelevel/scalacheck/issues/113.

There is a similar class `RandomDataGenerator` that is used in some other tests. Added `-0.0` and `-0.0f` as special values there too.

These changes revealed an inconsistency with the equality check between `-0.0` and `0.0`.

### Why are the changes needed?

The `LiteralGenerator` is mostly used in the `checkConsistencyBetweenInterpretedAndCodegen` method in `MathExpressionsSuite`. This change would have caught the bug fixed in #29495 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally reverted #29495 and verified that the existing test cases caught the bug.

Closes #29515 from tanelk/SPARK-32688.

Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-16 12:13:15 +09:00
ulysses 888b343587 [SPARK-32827][SQL] Add spark.sql.maxMetadataStringLength config
### What changes were proposed in this pull request?

Add a new config `spark.sql.maxMetadataStringLength`. This config aims to limit metadata value length, e.g. file location.
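
A minimal usage sketch (the value here is an arbitrary illustration):

```scala
// Raise the limit so that long metadata values such as file locations are not
// abbreviated with "..." in EXPLAIN output.
spark.conf.set("spark.sql.maxMetadataStringLength", 1000L)
```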

### Why are the changes needed?

Some metadata values were abbreviated with `...` when I tried to add some tests in `SQLQueryTestSuite`. We need to replace such values with `notIncludedMsg`, but that means we can't replace values like the location by `className`, since the `className` has been abbreviated.

Here is a case:
```
CREATE table  explain_temp1 (key int, val int) USING PARQUET;

EXPLAIN EXTENDED SELECT sum(distinct val) FROM explain_temp1;

-- ignore parsed,analyzed,optimized
-- The output like
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(distinct cast(val#x as bigint)#xL)], output=[sum(DISTINCT val)#xL])
+- Exchange SinglePartition, true, [id=#x]
   +- *HashAggregate(keys=[], functions=[partial_sum(distinct cast(val#x as bigint)#xL)], output=[sum#xL])
      +- *HashAggregate(keys=[cast(val#x as bigint)#xL], functions=[], output=[cast(val#x as bigint)#xL])
         +- Exchange hashpartitioning(cast(val#x as bigint)#xL, 4), true, [id=#x]
            +- *HashAggregate(keys=[cast(val#x as bigint) AS cast(val#x as bigint)#xL], functions=[], output=[cast(val#x as bigint)#xL])
               +- *ColumnarToRow
                  +- FileScan parquet default.explain_temp1[val#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/runner/work/spark/spark/sql/core/spark-warehouse/org.apache.spark.sq...], PartitionFilters: ...
```

### Does this PR introduce _any_ user-facing change?

No, a new config.

### How was this patch tested?

new test.

Closes #29688 from ulysses-you/SPARK-32827.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-15 14:11:30 +00:00
tanel.kiis@gmail.com 7a17158a4d [SPARK-32868][SQL] Add more order irrelevant aggregates to EliminateSorts
### What changes were proposed in this pull request?

Mark `BitAggregate` as order irrelevant in `EliminateSorts`.

### Why are the changes needed?

Performance improvements in some queries

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Generalized an existing UT

Closes #29740 from tanelk/SPARK-32868.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-14 22:52:33 +09:00
Yuanjian Li 5e825482d7 [SPARK-32844][SQL] Make DataFrameReader.table take the specified options for datasource v1
### What changes were proposed in this pull request?
Make `DataFrameReader.table` take the specified options for datasource v1.
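
For reference, a hedged usage sketch (the option name is just an example of a data source option, not taken from the PR):

```scala
// After this change, options specified on the reader are also passed to the
// underlying v1 data source when reading via .table().
spark.read.option("mergeSchema", "true").table("my_parquet_table")
```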

### Why are the changes needed?
Keep the behavior of v1/v2 data sources consistent; the v2 fix was done in SPARK-32592.

### Does this PR introduce _any_ user-facing change?
Yes. DataFrameReader.table will take the specified options. Also, if the same key and value exist in both the specified options and the table properties, an exception will be thrown.

### How was this patch tested?
New UT added.

Closes #29712 from xuanyuanking/SPARK-32844.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-14 09:20:24 +00:00
Chao Sun 3d08084022 [SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify literal types
### What changes were proposed in this pull request?

Currently, in cases like the following:
```sql
SELECT * FROM t WHERE age < 40
```
where `age` is of short type, Spark won't be able to simplify this and can only generate filter `cast(age, int) < 40`. This won't get pushed down to datasources and therefore is not optimized.

This PR proposes an optimizer rule to improve this when the following constraints are satisfied:
 - the input expression is a binary comparison where one side is a cast operation and the other is a literal.
 - both the cast child expression and literal are of integral type (i.e., byte, short, int or long)

When this is true, it tries to do several optimizations to either simplify the expression or move the cast to the literal side, so the
resulting filter for the above case becomes `age < cast(40 as smallint)`. This is better since the cast can be optimized away later and the filter can be pushed down to data sources.

This PR follows a similar effort in Presto (https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html). Here we only handles integral types but plan to extend to other types as follow-ups.

### Why are the changes needed?

As mentioned in the previous section, when the cast is not optimized, the filter cannot be pushed down to data sources, which can lead
to unnecessary IO and therefore longer job time and wasted resources. This helps to improve that.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests for both the optimizer rule and filter pushdown on datasource level for both Orc and Parquet.

Closes #29565 from sunchao/SPARK-24994.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-12 21:34:35 -07:00
Karol Chmist 3be552ccc8 [SPARK-30090][SHELL] Adapt Spark REPL to Scala 2.13
### What changes were proposed in this pull request?

This is an attempt to adapt Spark REPL to Scala 2.13.

It is based on a [scala-2.13 branch](https://github.com/smarter/spark/tree/scala-2.13) made by smarter.

I had to set Scala version to 2.13 in some places, and to adapt some other modules, before I could start working on the REPL itself. These are separate commits on the branch that probably would be fixed beforehand, and thus dropped before the merge of this PR.

I couldn't find a way to run the initialization code with existing REPL classes in Scala 2.13.2, so I [modified REPL in Scala](e9cc0dd547) to make it work. With this modification I managed to run Spark Shell, along with the unit tests passing, which is good news.

The bad news is that it requires an upstream change in Scala, which must be accepted first. I'd be happy to change it if someone points a way to do it differently. If not, I'd propose a PR in Scala to introduce `ILoop.internalReplAutorunCode`.

### Why are the changes needed?

REPL in Scala changed quite a lot, so current version of Spark REPL needed to be adapted.

### Does this PR introduce _any_ user-facing change?

In the previous version of `SparkILoop`, a lot of Scala's `ILoop` code was [overridden and duplicated](2bc7b75537) to make the welcome message a bit more pleasant. In this PR, the message is in a bit different order, but it's still acceptable IMHO.

Before this PR:
```
20/05/15 15:32:39 WARN Utils: Your hostname, hermes resolves to a loopback address: 127.0.1.1; using 192.168.1.28 instead (on interface enp0s31f6)
20/05/15 15:32:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/05/15 15:32:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/05/15 15:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://192.168.1.28:4041
Spark context available as 'sc' (master = local[*], app id = local-1589549565502).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

With this PR:
```
20/05/15 15:32:15 WARN Utils: Your hostname, hermes resolves to a loopback address: 127.0.1.1; using 192.168.1.28 instead (on interface enp0s31f6)
20/05/15 15:32:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/05/15 15:32:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.2-20200422-211118-706ef1b (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context Web UI available at http://192.168.1.28:4040
Spark context available as 'sc' (master = local[*], app id = local-1589549541259).
Spark session available as 'spark'.

scala>
```

It seems that currently the welcoming message is still an improvement from [the original ticket](https://issues.apache.org/jira/browse/SPARK-24785), albeit in a different order. As a bonus, some fragile code duplication was removed.

### How was this patch tested?

Existing tests pass in the `repl` module. The REPL runs in a terminal and the following code executed correctly:

```
scala> spark.range(1000 * 1000 * 1000).count()
val res0: Long = 1000000000
```

Closes #28545 from karolchmist/scala-2.13-repl.

Authored-by: Karol Chmist <info+github@chmist.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-12 18:15:15 -05:00
yangjiang fe2ab255d1 [MINOR][SQL] Fix a typo at 'spark.sql.sources.fileCompressionFactor' error message in SQLConf
### What changes were proposed in this pull request?

Fix a typo in SQLConf.

### Why are the changes needed?

typo fix to increase readability

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

no test

Closes #29668 from Ted-Jiang/fix_annotate.

Authored-by: yangjiang <yangjiang@ebay.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-11 08:05:34 -05:00
gengjiaan a22871f50a [SPARK-32777][SQL] Aggregation support aggregate function with multiple foldable expressions
### What changes were proposed in this pull request?
Spark SQL has a bug, shown below:

```
spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 2, 3)|
+-----------------+--------------------+
|                1|                   1|
+-----------------+--------------------+

spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 3, 2)|
+-----------------+--------------------+
|                1|                   0|
+-----------------+--------------------+
```
The first query is correct, but the second query is not.
The root cause is that the second query is rewritten by `RewriteDistinctAggregates`, which expands the output but loses the 2.

### Why are the changes needed?
Fix a bug.
`SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)` should return `1, 1`

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
New UT

Closes #29626 from beliefer/support-multiple-foldable-distinct-expressions.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 11:25:32 +00:00
Kent Yao 5669b212ec [SPARK-32840][SQL] Invalid interval value can happen to be just adhesive with the unit
### What changes were proposed in this pull request?
In this PR, we add an upfront check on STRING-form interval values when parsing multiple-unit intervals, and fail directly if the interval value contains alphabetic characters, to prevent correctness issues like `interval '1 day 2' day` = `3 days`.

### Why are the changes needed?

fix correctness issue

### Does this PR introduce _any_ user-facing change?

Yes, in Spark 3.0.0 `interval '1 day 2' day` = `3 days`, but now we fail with a ParseException.
### How was this patch tested?

add a test.

Closes #29708 from yaooqinn/SPARK-32840.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 11:20:05 +00:00
Takeshi Yamamuro 7eb76d6988 [SPARK-32828][SQL] Cast from a derived user-defined type to a base type
### What changes were proposed in this pull request?

This PR intends to fix an existing bug below in `UserDefinedTypeSuite`;
```
[info] - SPARK-19311: UDFs disregard UDT type hierarchy (931 milliseconds)
16:22:35.936 WARN org.apache.spark.sql.catalyst.expressions.SafeProjection: Expr codegen error and falling back to interpreter mode
org.apache.spark.SparkException: Cannot cast org.apache.spark.sql.ExampleSubTypeUDT46b1771f to org.apache.spark.sql.ExampleBaseTypeUDT31e8d979.
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeCastFunction(Cast.scala:891)
	at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:852)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147)
    ...
```

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #29691 from maropu/FixUdtBug.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-10 19:19:26 +09:00
Jungtaek Lim (HeartSaVioR) db89b0e1b8 [SPARK-32831][SS] Refactor SupportsStreamingUpdate to represent actual meaning of the behavior
### What changes were proposed in this pull request?

This PR renames `SupportsStreamingUpdate` to `SupportsStreamingUpdateAsAppend`, as the new interface name represents the actual behavior more clearly. This PR also removes the `update()` method (so the interface is now effectively a marker), as implementations of `SupportsStreamingUpdateAsAppend` should support append mode by default, hence there is no need to toggle a flag on it.

### Why are the changes needed?

`SupportsStreamingUpdate` was intended to revive the streaming update output mode for internal data sources, but despite the name, the interface isn't really used to do an actual update on the sink; all sinks implement this interface to do append, so strictly speaking it just supports update as append. Renaming the interface makes that clear.

### Does this PR introduce _any_ user-facing change?

No, as the class is only for internal data sources.

### How was this patch tested?

Jenkins test will follow.

Closes #29693 from HeartSaVioR/SPARK-32831.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-09-10 15:33:18 +09:00
Bryan Cutler e0538bd38c [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.

This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate the dependency on netty-buffer and to allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.

Versions 1.0.0 and 1.0.1 include the following selected fixes/improvements relevant to Spark users:

ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults

View release notes here:
https://arrow.apache.org/release/1.0.1.html
https://arrow.apache.org/release/1.0.0.html

### Why are the changes needed?

Upgrade brings fixes, improvements and stability guarantees.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests with pyarrow 1.0.0 and 1.0.1

Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-10 14:16:19 +09:00
Liang-Chi Hsieh add267c4de [SPARK-32819][SQL] ignoreNullability parameter should be effective recursively
### What changes were proposed in this pull request?

This patch proposes to check the `ignoreNullability` parameter recursively in the `equalsStructurally` method.

### Why are the changes needed?

`equalsStructurally` is used to check type equality. We can optionally ask it to ignore nullability. But the parameter `ignoreNullability` is not passed recursively down to nested types, so it produces a confusing error like:

```
data type mismatch: argument 3 requires array<array<string>> type, however ... is of array<array<string>> type.
```

when running the query `select aggregate(split('abcdefgh',''), array(array('')), (acc, x) -> array(array( x ) ) )`.
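
A minimal sketch of the intended recursive behavior (a standalone helper written against Spark's public `DataType` API, not the patched Catalyst method itself):
```
import org.apache.spark.sql.types._

// Structural type equality that threads ignoreNullability into nested
// array/map/struct element types instead of stopping at the top level.
def structurallyEqual(from: DataType, to: DataType, ignoreNullability: Boolean): Boolean =
  (from, to) match {
    case (ArrayType(fe, fn), ArrayType(te, tn)) =>
      (ignoreNullability || fn == tn) && structurallyEqual(fe, te, ignoreNullability)
    case (MapType(fk, fv, fn), MapType(tk, tv, tn)) =>
      (ignoreNullability || fn == tn) &&
        structurallyEqual(fk, tk, ignoreNullability) &&
        structurallyEqual(fv, tv, ignoreNullability)
    case (StructType(ff), StructType(tf)) =>
      ff.length == tf.length && ff.zip(tf).forall { case (f, t) =>
        (ignoreNullability || f.nullable == t.nullable) &&
          structurallyEqual(f.dataType, t.dataType, ignoreNullability)
      }
    case _ => from == to
  }
```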

### Does this PR introduce _any_ user-facing change?

Yes, fixed a bug when running user queries.

### How was this patch tested?

Unit tests.

Closes #29698 from viirya/SPARK-32819.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 02:53:22 +00:00
Wenchen Fan f7995c576a Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b79.
2020-09-09 18:15:22 +00:00
yangjie01 fc10511d15 [SPARK-32755][SQL][FOLLOWUP] Ensure -- method of AttributeSet have same behavior under Scala 2.12 and 2.13
### What changes were proposed in this pull request?
The `--` method of `AttributeSet` behaves differently under Scala 2.12 and 2.13 because the `--` method of `LinkedHashSet` in Scala 2.13 does not maintain insertion order.

This PR uses Scala 2.12-style code to ensure the `--` method of `AttributeSet` has the same behavior under Scala 2.12 and 2.13, as sketched below.
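
A hedged, standalone sketch of an order-preserving difference (a hypothetical helper, not the exact code in this patch):
```
// Filters the original ordered sequence instead of relying on LinkedHashSet's
// `--`, whose iteration order differs between Scala 2.12 and 2.13.
def orderedDiff[A](left: Seq[A], right: Iterable[A]): Seq[A] = {
  val toRemove = right.toSet
  left.filterNot(toRemove)  // preserves the insertion order of `left`
}

orderedDiff(Seq("a", "b", "c", "d"), Seq("c", "a"))  // Seq("b", "d")
```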

### Why are the changes needed?
The behavior of `AttributeSet`  needs to be compatible with Scala 2.12 and 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Scala 2.12: Pass the Jenkins or GitHub Action

Scala 2.13: Manual test sub-suites of `PlanStabilitySuite`

- **Before**: 293 TESTS FAILED

- **After**: 13 TESTS FAILED (the remaining failures are not associated with the current issue)

Closes #29689 from LuciferYang/SPARK-32755-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-09 14:40:49 +00:00
yangjie01 513d51a2c5 [SPARK-32808][SQL] Fix some test cases of sql/core module in scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to partially resolve [SPARK-32808](https://issues.apache.org/jira/browse/SPARK-32808). A total of 26 failed test cases were fixed; the related suites are as follows:

- `StreamingAggregationSuite` related test cases (2 FAILED -> Pass)

- `GeneratorFunctionSuite` related test cases (2 FAILED -> Pass)

- `UDFSuite` related test cases (2 FAILED -> Pass)

- `SQLQueryTestSuite` related test cases (5 FAILED -> Pass)

- `WholeStageCodegenSuite` related test cases (1 FAILED -> Pass)

- `DataFrameSuite` related test cases (3 FAILED -> Pass)

- `OrcV1QuerySuite\OrcV2QuerySuite` related test cases (4 FAILED -> Pass)

- `ExpressionsSchemaSuite` related test cases (1 FAILED -> Pass)

- `DataFrameStatSuite` related test cases (1 FAILED -> Pass)

- `JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite` related test cases (6 FAILED -> Pass)

The main changes of this PR are as follows:

- Fix Scala 2.13 compilation problems in `ShuffleBlockFetcherIterator` and `Analyzer`

- Specify `Seq` as `scala.collection.Seq` in `objects.scala` and `GenericArrayData`, because the `Seq` used internally may be a `mutable.ArraySeq` and it is not easy to call `.toSeq` on it

- Callers should specify `scala.collection.Seq` when calling `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]`, because the data may be a `mutable.ArraySeq` while the default `Seq` is `immutable.Seq` in Scala 2.13 (see the sketch after this list)

- Use a compatible approach so that the `+` and `-` methods of `Decimal` have the same behavior in Scala 2.12 and Scala 2.13

- Call `toList` in the `RelationalGroupedDataset.toDF` method when `groupingExprs` is a `Stream`, because `Stream` cannot be serialized in Scala 2.13

- Add a manual sort to `classFunsMap` in `ExpressionsSchemaSuite`, because `Iterable.groupBy` in Scala 2.13 returns a different result than `TraversableLike.groupBy` in Scala 2.12
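
A hedged illustration of the `Seq`-aliasing caveat from the list above (a toy `Row`, assuming Spark on the classpath; not one of the fixed tests):
```
import org.apache.spark.sql.Row

// In Scala 2.13 the default Seq is immutable.Seq, but the value stored in a
// Row may be a mutable.ArraySeq, so asking for scala.collection.Seq is safer.
val row = Row(Seq(1, 2, 3))
val xs: scala.collection.Seq[Int] = row.getAs[scala.collection.Seq[Int]](0)
```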

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?

Callers should specify `scala.collection.Seq` when calling `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]`, because the data may be a `mutable.ArraySeq` while the default `Seq` is `immutable.Seq` in Scala 2.13.

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
*** 319 TESTS FAILED ***

```

**After**

```
Tests: succeeded 8204, failed 286, canceled 1, ignored 52, pending 0
*** 286 TESTS FAILED ***

```

Closes #29660 from LuciferYang/SPARK-32808.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-09 08:53:44 -05:00
Wenchen Fan 4144b6da52 [SPARK-32764][SQL] -0.0 should be equal to 0.0
### What changes were proposed in this pull request?

This is a Spark 3.0 regression introduced by https://github.com/apache/spark/pull/26761. We missed a corner case: `java.lang.Double.compare` treats 0.0 and -0.0 as different, which breaks the SQL semantics.

This PR adds back `OrderingUtil` to provide custom compare methods that take care of 0.0 vs -0.0, as illustrated below.
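
A small illustration of the corner case (plain JVM semantics, not Spark internals):
```
// Primitive comparison treats -0.0 and 0.0 as equal, while
// java.lang.Double.compare orders -0.0 strictly before 0.0.
val primitiveEqual = -0.0 == 0.0                          // true
val compareResult  = java.lang.Double.compare(-0.0, 0.0)  // -1
```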

### Why are the changes needed?

Fix a correctness bug.

### Does this PR introduce _any_ user-facing change?

Yes, now `SELECT 0.0 > -0.0` correctly returns false, as in Spark 2.x.

### How was this patch tested?

new tests

Closes #29647 from cloud-fan/float.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-07 20:43:43 -07:00
Wenchen Fan 117a6f135b [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29485

It moves the plan rewriting methods from `Analyzer` to `QueryPlan`, so that they can work with `SparkPlan` as well. This PR also improves the handling of a corner case (the attribute to be replaced appears together with an unresolved attribute) and makes the rewriting more general, so that `WidenSetOperationTypes` can rewrite the plan in one shot as before.

### Why are the changes needed?

Code cleanup and generalization.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #29643 from cloud-fan/cleanup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-08 09:54:05 +09:00
manuzhang c43460cf82 [SPARK-32753][SQL] Only copy tags to node with no tags
### What changes were proposed in this pull request?
Only copy tags to nodes with no tags when transforming plans.

### Why are the changes needed?
cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That can cause bugs such as duplicate rows when deduplicating and repartitioning by the same column with AQE.

```
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)
```

It is too expensive to detect node removal, so as a compromise we only copy tags to nodes that have no tags, as sketched below.
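
A hedged sketch of the guard (a hypothetical, simplified tag container, not Spark's `TreeNode` itself):
```
import scala.collection.mutable

// Tags from `other` are copied only when this node has no tags of its own,
// so a transformed node never accumulates stale tags from removed nodes.
class TaggedNode {
  val tags: mutable.Map[String, Any] = mutable.Map.empty
  def copyTagsFrom(other: TaggedNode): Unit = {
    if (tags.isEmpty) {
      tags ++= other.tags
    }
  }
}
```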

### Does this PR introduce _any_ user-facing change?
Yes. Fix a bug.

### How was this patch tested?
Add test.

Closes #29593 from manuzhang/spark-32753.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 16:08:57 +00:00
ulysses 05fcf26b79 [SPARK-32677][SQL] Load function resource before create
### What changes were proposed in this pull request?

Change `CreateFunctionCommand` to check that the function class can be loaded before creating the function, as sketched below.
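
A hedged, standalone sketch of such a check (a hypothetical helper with an illustrative error message, not the actual command code):
```
// Verifies that the UDF class is on the classpath before registering the
// function, so permanent and temporary functions fail in the same way.
def assertClassExists(className: String): Unit =
  try {
    Class.forName(className, false, Thread.currentThread().getContextClassLoader)
  } catch {
    case _: ClassNotFoundException =>
      throw new IllegalArgumentException(
        s"Can not load class '$className' when registering the function")
  }
```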

### Why are the changes needed?

We have different behavior between creating a permanent function and a temporary function when the function class is invalid, e.g.,
```
create function f as 'test.non.exists.udf';
-- Time taken: 0.104 seconds

create temporary function f as 'test.non.exists.udf'
-- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath;
```

And Hive also fails both of them.

### Does this PR introduce _any_ user-facing change?

Yes, users will now get an exception when creating an invalid UDF.

### How was this patch tested?

New test.

Closes #29502 from ulysses-you/function.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:00:23 +00:00
Kent Yao de44e9cfa0 [SPARK-32785][SQL] Interval with dangling parts should not results null
### What changes were proposed in this pull request?

Bugfix for incomplete interval values, e.g. `interval '1'` and `interval '1 day 2'`. Currently these cases result in null, but they should fail with an IllegalArgumentException.

### Why are the changes needed?

correctness

### Does this PR introduce _any_ user-facing change?

Yes, incomplete intervals will now throw an exception.

#### before
```
bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"

NULL NULL NULL
```
#### after

```
-- !query
select interval '1'
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.catalyst.parser.ParseException

Cannot parse the INTERVAL value: 1(line 1, pos 7)

== SQL ==
select interval '1'
```

### How was this patch tested?

unit tests added

Closes #29635 from yaooqinn/SPARK-32785.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 05:11:30 +00:00
Ali Afroozeh f55694638d [SPARK-32800][SQL] Remove ExpressionSet from the 2.13 branch
### What changes were proposed in this pull request?
This PR is a followup on #29598 and removes the `ExpressionSet` class from the 2.13 branch.

### Why are the changes needed?
`ExpressionSet` does not extend Scala `Set` anymore and this class is no longer needed in the 2.13 branch.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes existing tests

Closes #29648 from dbaliafroozeh/RemoveExpressionSetFrom2.13Branch.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-06 09:44:07 +09:00
Takeshi Yamamuro a6114d8fb8 [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
### What changes were proposed in this pull request?

This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example,
```
CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
SELECT t.v FROM (
  SELECT v FROM t3
  UNION ALL
  SELECT v + v AS v FROM t3
) t;

org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
!Project [v#1]  <------ the reference got missing
+- SubqueryAlias t
   +- Union
      :- Project [cast(v#1 as decimal(11,0)) AS v#3]
      :  +- Project [v#1]
      :     +- SubqueryAlias t3
      :        +- SubqueryAlias tbl
      :           +- LocalRelation [v#1]
      +- Project [v#2]
         +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
            +- SubqueryAlias t3
               +- SubqueryAlias tbl
                  +- LocalRelation [v#1]
```
In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and then the reference in the top `Project` went missing. This PR corrects the reference (its `exprId` and widened `dataType`) after adding aliases in the rule.

### Why are the changes needed?

bugfixes

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

Closes #29485 from maropu/SPARK-32638.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-03 14:48:26 +00:00