ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
alexander-daskalov	10edeafc69	[MINOR][SQL] Fixed approx_count_distinct rsd param description ### What changes were proposed in this pull request? In the docs for approx_count_distinct, I changed the description of the rsd parameter from _maximum estimation error allowed_ to _maximum relative standard deviation allowed_. ### Why are the changes needed? _Maximum estimation error allowed_ can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on any given run the estimation error can still be above the rsd parameter. ### Does this PR introduce _any_ user-facing change? This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but only sets the target deviation. ### How was this patch tested? No tests, as no code changes were made. Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description. Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-14 22:10:41 +09:00
yangjie01	6ae2cb2db3	[SPARK-32526][SQL] Fix some test cases of `sql/catalyst` module in scala 2.13 ### What changes were proposed in this pull request? The purpose of this pr is to partial resolve [SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526), total of 88 failed and 2 aborted test cases were fixed, the related suite as follow: - `DataSourceV2AnalysisBaseSuite` related test cases (71 FAILED -> Pass) - `TreeNodeSuite` (1 FAILED -> Pass) - `MetadataSuite `(1 FAILED -> Pass) - `InferFiltersFromConstraintsSuite `(3 FAILED -> Pass) - `StringExpressionsSuite ` (1 FAILED -> Pass) - `JacksonParserSuite ` (1 FAILED -> Pass) - `HigherOrderFunctionsSuite `(1 FAILED -> Pass) - `ExpressionParserSuite` (1 FAILED -> Pass) - `CollectionExpressionsSuite `(6 FAILED -> Pass) - `SchemaUtilsSuite` (2 FAILED -> Pass) - `ExpressionSetSuite `(ABORTED -> Pass) - `ArrayDataIndexedSeqSuite `(ABORTED -> Pass) The main change of this pr as following: - `Optimizer` and `Analyzer` are changed to pass compile, `ArrayBuffer` is not a `Seq` in scala 2.13, call `toSeq` method manually to compatible with Scala 2.12 - `m.mapValues().view.force` pattern return a `Map` in scala 2.12 but return a `IndexedSeq` in scala 2.13, call `toMap` method manually to compatible with Scala 2.12. `TreeNode` are changed to pass `DataSourceV2AnalysisBaseSuite` related test cases and `TreeNodeSuite` failed case. - call `toMap` method of `Metadata#hash` method `case map` branch because `map.mapValues` return `Map` in Scala 2.12 and return `MapView` in Scala 2.13. 
- `impl` contact method of `ExpressionSet` in Scala 2.13 version refer to `ExpressionSet` in Scala 2.12 to support `+ + ` method conform to `ExpressionSet` semantics - `GenericArrayData` not accept `ArrayBuffer` input, call `toSeq` when use `ArrayBuffer` construction `GenericArrayData` for Scala version compatibility - Call `toSeq` in `RandomDataGenerator#randomRow` method to ensure contents of `fields` is `Seq` not `ArrayBuffer` - Call `toSeq` Let `JacksonParser#parse` still return a `Seq` because the check method of `JacksonParserSuite#"skipping rows using pushdown filters"` dependence on `Seq` type - Call `toSeq` in `AstBuilder#visitFunctionCall`, otherwise `ctx.argument.asScala.map(expression)` is `Buffer` in Scala 2.13 - Add a `LongType` match to `ArraySetLike.nullValueHolder` - Add a `sorted` to ensure `duplicateColumns` string in `SchemaUtils.checkColumnNameDuplication` method error message have a deterministic order ### Why are the changes needed? We need to support a Scala 2.13 build. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl sql/catalyst -Pscala-2.13 -am mvn test -pl sql/catalyst -Pscala-2.13 ``` Before ``` Tests: succeeded 3853, failed 103, canceled 0, ignored 6, pending 0 * 3 SUITES ABORTED * * 103 TESTS FAILED * ``` After ``` Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0 * 1 SUITE ABORTED * * 15 TESTS FAILED * ``` Closes #29370 from LuciferYang/fix-DataSourceV2AnalysisBaseSuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-08-13 11:46:30 -05:00
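The two collection differences that this commit works around can be sketched in plain Scala. This is a minimal illustration that is meant to compile on both 2.12 and 2.13; the object and method names are hypothetical and are not taken from the Spark sources.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object, for illustration only.
object CollectionCompat {
  // In Scala 2.13, scala.Seq is immutable.Seq, so a mutable ArrayBuffer no
  // longer conforms to it. An explicit toSeq type-checks on 2.12 and 2.13.
  def collect(items: Int*): Seq[Int] = {
    val buf = ArrayBuffer.empty[Int]
    items.foreach(buf += _)
    buf.toSeq
  }

  // mapValues returns a Map in 2.12 and a lazy MapView in 2.13 (where it is
  // also deprecated). An explicit toMap yields a strict Map on both.
  def doubled(m: Map[String, Int]): Map[String, Int] =
    m.mapValues(_ * 2).toMap
}
```

Calling `toSeq` and `toMap` explicitly is redundant on 2.12 but harmless, which is why the same source can serve both builds.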
fqaiser94@gmail.com	0c850c71e7	[SPARK-32511][SQL] Add dropFields method to Column class ### What changes were proposed in this pull request? Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ \|a \| +---------------------------------+ \|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]\| +---------------------------------+ ``` Currently, to drop the missing value users would have to do something like this: ``` val result = data.withColumn("a", 
struct( $"a.a", struct( struct( $"a.b.a.a", $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this to get the same result: ``` val result = data.withColumn("a", 'a.dropFields("b.a.b")) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`. However, this should be added in a separate PR. ### Does this PR introduce _any_ user-facing change? Only one minor change. 
If the user submits the following query: ``` df.withColumn("a", $"a".withField(null, null)) ``` instead of throwing: ``` java.lang.IllegalArgumentException: requirement failed: fieldName cannot be null ``` it will now throw: ``` java.lang.IllegalArgumentException: requirement failed: col cannot be null ``` I don't believe its should be an issue to change this because: - neither message is incorrect - Spark 3.1.0 has yet to be released but please feel free to correct me if I am wrong. ### How was this patch tested? New unit tests were added. Jenkins must pass them. ### Related JIRAs: More discussion on this topic can be found here: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #29322 from fqaiser94/SPARK-32511. Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Co-authored-by: fqaiser94 <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-13 03:28:25 +00:00
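The semantics of dropping one nested field while leaving its siblings untouched can be modelled without Spark. The sketch below is a toy model under stated assumptions: `Value`, `Leaf`, `Struct` and `dropField` are hypothetical names, and the code is not the `Column.dropFields` implementation.

```scala
// Toy model of a nested struct; all names here are hypothetical.
object DropFieldSketch {
  sealed trait Value
  final case class Leaf(v: Any) extends Value
  final case class Struct(fields: List[(String, Value)]) extends Value

  // Remove the field addressed by a dot-separated path such as "b.a.b".
  def dropField(s: Struct, path: String): Struct =
    dropPath(s, path.split('.').toList)

  private def dropPath(s: Struct, path: List[String]): Struct = path match {
    case Nil         => s
    // Last path segment: filter the named field out of this struct.
    case name :: Nil => Struct(s.fields.filterNot(_._1 == name))
    // Otherwise descend into the named child struct and rebuild only that branch.
    case name :: rest =>
      Struct(s.fields.map {
        case (n, child: Struct) if n == name => (n, dropPath(child, rest))
        case other                           => other
      })
  }
}
```

Only the branch along the path is rebuilt, so callers do not have to list the fields they want to keep.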
stczwd	60fa8e304d	[SPARK-31694][SQL] Add SupportsPartitions APIs on DataSourceV2 ### What changes were proposed in this pull request? DataSourceV2 does not support partition commands such as AlterTableAddPartition, which are widely used in MySQL, Hive, and other data sources. It is therefore necessary to define a partition API to support these commands. We defined the partition API as part of the Table API, as it will sometimes change table data. A partition is composed of an identifier and properties; the identifier is defined with InternalRow and the properties are defined as a Map. ### Does this PR introduce _any_ user-facing change? Yes. This PR will enable users to use some partition commands. ### How was this patch tested? Ran all tests and added some partition API tests. Closes #28617 from stczwd/SPARK-31694. Authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-12 17:25:47 +00:00
Yuming Wang	5d130f0360	[SPARK-32586][SQL] Fix NumberFormatException error message when ansi is enabled ### What changes were proposed in this pull request? This pr fixes the error message of `NumberFormatException` when casting invalid input to FractionalType and enabling ansi: ``` spark-sql> set spark.sql.ansi.enabled=true; spark.sql.ansi.enabled true spark-sql> create table SPARK_32586 using parquet as select 's' s; spark-sql> select * from SPARK_32586 where s > 1.13D; java.lang.NumberFormatException: invalid input syntax for type numeric: columnartorow_value_0 ``` After this pr: ``` spark-sql> select * from SPARK_32586 where s > 1.13D; java.lang.NumberFormatException: invalid input syntax for type numeric: s ``` ### Why are the changes needed? Improve error message. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29405 from wangyum/SPARK-32586. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-12 13:16:57 +09:00
gengjiaan	e7c1204f6c	[SPARK-32540][SQL] Eliminate the filter clause in aggregate ### What changes were proposed in this pull request? Spark SQL supports the filter clause in aggregates, for example: `select sum(distinct id) filter (where sex = 'man') from student;` But sometimes we can eliminate the filter clause in an aggregate. `SELECT COUNT(DISTINCT 1) FILTER (WHERE true) FROM testData;` could be transformed to `SELECT COUNT(DISTINCT 1) FROM testData;` `SELECT COUNT(DISTINCT 1) FILTER (WHERE false) FROM testData;` could be transformed to `SELECT 0 FROM testData;` ### Why are the changes needed? To optimize the filter clause in aggregation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #29369 from beliefer/eliminate-filter-clause. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-11 16:20:19 +00:00
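The two rewrites quoted in this commit can be sketched as a small pattern match over a toy expression type. This is an illustrative model only: the case classes below are hypothetical and much simpler than Catalyst's expression tree, and only the `COUNT(DISTINCT ...)` case from the message is covered.

```scala
// Toy model of the rewrite; these types are hypothetical, not Catalyst classes.
object FilterElimination {
  sealed trait Cond
  case object True extends Cond
  case object False extends Cond
  final case class Pred(sql: String) extends Cond

  final case class CountDistinct(arg: String, filter: Option[Cond])

  sealed trait Rewritten
  final case class Agg(call: CountDistinct) extends Rewritten
  final case class Literal(value: Long) extends Rewritten

  def simplify(c: CountDistinct): Rewritten = c.filter match {
    case Some(True)  => Agg(c.copy(filter = None)) // FILTER (WHERE true): drop the clause
    case Some(False) => Literal(0L)                // FILTER (WHERE false): a count over no rows is 0
    case _           => Agg(c)                     // any other predicate is kept as is
  }
}
```

The `false` branch folds to `0` because the aggregate here is a count; other aggregates return different results over an empty input and would need their own handling.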
xuewei.linxuewei	c37357a092	[SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys ### What changes were proposed in this pull request? In [SPARK-32290](https://issues.apache.org/jira/browse/SPARK-32290), we introduced several new types of HashedRelation. * EmptyHashedRelation * EmptyHashedRelationWithAllNullKeys They were limited to the null-aware anti join (NAAJ) scenario. These new HashedRelations can be applied to other scenarios for performance improvements. * EmptyHashedRelation can also be used in a normal anti join for fast stop * When AQE is on and buildSide is EmptyHashedRelationWithAllNullKeys, the NAAJ can be converted to an empty LocalRelation to skip meaningless data iteration, since in a single-key NAAJ a null key in buildSide drops all records in streamedSide. This patch includes two changes. * Use EmptyHashedRelation to do fast stop for common anti joins as well * In AQE, eliminate BroadcastHashJoin(NAAJ) if buildSide is an EmptyHashedRelationWithAllNullKeys ### Why are the changes needed? LeftAntiJoin can apply `fast stop` when buildSide is EmptyHashedRelation, and within AQE with EmptyHashedRelationWithAllNullKeys we can eliminate the NAAJ. This should be a performance improvement for anti joins. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * added case in AdaptiveQueryExecSuite. * added case in HashedRelationSuite. * Made sure SubquerySuite, JoinSuite, and SQLQueryTestSuite passed. Closes #29389 from leanken/leanken-SPARK-32573. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-11 06:23:51 +00:00
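The two shortcuts described in this commit follow from the semantics of a single-key null-aware anti join (the `NOT IN` case). The sketch below models those semantics on plain Scala collections, with `None` standing in for a SQL NULL key. The names are hypothetical and nothing here reflects how `HashedRelation` is implemented.

```scala
// Toy model of single-key null-aware anti join semantics; names are hypothetical.
object AntiJoinSketch {
  def nullAwareAntiJoin(
      streamed: Seq[Option[Int]],
      build: Seq[Option[Int]]): Seq[Option[Int]] =
    if (build.isEmpty) {
      // Empty build side: nothing can match, so every streamed row is kept
      // without probing (the "fast stop" case).
      streamed
    } else if (build.contains(None)) {
      // A NULL key on the build side: NOT IN can never evaluate to true,
      // so the whole join result is empty.
      Seq.empty
    } else {
      val keys = build.flatten.toSet
      // A streamed NULL key is dropped; other rows survive only without a match.
      streamed.filter(k => k.exists(v => !keys.contains(v)))
    }
}
```

The first two branches decide the result without looking at individual streamed rows, which is what makes it possible to skip the iteration or replace the join with an empty relation.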
allisonwang-db	1b7443bd9a	[SPARK-32216][SQL] Remove redundant ProjectExec ### What changes were proposed in this pull request? This PR added a physical rule to remove redundant project nodes. A `ProjectExec` is redundant when 1. It has the same output attributes and order as its child's output when ordering of these attributes is required. 2. It has the same output attributes as its child's output when attribute output ordering is not required. For example: After Filter: ``` == Physical Plan == (1) Project [a#14L, b#15L, c#16, key#17] +- (1) Filter (isnotnull(a#14L) AND (a#14L > 5)) +- (1) ColumnarToRow +- FileScan parquet [a#14L,b#15L,c#16,key#17] ``` The `Project a#14L, b#15L, c#16, key#17` is redundant because its output is exactly the same as filter's output. Before Aggregate: ``` == Physical Plan == (2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L]) +- Exchange hashpartitioning(key#17, 5), true, [id=#77] +- (1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51]) +- (1) Project [key#17, a#14L, b#15L] +- (1) Filter (isnotnull(a#14L) AND (a#14L > 100)) +- (1) ColumnarToRow +- FileScan parquet [a#14L,b#15L,key#17] ``` The `Project key#17, a#14L, b#15L` is redundant because hash aggregate doesn't require child plan's output to be in a specific order. ### Why are the changes needed? It removes unnecessary query nodes and makes query plan cleaner. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #29031 from allisonwang-db/remove-project. Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Co-authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-11 03:14:15 +00:00
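The two redundancy conditions listed in this commit can be written as a single predicate. This is a minimal sketch in which attributes are represented as plain strings; `isRedundant` and its parameters are hypothetical names, not the physical rule itself.

```scala
// Toy version of the redundancy check; attribute names stand in for Catalyst attributes.
object RedundantProject {
  def isRedundant(
      projectOutput: Seq[String],
      childOutput: Seq[String],
      orderingRequired: Boolean): Boolean =
    if (orderingRequired) {
      // Condition 1: same attributes in the same order.
      projectOutput == childOutput
    } else {
      // Condition 2: same attributes, order ignored.
      projectOutput.size == childOutput.size &&
        projectOutput.toSet == childOutput.toSet
    }
}
```

With the second example from the message, `isRedundant(Seq("key", "a", "b"), Seq("a", "b", "key"), orderingRequired = false)` is true, while the same call with `orderingRequired = true` is false.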
Yuanjian Li	b03761e330	[SPARK-32456][SS] Check the Distinct by assuming it as Aggregate for Structured Streaming ### What changes were proposed in this pull request? Check the Distinct nodes by assuming it as Aggregate in `UnsupportOperationChecker` for streaming. ### Why are the changes needed? We want to fix 2 things here: 1. Give better error message for Distinct related operations in append mode that doesn't have a watermark We use the union streams as the example, distinct in SQL has the same issue. Since the union clause in SQL has the requirement of deduplication, the parser will generate `Distinct(Union)` and the optimizer rule `ReplaceDistinctWithAggregate` will change it to `Aggregate(Union)`. This logic is of both batch and streaming queries. However, in the streaming, the aggregation will be wrapped by state store operations so we need extra checking logic in `UnsupportOperationChecker`. Before this change, the SS union queries in Append mode will get the following confusing error when the watermark is lacking. ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) ... ``` 2. Make `Distinct` in complete mode runnable. Before this fix, the distinct in complete mode will throw the exception: ``` Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets; ``` ### Does this PR introduce _any_ user-facing change? Yes, return a better error message. ### How was this patch tested? New UT added. Closes #29256 from xuanyuanking/SPARK-32456. 
Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-10 14:01:31 +09:00
allisonwang-db	924c161544	[SPARK-32337][SQL] Show initial plan in AQE plan tree string ### What changes were proposed in this pull request? This PR adds initial plan in `AdaptiveSparkPlanExec` and generates tree string for both current plan and initial plan. When the adaptive plan is not final, `Current Plan` will be used to indicate current physical plan, and `Final Plan` will be used when the adaptive plan is final. The difference between `Current Plan` and `Final Plan` here is that current plan indicates an intermediate state. The plan is subject to further transformations, while final plan represents an end state, which means the plan will no longer be changed. Examples: Before this change: ``` AdaptiveSparkPlan isFinalPlan=true +- (3) BroadcastHashJoin :- BroadcastQueryStage 2 ... ``` `EXPLAIN FORMATTED` ``` == Physical Plan == AdaptiveSparkPlan (9) +- BroadcastHashJoin Inner BuildRight (8) :- Project (3) : +- Filter (2) ``` After this change ``` AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == (3) BroadcastHashJoin :- BroadcastQueryStage 2 : +- BroadcastExchange ... +- == Initial Plan == SortMergeJoin :- Sort : +- Exchange ... ``` `EXPLAIN FORMATTED` ``` == Physical Plan == AdaptiveSparkPlan (9) +- == Current Plan == BroadcastHashJoin Inner BuildRight (8) :- Project (3) : +- Filter (2) +- == Initial Plan == BroadcastHashJoin Inner BuildRight (8) :- Project (3) : +- Filter (2) ``` ### Why are the changes needed? It provides better visibility into the plan change introduced by AQE. ### Does this PR introduce _any_ user-facing change? Yes. It changed the AQE plan output string. ### How was this patch tested? Unit test Closes #29137 from allisonwang-db/aqe-plan. Lead-authored-by: allisonwang-db <allison.wang@databricks.com> Co-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-10 04:49:37 +00:00
Liang-Chi Hsieh	7b6e1d5cec	[SPARK-25557][SQL] Nested column predicate pushdown for ORC ### What changes were proposed in this pull request? We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC. ### Why are the changes needed? Extending the feature to ORC for feature parity. Better performance for handling nested predicate pushdown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #28761 from viirya/SPARK-25557. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-07 08:07:41 -07:00
Wenchen Fan	d5682c13a2	[SPARK-32528][SQL][TEST] The analyze method should make sure the plan is analyzed ### What changes were proposed in this pull request? This PR updates the `analyze` method to make sure the plan can be resolved. It also fixes some miswritten optimizer tests. ### Why are the changes needed? It's error-prone if the `analyze` method can return an unresolved plan. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test only Closes #29349 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-07 07:36:08 +00:00
Max Gekk	3a437ed22b	[SPARK-32501][SQL] Convert null to "null" in structs, maps and arrays while casting to strings ### What changes were proposed in this pull request? Convert `NULL` elements of maps, structs and arrays to the `"null"` string while converting maps/struct/array values to strings. The SQL config `spark.sql.legacy.omitNestedNullInCast.enabled` controls the behaviour. When it is `true`, `NULL` elements of structs/maps/arrays will be omitted otherwise, when it is `false`, `NULL` elements will be converted to `"null"`. ### Why are the changes needed? 1. It is impossible to distinguish empty string and null, for instance: ```scala scala> Seq(Seq(""), Seq(null)).toDF().show +-----+ \|value\| +-----+ \| []\| \| []\| +-----+ ``` 2. Inconsistent NULL conversions for top-level values and nested columns, for instance: ```scala scala> sql("select named_struct('c', null), null").show +---------------------+----+ \|named_struct(c, NULL)\|NULL\| +---------------------+----+ \| []\|null\| +---------------------+----+ ``` 3. `.show()` is different from conversions to Hive strings, and as a consequence its output is different from `spark-sql` (sql tests): ```sql spark-sql> select named_struct('c', null) as struct; {"c":null} ``` ```scala scala> sql("select named_struct('c', null) as struct").show +------+ \|struct\| +------+ \| []\| +------+ ``` 4. It is impossible to distinguish empty struct/array from struct/array with null in the current implementation: ```scala scala> Seq[Seq[String]](Seq(), Seq(null)).toDF.show() +-----+ \|value\| +-----+ \| []\| \| []\| +-----+ ``` ### Does this PR introduce _any_ user-facing change? Yes, before: ```scala scala> Seq(Seq(""), Seq(null)).toDF().show +-----+ \|value\| +-----+ \| []\| \| []\| +-----+ ``` After: ```scala scala> Seq(Seq(""), Seq(null)).toDF().show +------+ \| value\| +------+ \| []\| \|[null]\| +------+ ``` ### How was this patch tested? By existing test suite `CastSuite`. 
Closes #29311 from MaxGekk/nested-null-to-string. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-05 12:03:36 +00:00
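The before and after outputs shown in this commit can be reproduced with a small string renderer. The sketch below is a simplified model for arrays of strings only. `render` and its `omitNestedNull` flag are hypothetical names that mirror the meaning of `spark.sql.legacy.omitNestedNullInCast.enabled`; this is not the code in `Cast`.

```scala
// Toy renderer for array-to-string casting; None models a NULL element.
object ArrayToString {
  def render(xs: Seq[Option[String]], omitNestedNull: Boolean): String = {
    val sb = new StringBuilder("[")
    xs.zipWithIndex.foreach { case (x, i) =>
      // Legacy mode prints nothing for NULL; the new mode prints "null".
      val text: Option[String] = x.orElse(if (omitNestedNull) None else Some("null"))
      if (i > 0) sb.append(",")
      text.foreach { t =>
        if (i > 0) sb.append(" ")
        sb.append(t)
      }
    }
    sb.append("]").toString
  }
}
```

In legacy mode `Seq(Some(""))` and `Seq(None)` both render as `[]`, which is the ambiguity the commit removes. With the flag off, the second renders as `[null]`.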
Max Gekk	7eb6f45688	[SPARK-32499][SQL] Use `{}` in conversions maps and structs to strings ### What changes were proposed in this pull request? Change casting of map and struct values to strings by using the `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs by `[]` in casting to strings. Otherwise, if this is `false`, which is the default, maps and structs are wrapped by `{}`. ### Why are the changes needed? - To distinguish structs/maps from arrays. - To make `show`'s output consistent with Hive and conversions to Hive strings. - To display dataframe content in the same form by `spark-sql` and `show` - To be consistent with the `*.sql` tests ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By existing test suite `CastSuite`. Closes #29308 from MaxGekk/show-struct-map. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-04 14:57:09 +00:00
fqaiser94@gmail.com	6d69068057	[SPARK-32521][SQL] Bug-fix: WithFields Expression should not be foldable ### What changes were proposed in this pull request? Make WithFields Expression not foldable. ### Why are the changes needed? The following query currently fails on the master branch: ``` sql("SELECT named_struct('a', 1, 'b', 2) a") .select($"a".withField("c", lit(3)).as("a")) .show(false) // java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3) ``` This happens because the Catalyst optimizer sees that the WithFields Expression is foldable and tries to statically evaluate it (via the ConstantFolding rule); however, it cannot do so because the WithFields Expression is Unevaluable. ### Does this PR introduce _any_ user-facing change? Yes, queries like the one shared above will now succeed. That said, this bug was introduced in Spark 3.1.0, which has yet to be released. ### How was this patch tested? A new unit test was added. Closes #29338 from fqaiser94/SPARK-32521. Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Co-authored-by: fqaiser94 <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-04 12:11:04 +00:00
gengjiaan	1597d8fcd4	[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT ### What changes were proposed in this pull request? This PR is related to https://github.com/apache/spark/pull/26656. https://github.com/apache/spark/pull/26656 only supports the FILTER clause on aggregate expressions without DISTINCT. This PR enhances the feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause, such as: ``` select sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id; select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student; select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id; select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student; select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id; ``` ### Why are the changes needed? Spark SQL only supports the FILTER clause on aggregate expressions without DISTINCT. This PR allows the FILTER clause and DISTINCT to be used together. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Existing and new UTs. Closes #29291 from beliefer/support-distinct-with-filter. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-04 04:41:19 +00:00
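What `sum(distinct id) filter (where sex = 'man')` is expected to compute can be spelled out on an in-memory collection: the filter applies first, then duplicates are removed, then the sum is taken. The sketch below is a reference model only. The `Student` case class is a hypothetical stand-in for the `student` table in the queries above.

```scala
// Reference model on plain collections; Student is a hypothetical stand-in for the table.
object FilteredDistinct {
  final case class Student(id: Int, sex: String, classId: Int)

  // select sum(distinct id) filter (where sex = 'man') from student
  // Note: SQL returns NULL when no row passes the filter; this model returns 0.
  def sumDistinctIdOfMen(rows: Seq[Student]): Int =
    rows.filter(_.sex == "man").map(_.id).distinct.sum

  // select class_id, sum(distinct id) filter (where sex = 'man')
  //   from student group by class_id
  def sumDistinctIdOfMenByClass(rows: Seq[Student]): Map[Int, Int] =
    rows.groupBy(_.classId).map { case (c, rs) => c -> sumDistinctIdOfMen(rs) }
}
```

The filter belongs to one aggregate only, so other aggregates in the same query still see every row of the group.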
Max Gekk	f3b10f526b	[SPARK-32290][SQL][FOLLOWUP] Add version for the SQL config `spark.sql.optimizeNullAwareAntiJoin` ### What changes were proposed in this pull request? Add the version `3.1.0` for the SQL config `spark.sql.optimizeNullAwareAntiJoin`. ### Why are the changes needed? To inform users when the config was added, for example on the page http://spark.apache.org/docs/latest/configuration.html. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By compiling and running `./dev/scalastyle`. Closes #29335 from MaxGekk/leanken-SPARK-32290-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-03 16:05:54 +00:00
Takeshi Yamamuro	c6109ba918	[SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command ### What changes were proposed in this pull request? This PR modified the parser code to handle invalid usages of a SET/RESET command. For example; ``` SET spark.sql.ansi.enabled true ``` The above SQL command does not change the configuration value and it just tries to display the value of the configuration `spark.sql.ansi.enabled true`. This PR disallows using special characters including spaces in the configuration name and reports a user-friendly error instead. In the error message, it tells users a workaround to use quotes or a string literal if they still needs to specify a configuration with them. Before this PR: ``` scala> sql("SET spark.sql.ansi.enabled true").show(1, -1) +---------------------------+-----------+ \|key \|value \| +---------------------------+-----------+ \|spark.sql.ansi.enabled true\|<undefined>\| +---------------------------+-----------+ ``` After this PR: ``` scala> sql("SET spark.sql.ansi.enabled true") org.apache.spark.sql.catalyst.parser.ParseException: Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0) == SQL == SET spark.sql.ansi.enabled true ^^^ ``` ### Why are the changes needed? For better user-friendly errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests in `SparkSqlParserSuite`. Closes #29146 from maropu/SPARK-32257. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-03 13:00:07 +00:00
Kent Yao	3deb59d5c2	[SPARK-31709][SQL] Proper base path for database/table location when it is a relative path ### What changes were proposed in this pull request? Currently, the user home directory is used as the base path for the database and table locations when their locations are specified with relative paths, e.g. ```sql > set spark.sql.warehouse.dir; spark.sql.warehouse.dir file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/ spark-sql> create database loctest location 'loctestdbdir'; spark-sql> desc database loctest; Database Name loctest Comment Location file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir Owner kentyao spark-sql> create table loctest(id int) location 'loctestdbdir'; spark-sql> desc formatted loctest; id int NULL # Detailed Table Information Database default Table loctest Owner kentyao Created Time Thu May 14 16:29:05 CST 2020 Last Access UNKNOWN Created By Spark 3.1.0-SNAPSHOT Type EXTERNAL Provider parquet Location file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat ``` The user home is not always warehouse-related, cannot be changed at runtime, and is shared by both databases and tables as the parent directory. Meanwhile, we use the table path as the parent directory for relative partition locations. The config `spark.sql.warehouse.dir` represents `the default location for managed databases and tables`. For databases, the case above does not seem to follow its semantics, because it should use `spark.sql.warehouse.dir` as the base path instead. For tables, it seems to be right, but here I suggest enriching the meaning so that it is also the base path for external tables with relative paths for locations. 
With changes in this PR, The location of a database will be `warehouseDir/dbpath` when `dbpath` is relative. The location of a table will be `dbpath/tblpath` when `tblpath` is relative. ### Why are the changes needed? bugfix and improvement Firstly, the databases with relative locations should be created under the default location specified by `spark.sql.warehouse.dir`. Secondly, the external tables with relative paths may also follow this behavior for consistency. At last, the behavior for database, tables and partitions with relative paths to choose base paths should be the same. ### Does this PR introduce _any_ user-facing change? Yes, this PR changes the `createDatabase`, `alterDatabase`, `createTable` and `alterTable` APIs and related DDLs. If the LOCATION clause is followed by a relative path, the root path will be `spark.sql.warehouse.dir` for databases, and `spark.sql.warehouse.dir` / `dbPath` for tables. e.g. #### after ```sql spark-sql> desc database loctest; Database Name loctest Comment Location file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest Owner kentyao spark-sql> use loctest; spark-sql> create table loctest(id int) location 'loctest'; 20/05/14 18:18:02 WARN InMemoryFileIndex: The directory file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/loctest was not found. Was it deleted very recently? 20/05/14 18:18:02 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist spark-sql> desc formatted loctest; id int NULL # Detailed Table Information Database loctest Table loctest Owner kentyao Created Time Thu May 14 18:18:03 CST 2020 Last Access UNKNOWN Created By Spark 3.1.0-SNAPSHOT Type EXTERNAL Provider parquet Location file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat spark-sql> alter table loctest set location 'loctest2' > ; spark-sql> desc formatted loctest; id int NULL # Detailed Table Information Database loctest Table loctest Owner kentyao Created Time Thu May 14 18:18:03 CST 2020 Last Access UNKNOWN Created By Spark 3.1.0-SNAPSHOT Type EXTERNAL Provider parquet Location file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest2 Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat ``` ### How was this patch tested? Add unit tests. Closes #28527 from yaooqinn/SPARK-31709. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-03 12:48:22 +00:00
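The base-path rule described in SPARK-31709 (a relative database location resolves under the warehouse directory, a relative table location resolves under its database location) can be illustrated with a minimal sketch. This is not Spark's implementation: the function names and the example paths are hypothetical, and plain POSIX paths stand in for the URIs that Spark actually handles.

```python
import posixpath


def resolve_location(base: str, location: str) -> str:
    """Return `location` unchanged if it is absolute, else place it under `base`."""
    if posixpath.isabs(location):
        return location
    return posixpath.join(base, location)


def database_location(warehouse_dir: str, db_path: str) -> str:
    # Relative database paths are resolved against the warehouse directory.
    return resolve_location(warehouse_dir, db_path)


def table_location(warehouse_dir: str, db_path: str, tbl_path: str) -> str:
    # Relative table paths are resolved against the (already resolved) database path.
    return resolve_location(database_location(warehouse_dir, db_path), tbl_path)


warehouse = "/opt/spark/spark-warehouse"  # hypothetical warehouse directory
print(database_location(warehouse, "loctest"))
print(table_location(warehouse, "loctest", "loctest2"))
print(table_location(warehouse, "loctest", "/data/external/t"))
```

Absolute locations pass through untouched, which matches the commit's scope: only relative paths get a new base.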
beliefer	42f9ee4c7d	[SPARK-24884][SQL] Support regexp function regexp_extract_all ### What changes were proposed in this pull request? `regexp_extract_all` is a very useful function that expands the capabilities of `regexp_extract`. Some examples of this function: ``` SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1 SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14] SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a' SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm'] ``` Some mainstream systems support this syntax. Presto: https://prestodb.io/docs/current/functions/regexp.html Pig: https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html Note: This PR picks up the work of https://github.com/apache/spark/pull/21985 ### Why are the changes needed? `regexp_extract_all` is a very useful function and makes work easier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT Closes #27507 from beliefer/support-regexp_extract_all. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-03 06:03:55 +00:00
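The difference between the two functions in SPARK-24884 (first match versus every match, with a group index selecting what is returned) can be mimicked with Python's `re` module. This is only an emulation for illustration: Python and Java regex dialects differ in places, the function bodies are not Spark code, and returning an empty string when nothing matches is an assumption made for this sketch.

```python
import re


def regexp_extract(text: str, pattern: str, idx: int = 1) -> str:
    """Return group `idx` of the first match, or "" when there is no match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else ""


def regexp_extract_all(text: str, pattern: str, idx: int = 1) -> list:
    """Return group `idx` of every non-overlapping match, in order."""
    return [match.group(idx) for match in re.finditer(pattern, text)]


print(regexp_extract("1a 2b 14m", r"\d+", 0))
print(regexp_extract_all("1a 2b 14m", r"\d+", 0))
print(regexp_extract("1a 2b 14m", r"(\d+)([a-z]+)", 2))
print(regexp_extract_all("1a 2b 14m", r"(\d+)([a-z]+)", 2))
```

The inputs are the ones used in the commit message; note that this emulation returns strings, so the digits come back as `'1'`, `'2'`, `'14'`.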
Robert (Bobby) Evans	713124d5e3	[SPARK-32274][SQL] Make SQL cache serialization pluggable ### What changes were proposed in this pull request? Add a config to let users change how SQL/Dataframe data is compressed when cached. This adds a few new classes/APIs for use with this config. 1. `CachedBatch` is a trait used to tag data that is intended to be cached. It has a few APIs that lets us keep the compression/serialization of the data separate from the metrics about it. 2. `CachedBatchSerializer` provides the APIs that must be implemented to cache data. * `convertForCache` is an API that runs a cached spark plan and turns its result into an `RDD[CachedBatch]`. The actual caching is done outside of this API * `buildFilter` is an API that takes a set of predicates and builds a filter function that can be used to filter the `RDD[CachedBatch]` returned by `convertForCache` * `decompressColumnar` decompresses an `RDD[CachedBatch]` into an `RDD[ColumnarBatch]` This is only used for a limited set of data types. These data types may expand in the future. If they do we can add in a new API with a default value that says which data types this serializer supports. * `decompressToRows` decompresses an `RDD[CachedBatch]` into an `RDD[InternalRow]` this API, like `decompressColumnar` decompresses the data in `CachedBatch` but turns it into `InternalRow`s, typically using code generation for performance reasons. There is also an API that lets you reuse the current filtering based on min/max values. `SimpleMetricsCachedBatch` and `SimpleMetricsCachedBatchSerializer`. ### Why are the changes needed? This lets users explore different types of compression and compression ratios. ### Does this PR introduce _any_ user-facing change? This adds in a single config, and exposes some developer API classes described above. ### How was this patch tested? I ran the unit tests around this and I also did some manual performance tests. 
I could not find any performance difference between the old and new code, and if there is any it is within error. Closes #29067 from revans2/pluggable_cache_serializer. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-03 03:15:54 +00:00
Yuanjian Li	354313b6bc	[SPARK-31894][SS][FOLLOW-UP] Rephrase the config doc ### What changes were proposed in this pull request? Address comment in https://github.com/apache/spark/pull/28707#discussion_r461102749 ### Why are the changes needed? Hide the implementation details in the config doc. ### Does this PR introduce _any_ user-facing change? Config doc change. ### How was this patch tested? Document only. Closes #29315 from xuanyuanking/SPARK-31894-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-31 14:16:41 +00:00
Takeshi Yamamuro	30e3042dc5	[SPARK-32488][SQL] Use @parser::members and @lexer::members to avoid generating unused code ### What changes were proposed in this pull request? This PR aims to update `SqlBase.g4` to avoid generating unused code. Currently, ANTLR generates unused methods and variables; `isValidDecimal` and `isHint` are only used in the generated lexer. This PR changes the code to use `parser::members` and `lexer::members` to avoid this. ### Why are the changes needed? To reduce unnecessary code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29296 from maropu/UpdateSqlBase. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-30 07:51:27 +00:00
Max Gekk	99a855575c	[SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources ### What changes were proposed in this pull request? When `spark.sql.caseSensitive` is `false` (the default), check that there are no duplicate column names on the same level (top level or nested levels) when reading from the in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: ``` ### Why are the changes needed? To make the handling of duplicate nested columns similar to the handling of duplicate top-level columns, i.e. output the same error when `spark.sql.caseSensitive` is `false`: ```Scala org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase` ``` Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758. ### Does this PR introduce _any_ user-facing change? Yes. For the example from SPARK-32431: ORC: ```scala java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) ... 
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3 at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) ``` JSON: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` Parquet: ```scala +------------+ \|StructColumn\| +------------+ \| [0,, 1]\| +------------+ ``` Avro: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` After the changes, Parquet, ORC, JSON and Avro output the same error: ```scala Found duplicate column(s) in the data schema: `camelcase`; org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67) ``` ### How was this patch tested? Run modified test suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*" ``` and added new UT to `SchemaUtilsSuite`. Closes #29234 from MaxGekk/nested-case-insensitive-column. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-30 06:05:55 +00:00
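The per-level duplicate check described in SPARK-32431 can be sketched as a small recursive walk over a nested schema. This is an illustration only, not the logic in `SchemaUtils`: the schema representation (a list of `(name, dtype)` pairs where a nested struct is itself a list), the function name, and the sample field names are all hypothetical.

```python
from collections import Counter


def find_duplicate_columns(fields, case_sensitive=False, path=""):
    """Return dotted names that occur more than once on the same nesting level.

    `fields` is a list of (name, dtype) pairs; a struct dtype is a nested list.
    """
    normalize = (lambda name: name) if case_sensitive else str.lower
    counts = Counter(normalize(name) for name, _ in fields)
    duplicates = [path + name for name, n in sorted(counts.items()) if n > 1]
    for name, dtype in fields:
        if isinstance(dtype, list):  # recurse into nested structs
            duplicates += find_duplicate_columns(dtype, case_sensitive, path + name + ".")
    return duplicates


# Hypothetical schema: one struct column whose fields collide when case is ignored.
schema = [
    ("StructColumn", [("camelcase", "long"), ("LowerCase", "long"), ("camelCase", "long")]),
]
print(find_duplicate_columns(schema))
print(find_duplicate_columns(schema, case_sensitive=True))
```

With case-insensitive matching the nested `camelcase`/`camelCase` pair is reported; with case-sensitive matching the schema is accepted, which mirrors the role of `spark.sql.caseSensitive` in the commit message.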
Max Gekk	d897825d2d	[SPARK-32346][SQL] Support filters pushdown in Avro datasource ### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in Avro datasource V1 and V2. 1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default. 2. Renamed `CSVFilters` to `OrderedFilters`. 3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2) 4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`. ### Why are the changes needed? The changes improve performance on synthetic benchmarks up to 2 times on JDK 11: ``` OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 9614 9669 54 0.1 9614.1 1.0X pushdown disabled 10077 10141 66 0.1 10077.2 1.0X w/ filters 4681 4713 29 0.2 4681.5 2.1X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? 
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite` - Re-running `AvroReadBenchmark` using Amazon EC2: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge (spot instance) \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`\| and `./dev/run-benchmarks`: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #29145 from MaxGekk/avro-filters-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-07-30 01:37:42 +08:00
Terry Kim	45b7212fd3	[SPARK-32401][SQL] Migrate function related commands to use UnresolvedFunc to resolve function identifier ### What changes were proposed in this pull request? This PR proposes to migrate the following function related commands to use `UnresolvedFunc` to resolve function identifier: - DROP FUNCTION - DESCRIBE FUNCTION - SHOW FUNCTIONS `DropFunctionStatement`, `DescribeFunctionStatement` and `ShowFunctionsStatement` logical plans are replaced with `DropFunction`, `DescribeFunction` and `ShowFunctions` logical plans respectively, and each contains `UnresolvedFunc` as its child so that it can be resolved in `Analyzer`. ### Why are the changes needed? Migrating to the new resolution framework, which resolves `UnresolvedFunc` in `Analyzer`. ### Does this PR introduce _any_ user-facing change? The message of exception thrown when a catalog is resolved to v2 has been merged to: `function is only supported in v1 catalog` Previously, it printed out the command used. E.g.,: `CREATE FUNCTION is only supported in v1 catalog` ### How was this patch tested? Updated existing tests. Closes #29198 from imback82/function_framework. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-29 03:55:48 +00:00
Michael Munday	a3d80564ad	[SPARK-32458][SQL][TESTS] Fix incorrectly sized row value reads ### What changes were proposed in this pull request? Updates to tests to use correctly sized `getInt` or `getLong` calls. ### Why are the changes needed? The reads were incorrectly sized (i.e. `putLong` paired with `getInt` and `putInt` paired with `getLong`). This causes test failures on big-endian systems. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests were run on a big-endian system (s390x). This change is unlikely to have any practical effect on little-endian systems. Closes #29258 from mundaym/fix-row. Authored-by: Michael Munday <mike.munday@ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-28 10:36:20 -07:00
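Why a mismatched write/read width goes unnoticed on little-endian machines but fails on big-endian ones (SPARK-32458) can be shown with Python's `struct` module. This is a standalone illustration rather than Spark's row code; the helper name is hypothetical, and the 8-byte write followed by a 4-byte read stands in for a `putLong` paired with a `getInt` at the same offset.

```python
import struct


def write_long_read_int(value: int, byte_order: str) -> int:
    """Write `value` as an 8-byte integer, then read only the first 4 bytes.

    `byte_order` is "<" for little-endian or ">" for big-endian.
    """
    buffer = struct.pack(byte_order + "q", value)        # 8-byte write
    return struct.unpack(byte_order + "i", buffer[:4])[0]  # 4-byte read at offset 0


print(write_long_read_int(7, "<"))  # low-order bytes come first, so the value survives
print(write_long_read_int(7, ">"))  # high-order bytes come first, so the read sees zeros
```

For small values the little-endian read happens to return the right number, which is why such tests pass on x86 and only break on platforms like s390x.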
yi.wu	ca1ecf7f9f	[SPARK-32459][SQL] Support WrappedArray as customCollectionCls in MapObjects ### What changes were proposed in this pull request? This PR supports `WrappedArray` as `customCollectionCls` in `MapObjects`. ### Why are the changes needed? This helps fix the regression caused by SPARK-31826. For the following test, it can pass in branch-3.0 but fail in master branch: ```scala test("WrappedArray") { val myUdf = udf((a: WrappedArray[Int]) => WrappedArray.make[Int](Array(a.head + 99))) checkAnswer(Seq(Array(1)) .toDF("col") .select(myUdf(Column("col"))), Row(ArrayBuffer(100))) } ``` In SPARK-31826, we've changed the catalyst-to-scala converter from `CatalystTypeConverters` to `ExpressionEncoder.deserializer`. However, `CatalystTypeConverters` supports `WrappedArray` while `ExpressionEncoder.deserializer` doesn't. ### Does this PR introduce _any_ user-facing change? No, SPARK-31826 is merged into master and branch-3.1, which haven't been released. ### How was this patch tested? Added a new test for `WrappedArray` in `UDFSuite`; Also updated `ObjectExpressionsSuite` for `MapObjects`. Closes #29261 from Ngone51/fix-wrappedarray. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-28 12:24:15 +00:00
xuewei.linxuewei	12b9787a7f	[SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize ### What changes were proposed in this pull request? Normally, a null-aware anti join is planned as a BroadcastNestedLoopJoin, which is very time consuming, for instance in TPCH Query 16. ``` select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#45' and p_type not like 'MEDIUM POLISHED%' and p_size in (49, 14, 23, 45, 19, 3, 36, 9) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size ``` The query above is planned as a LeftAnti join with the condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey)). BroadcastNestedLoopJoinExec evaluates it in O(M\*N), but if there is only a single column in the NAAJ, we can always turn the buildSide into a HashSet so that the streamedSide just needs to look up the HashSet, and the computation is reduced to O(M). This optimization only targets the single-column null-aware anti join, because multi-column support is much more complicated; we might be able to support multiple columns in the future. After applying this patch, the TPCH Query 16 execution time decreases from 41 mins to 30 s. The semantics of null-aware anti join are: ![image](https://user-images.githubusercontent.com/17242071/88077041-66a39a00-cbad-11ea-8fb6-c235c4d219b4.png) ### Why are the changes needed? TPCH is a common benchmark for distributed compute engines. All other 21 queries work fine on Spark, except for Query 16; applying this patch will make Spark more competitive among these popular engines. BTW, this patch has restricted rules and only applies to the NAAJ single-column case, which is safe enough. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 1. 
SQLQueryTestSuite with NOT IN keyword SQL, add CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off 2. added case in org.apache.spark.sql.JoinSuite. 3. added case in org.apache.spark.sql.SubquerySuite. 4. Compare performance before and after applying this patch against TPCH Query 16. 5. config combination against e2e test with following ``` Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "false", "spark.sql.codegen.wholeStage" -> "false" ), Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "false", "spark.sql.codegen.wholeStage" -> "true" ), Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "true", "spark.sql.codegen.wholeStage" -> "false" ), Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "true", "spark.sql.codegen.wholeStage" -> "true" ) ``` Closes #29104 from leanken/leanken-SPARK-32290. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-28 04:42:15 +00:00
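The hash-set idea behind SPARK-32290 can be sketched for a single key column. This is a simplified model under stated assumptions, not the physical operator added by the patch: keys are plain Python values with `None` standing in for SQL NULL, the function name is hypothetical, and the three cases follow standard SQL `NOT IN` semantics.

```python
def null_aware_anti_join(stream_keys, build_keys):
    """Return the stream-side keys that satisfy `key NOT IN (build side)`.

    Runs in O(M + N) by probing a set instead of comparing every pair of rows.
    """
    build_keys = list(build_keys)
    if not build_keys:
        # NOT IN over an empty set is true for every row, including NULL keys.
        return list(stream_keys)
    if any(key is None for key in build_keys):
        # A NULL on the build side means NOT IN can never evaluate to true.
        return []
    lookup = set(build_keys)
    # A NULL stream key compares as unknown, so it is filtered out as well.
    return [key for key in stream_keys if key is not None and key not in lookup]


print(null_aware_anti_join([1, 2, None, 4], [2, 3]))
print(null_aware_anti_join([1, 2, None, 4], []))
print(null_aware_anti_join([1, 2, None, 4], [2, None]))
```

The special cases for an empty build side and for NULLs are what make the join "null aware"; a plain anti join would only need the final set lookup.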
Frank Yin	8323c8eb56	[SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans ### What changes were proposed in this pull request? This PR is intended to solve schema pruning not working with window functions, as described in SPARK-32059. It also solved schema pruning not working with `Sort`. It also generalizes with `Project->Filter->[any node can be pruned]`. ### Why are the changes needed? This is needed because of performance issues with nested structures with querying using window functions as well as sorting. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Introduced two tests: 1) optimizer planning level 2) end-to-end tests with SQL queries. Closes #28898 from frankyin-factual/master. Authored-by: Frank Yin <frank@factual.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-28 10:00:21 +09:00
Kent Yao	d315ebf3a7	[SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens ### What changes were proposed in this pull request? The `Seconds.toMicros` API, which is used to convert epoch seconds to microseconds, silently saturates on overflow, as its Javadoc states: ```scala /** * Equivalent to * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}. * @param duration the duration * @return the converted duration, * or {@code Long.MIN_VALUE} if conversion would negatively * overflow, or {@code Long.MAX_VALUE} if it would positively overflow. */ ``` This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)` ### Why are the changes needed? To fix a silent data change between 3.x and 2.x: ``` ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722 bin/spark-sql -S -e "select to_timestamp('300000', 'y');" +294247-01-10 12:00:54.775807 ``` ``` kentyao hulk ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7 bin/spark-sql -S -e "select to_timestamp('300000', 'y');" 284550-10-19 15:58:1010.448384 ``` ### Does this PR introduce _any_ user-facing change? Yes, we will raise `ArithmeticException` instead of giving the wrong answer if overflow happens. ### How was this patch tested? Added a unit test. Closes #29220 from yaooqinn/SPARK-32424. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-27 17:03:14 +00:00
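The behavioral difference SPARK-32424 relies on (a saturating conversion that clamps to the 64-bit range versus an exact multiplication that fails on overflow) can be modeled in a few lines. Python integers are unbounded, so the 64-bit limits are emulated explicitly; the function names are hypothetical and `OverflowError` stands in for Java's `ArithmeticException`.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
MICROS_PER_SECOND = 1_000_000


def to_micros_saturating(seconds: int) -> int:
    """Clamp the product to the signed 64-bit range, hiding any overflow."""
    return max(INT64_MIN, min(INT64_MAX, seconds * MICROS_PER_SECOND))


def to_micros_exact(seconds: int) -> int:
    """Raise instead of returning a clamped (and therefore wrong) value."""
    micros = seconds * MICROS_PER_SECOND
    if not INT64_MIN <= micros <= INT64_MAX:
        raise OverflowError("long overflow")
    return micros


huge = 10**13  # seconds; the product 10**19 exceeds the signed 64-bit maximum
print(to_micros_saturating(huge) == INT64_MAX)
try:
    to_micros_exact(huge)
except OverflowError as error:
    print("rejected:", error)
```

The saturating variant returns a plausible-looking but incorrect timestamp, which is the silent data change the commit removes.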
Kent Yao	d3596c04b0	[SPARK-32406][SQL] Make RESET syntax support single configuration reset ### What changes were proposed in this pull request? This PR extends the RESET command to support resetting SQL configurations one by one. ### Why are the changes needed? Currently, the RESET command only supports restoring all of the runtime configurations to their defaults. In most cases, users do not want this, but just want to restore one or a small group of settings. The SET command can serve as a workaround, but you have to keep the defaults in mind or in temporary variables, which is not very convenient. Hive supports this: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample `reset <key>`: Resets the value of a particular configuration variable (key) to the default value. Note: If you misspell the variable name, Beeline will not show an error. PostgreSQL supports this too: https://www.postgresql.org/docs/9.1/sql-reset.html ### Does this PR introduce _any_ user-facing change? Yes, RESET can restore a single configuration now. ### How was this patch tested? Added new unit tests. Closes #29202 from yaooqinn/SPARK-32406. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-24 09:13:26 -07:00
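The difference between resetting everything and resetting one key (SPARK-32406) can be pictured with a toy configuration store. This is a deliberately simplified model, not Spark's `RESET` implementation: the class, its methods, and the default values used below are hypothetical, and details such as static or session-initial configs are ignored.

```python
class RuntimeConf:
    """Toy config store: defaults plus per-session overrides."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def set(self, key, value):
        self._overrides[key] = value

    def get(self, key):
        return self._overrides.get(key, self._defaults.get(key))

    def reset(self, key=None):
        if key is None:
            self._overrides.clear()         # plain RESET: drop every override
        else:
            self._overrides.pop(key, None)  # RESET <key>: drop just this one


conf = RuntimeConf({"a.enabled": "true", "b.size": "200"})  # hypothetical keys
conf.set("a.enabled", "false")
conf.set("b.size", "8")
conf.reset("b.size")
print(conf.get("a.enabled"), conf.get("b.size"))
```

Resetting a single key leaves the other override in place, so the user no longer has to remember the default value and re-`SET` it by hand.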
Liang-Chi Hsieh	84efa04c57	[SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase ### What changes were proposed in this pull request? Currently the by-name resolution logic of `unionByName` is put in API code. This patch moves the logic to analysis phase. See https://github.com/apache/spark/pull/28996#discussion_r453460284. ### Why are the changes needed? Logically we should do resolution in analysis phase. This refactoring cleans up API method and makes consistent resolution. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #29107 from viirya/move-union-by-name. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-24 04:33:18 +00:00
Max Gekk	658e87471c	[SPARK-30648][SQL][FOLLOWUP] Refactoring of JsonFilters: move config checking out ### What changes were proposed in this pull request? Refactoring of `JsonFilters`: - Add an assert to the `skipRow` method to check the input `index` - Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`. ### Why are the changes needed? 1. The assert should catch incorrect usage of `JsonFilters` 2. Moving the config checking out of `JsonFilters` makes it consistent with `OrderedFilters` (see https://github.com/apache/spark/pull/29145). 3. `JsonFilters` can be used by other datasources in the future and does not depend on the JSON configs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*" $ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*" ``` Closes #29206 from MaxGekk/json-filters-pushdown-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-24 09:54:11 +09:00
Sean Owen	be2eca22e9	[SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+ ### What changes were proposed in this pull request? Updates to scalatest 3.2.0. Though it looks large, it is 99% changes to the new location of scalatest classes. ### Why are the changes needed? 3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility. ### Does this PR introduce _any_ user-facing change? No, only affects tests. ### How was this patch tested? Existing tests. Closes #29196 from srowen/SPARK-32398. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-23 16:20:17 -07:00
Terry Kim	35345e30e5	[SPARK-32374][SQL] Disallow setting properties when creating temporary views ### What changes were proposed in this pull request? Currently, you can specify properties when creating a temporary view. However, the specified properties are not used and can be misleading. This PR proposes to disallow specifying properties when creating temporary views. ### Why are the changes needed? To avoid confusion by disallowing specifying unused properties. ### Does this PR introduce _any_ user-facing change? Yes, now if you create a temporary view with properties, the operation will fail: ``` scala> sql("CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1") org.apache.spark.sql.catalyst.parser.ParseException: Operation not allowed: CREATE TEMPORARY VIEW ... TBLPROPERTIES (property_name = property_value, ...)(line 1, pos 0) == SQL == CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1 ^^^ ``` ### How was this patch tested? Added tests Closes #29167 from imback82/disable_properties_temp_view. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 14:32:10 +00:00
yi.wu	a8e3de36e7	[SPARK-32280][SPARK-32372][SQL] ResolveReferences.dedupRight should only rewrite attributes for ancestor nodes of the conflict plan ### What changes were proposed in this pull request? This PR refactors `ResolveReferences.dedupRight` to make sure it only rewrites attributes for ancestor nodes of the conflict plan. ### Why are the changes needed? This is a bug fix. ```scala sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name") .createOrReplaceTempView("person_a") sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name") .createOrReplaceTempView("person_b") sql("SELECT * FROM person_a UNION SELECT * FROM person_b") .createOrReplaceTempView("person_c") sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show() ``` When executing the above query, we'll hit the error: ```scala [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) avg_age#231 missing from name#223,avg_age#218,id#232,age#234,name#233 in operator !Project [name#233, avg_age#231]. Attribute(s) with the same name appear in the operation: avg_age. Please check if the right attribute(s) are used.;; ... ``` The plan below is the problematic plan, which is the right plan of a `Join` operator, and it contains plans that conflict with the left plan. In this problematic plan, the first `Aggregate` operator (the one under the first child of `Union`) is a conflict plan with respect to the left one and has the rewrite attribute pair `avg_age#218` -> `avg_age#231`. With the current `dedupRight` logic, we first replace this `Aggregate` with a new one, and then rewrite the attribute `avg_age#218` from the bottom up. As you can see, projects with the attribute `avg_age#218` in the second child of the `Union` can also be replaced with `avg_age#231` (that means we also rewrite attributes for non-ancestor plans of the conflict plan). 
Ideally, the attribute `avg_age#218` in the second `Aggregate` operator (the one under the second child of `Union`) should also be replaced. But it didn't because it's an `Alias` while we only rewrite `Attribute` yet. Therefore, the project above the second `Aggregate` becomes unresolved. ```scala :  : +- SubqueryAlias p2 +- SubqueryAlias person_c +- Distinct +- Union :- Project [name#233, avg_age#231] : +- SubqueryAlias person_a : +- Aggregate [name#233], [name#233, avg(cast(age#234 as bigint)) AS avg_age#231] : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- Project [name#233 AS name#227, avg_age#231 AS avg_age#228] +- Project [name#233, avg_age#231] +- SubqueryAlias person_b +- !Project [name#233, avg_age#231] +- Join Inner, (name#233 = name#223) :- SubqueryAlias p1 : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- SubqueryAlias p2 +- SubqueryAlias person_a +- Aggregate [name#223], [name#223, avg(cast(age#224 as bigint)) AS avg_age#218] +- SubqueryAlias person +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS 
id#222, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#223, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#224] +- ExternalRDD [obj#165] ``` ### Does this PR introduce _any_ user-facing change? Yes, users would no longer hit the error after this fix. ### How was this patch tested? Added test. Closes #29166 from Ngone51/impr-dedup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 14:24:47 +00:00
Wenchen Fan	aa54dcf193	[SPARK-32251][SQL][TESTS][FOLLOWUP] improve SQL keyword test ### What changes were proposed in this pull request? Improve the `SQLKeywordSuite` so that: 1. it checks keywords under default mode as well 2. it checks if there are typos in the doc (found one and fixed in this PR) ### Why are the changes needed? better test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #29200 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 14:02:38 +00:00
Dongjoon Hyun	aed8dbab1d	[SPARK-32364][SQL][FOLLOWUP] Add toMap to return originalMap and documentation ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/29160. We already removed the non-determinism. This PR aims to do the following for the existing code base. 1. Add explicit documentation to `DataFrameReader/DataFrameWriter`. 2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`. 3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`. ```scala - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap + val params = extraOptions ++ connectionProperties.asScala ``` 4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later. ```scala - val options = sessionOptions ++ extraOptions + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap val dsOptions = new CaseInsensitiveStringMap(options.asJava) ``` ### Why are the changes needed? `extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins or GitHub Action with the existing tests and the newly added test case in `JDBCSuite`. Closes #29191 from dongjoon-hyun/SPARK-32364-3. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-23 06:28:08 -07:00
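The point of having `toMap` return the original map (SPARK-32364 follow-up) is that lookups stay case-insensitive while the caller still gets back the key spelling it supplied. A minimal Python analogue, assuming a hypothetical class that is unrelated to Spark's Scala `CaseInsensitiveMap` beyond the idea:

```python
class CaseInsensitiveMap:
    """Look keys up ignoring case, but remember the original key spelling."""

    def __init__(self, original):
        self.original_map = dict(original)
        self._lowered = {key.lower(): value for key, value in self.original_map.items()}

    def get(self, key, default=None):
        return self._lowered.get(key.lower(), default)

    def __contains__(self, key):
        return key.lower() in self._lowered

    def to_map(self):
        # Hand back the case-preserving view rather than the lower-cased one.
        return dict(self.original_map)


options = CaseInsensitiveMap({"Path": "/tmp/data", "fetchSize": "100"})  # hypothetical options
print(options.get("path"), "FETCHSIZE" in options)
print(options.to_map())
```

Code that later treats the options as case-sensitive then sees `Path` and `fetchSize` exactly as the user wrote them, rather than lower-cased variants.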
LantaoJin	182566bf57	[SPARK-32237][SQL] Resolve hint in CTE ### What changes were proposed in this pull request? This PR moves the `Substitution` rule before the `Hints` rule in `Analyzer` so that a hint inside a CTE is resolved. ### Why are the changes needed? The SQL below throws an AnalysisException in Spark 3.0, but it works in Spark 2.x: ```sql WITH cte AS (SELECT /*+ REPARTITION(3) */ T.id, T.data FROM $t1 T) SELECT cte.id, cte.data FROM cte ``` ``` Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`cte.id`' given input columns: [cte.data, cte.id]; line 3 pos 7; 'Project ['cte.id, 'cte.data] +- SubqueryAlias cte +- Project [id#21L, data#22] +- SubqueryAlias T +- SubqueryAlias testcat.ns1.ns2.tbl +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl 'Project ['cte.id, 'cte.data] +- SubqueryAlias cte +- Project [id#21L, data#22] +- SubqueryAlias T +- SubqueryAlias testcat.ns1.ns2.tbl +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a unit test Closes #29062 from LantaoJin/SPARK-32237. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 03:10:45 +00:00
Takuya UESHIN	46169823c0	[SPARK-30616][SQL][FOLLOW-UP] Use only config key name in the config doc ### What changes were proposed in this pull request? This is a follow-up of #28852. This PR uses only the config key name in the doc; otherwise the doc for the config entry shows the entire details of the referenced configs. ### Why are the changes needed? The doc for the newly introduced config entry shows the entire details of the referenced configs. ### Does this PR introduce _any_ user-facing change? The doc for the config entry will show only the referenced config keys. ### How was this patch tested? Existing tests. Closes #29194 from ueshin/issues/SPARK-30616/fup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 03:07:30 +00:00
ulysses	184074de22	[SPARK-31999][SQL] Add REFRESH FUNCTION command ### What changes were proposed in this pull request? In Hive mode, permanent functions are shared with the Hive metastore, so functions may be modified by other Hive clients. In a long-lived Spark scenario, it is hard to pick up such function changes, for 2 reasons: * Spark caches the function in memory using `FunctionRegistry`. * The user may not know the location or class name of the UDF when using `replace function`. Note that we use the v2 command code path to add the new command. ### Why are the changes needed? To give an easy way to sync the Spark function registry with the Hive metastore. Then we can call ``` refresh function functionName ``` ### Does this PR introduce _any_ user-facing change? Yes, new command. ### How was this patch tested? New UT. Closes #28840 from ulysses-you/SPARK-31999. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-22 19:05:50 +00:00
Dongjoon Hyun	cd16a10475	[SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options ### What changes were proposed in this pull request? When a user has multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is a `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally. ### Why are the changes needed? As shown below, DataFrame's `option/options` have been non-deterministic with respect to case-insensitive keys because the options are stored in `extraOptions`, which uses the `HashMap` class. ```scala spark.read .option("paTh", "1") .option("PATH", "2") .option("Path", "3") .option("patH", "4") .load("5") ... org.apache.spark.sql.AnalysisException: Path does not exist: file:/.../1; ``` ### Does this PR introduce _any_ user-facing change? Yes. However, this is a bug fix for the non-deterministic cases. ### How was this patch tested? Pass the Jenkins or GitHub Action with newly added test cases. Closes #29160 from dongjoon-hyun/SPARK-32364. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-22 07:58:45 -07:00
Takeshi Yamamuro	04bf3511f1	[SPARK-21117][SQL][FOLLOWUP] Define prettyName for WidthBucket ### What changes were proposed in this pull request? This PR defines `prettyName` for `WidthBucket`, following gatorsmile's suggestion: https://github.com/apache/spark/pull/28764#discussion_r457802957 ### Why are the changes needed? For a better name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29183 from maropu/SPARK-21117-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-22 02:51:30 -07:00
Cheng Su	39181ff209	[SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable ### What changes were proposed in this pull request? This is based on a follow-up comment in https://github.com/apache/spark/pull/28123, which noted that we can coalesce buckets for shuffled hash join as well. Note that we only coalesce the buckets on the stream side of the shuffled hash join (i.e. the side not building the hash map), so we don't need to worry about OOM when coalescing multiple buckets in one task for building the hash map. Refactor the existing physical plan rule `CoalesceBucketsInSortMergeJoin` to `CoalesceBucketsInJoin`, to cover shuffled hash join as well. Refactor the existing unit test `CoalesceBucketsInSortMergeJoinSuite` to `CoalesceBucketsInJoinSuite`, to cover shuffled hash join as well. ### Why are the changes needed? Avoiding shuffle when joining differently bucketed tables is also useful for shuffled hash join. In production, we are seeing users use shuffled hash join to join bucketed tables (setting `spark.sql.join.preferSortMergeJoin`=false to avoid sort), and this can help avoid shuffle if the numbers of buckets are not the same. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests in `CoalesceBucketsInJoinSuite` for verifying the shuffled hash join physical plan. ### Performance number per request from maropu I was looking at TPCDS per suggestion from maropu, but I found that most of the TPCDS queries do aggregation, and only several do joins. None of the input tables are bucketed. So I took the approach of testing a modified version of `TPCDS q93` as ``` SELECT ss_ticket_number, sr_ticket_number FROM store_sales JOIN store_returns ON ss_ticket_number = sr_ticket_number ``` and making `store_sales` and `store_returns` bucketed tables. 
Physical query plan without coalesce: ``` ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft :- Exchange hashpartitioning(ss_ticket_number#109L, 4), true, [id=#67] : +- (1) Project [ss_ticket_number#109L] : +- (1) Filter isnotnull(ss_ticket_number#109L) : +- (1) ColumnarToRow : +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2 +- (2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130] +- (2) Filter isnotnull(sr_ticket_number#120L) +- (2) ColumnarToRow +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4 
``` Physical query plan with coalesce: ``` ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft :- (1) Project [ss_ticket_number#109L] : +- (1) Filter isnotnull(ss_ticket_number#109L) : +- (1) ColumnarToRow : +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2 +- (2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130] +- (2) Filter isnotnull(sr_ticket_number#120L) +- (2) ColumnarToRow +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4 (Coalesced to 2) ``` Run time improvement as 50% of wall clock time: ``` 
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ shuffle hash join coalesce bucket off 1541 1664 106 1.9 535.1 1.0X shuffle hash join coalesce bucket on 1060 1169 81 2.7 368.1 1.5X ``` Closes #29079 from c21/split-bucket. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-22 08:56:26 +09:00
Dongjoon Hyun	8c7d6f9733	[SPARK-32377][SQL] CaseInsensitiveMap should be deterministic for addition ### What changes were proposed in this pull request? This PR aims to fix `CaseInsensitiveMap` to be deterministic for addition. ### Why are the changes needed? ```scala import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap var m = CaseInsensitiveMap(Map.empty[String, String]) Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", "5")).foreach { kv => m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]] println(m.get("path")) } ``` BEFORE ``` Some(1) Some(2) Some(3) Some(4) Some(1) ``` AFTER ``` Some(1) Some(2) Some(3) Some(4) Some(5) ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug fix on non-deterministic behavior. ### How was this patch tested? Pass the newly added test case. Closes #29172 from dongjoon-hyun/SPARK-32377. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-20 22:20:16 -07:00
gengjiaan	02114f96d6	[SPARK-32365][SQL] Add a boundary condition for negative index in regexp_extract ### What changes were proposed in this pull request? The current implementation of regexp_extract throws an unhandled exception for a negative group index, as shown below: SELECT regexp_extract('1a 2b 14m', '\\d+', -1) ``` java.lang.IndexOutOfBoundsException: No group -1 java.util.regex.Matcher.group(Matcher.java:538) org.apache.spark.sql.catalyst.expressions.RegExpExtract.nullSafeEval(regexpExpressions.scala:455) org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:704) org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52) org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45) ``` ### Why are the changes needed? To fix the bug `java.lang.IndexOutOfBoundsException: No group -1`. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? New UT. Closes #29161 from beliefer/regexp_extract-group-not-allow-less-than-zero. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-20 20:34:51 -07:00
Terry Kim	e0ecb66f53	[SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning ### What changes were proposed in this pull request? Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`. For example, ```Scala spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500") val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1") val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2") val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3") val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4") // join1 is a sort merge join. val join1 = t1.join(t2, t1("i1") === t2("i2")) // join2 is a broadcast join where t3 is broadcasted. val join2 = join1.join(t3, join1("i1") === t3("i3")) // Join on the column from the broadcasted side (i3). val join3 = join2.join(t4, join2("i3") === t4("i4")) join3.explain ``` You see that `Exchange hashpartitioning(i2#103, 200)` is introduced because there is no output partitioning info from the build side. 
``` == Physical Plan == (6) SortMergeJoin [i3#29], [i4#40], Inner :- (4) Sort [i3#29 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i3#29, 200), true, [id=#55] : +- (3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight : :- (3) SortMergeJoin [i1#7], [i2#18], Inner : : :- (1) Sort [i1#7 ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(i1#7, 200), true, [id=#28] : : : +- LocalTableScan [i1#7, j1#8] : : +- (2) Sort [i2#18 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i2#18, 200), true, [id=#29] : : +- LocalTableScan [i2#18, j2#19] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34] : +- LocalTableScan [i3#29, j3#30] +- (5) Sort [i4#40 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(i4#40, 200), true, [id=#39] +- LocalTableScan [i4#40, j4#41] ``` This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s. There is a new internal config `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which can limit the number of partitioning a `HashPartitioning` can expand to. It can be set to "0" to disable this feature. ### Why are the changes needed? To remove unnecessary shuffle. ### Does this PR introduce _any_ user-facing change? 
Yes, now the shuffle in the above example can be eliminated: ``` == Physical Plan == (5) SortMergeJoin [i3#108], [i4#119], Inner :- (3) Sort [i3#108 ASC NULLS FIRST], false, 0 : +- (3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight : :- (3) SortMergeJoin [i1#86], [i2#97], Inner : : :- (1) Sort [i1#86 ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(i1#86, 200), true, [id=#120] : : : +- LocalTableScan [i1#86, j1#87] : : +- (2) Sort [i2#97 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i2#97, 200), true, [id=#121] : : +- LocalTableScan [i2#97, j2#98] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126] : +- LocalTableScan [i3#108, j3#109] +- (4) Sort [i4#119 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(i4#119, 200), true, [id=#130] +- LocalTableScan [i4#119, j4#120] ``` ### How was this patch tested? Added new tests. Closes #28676 from imback82/broadcast_join_output. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-20 14:25:51 +00:00
Gengliang Wang	d0c83f372b	[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning. It's a good improvement, however, converting all the predicates in CNF can lead to a very long result, even with grouping functions over expressions. For example, for the following predicate ``` (p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20') ``` will be converted into a long query(130K characters) in Hive metastore, and there will be error: ``` javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ... ``` Essentially, we just need to traverse predicate and extract the convertible sub-predicates like what we did in https://github.com/apache/spark/pull/24598. There is no need to maintain the CNF result set. ### Why are the changes needed? 
A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicate is always equal to or shorter than the CNF result. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #29101 from gengliangwang/pushJoin. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-20 14:17:31 +00:00
Anton Okolnychyi	0aca1a6ed4	[SPARK-32276][SQL] Remove redundant sorts before repartition nodes ### What changes were proposed in this pull request? This PR proposes to remove redundant sorts before repartition nodes whenever the data is ordered after the repartitioning. ### Why are the changes needed? It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes that don't affect the final output ordering. It seems safe to perform the following rewrites: - `Sort -> Repartition -> Sort -> Scan` as `Sort -> Repartition -> Scan` - `Sort -> Repartition -> Project -> Sort -> Scan` as `Sort -> Repartition -> Project -> Scan` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? More test cases. Closes #29089 from aokolnychyi/spark-32276. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-19 12:11:26 -07:00

4591 commits