### What changes were proposed in this pull request?
The implementation for the save operation of RocksDBFileManager.
### Why are the changes needed?
Save all the files in the given local checkpoint directory as a committed version in DFS.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT added.
Closes #32582 from xuanyuanking/SPARK-35436.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Currently, Spark eagerly executes commands on the caller side of `QueryExecution`, which is a bit hacky as `QueryExecution` is not aware of it and leads to confusion.
For example, if you run `sql("show tables").collect()`, you will see two queries with identical query plans in the web UI.
![image](https://user-images.githubusercontent.com/3182036/121193729-a72d0480-c8a0-11eb-8b12-379019607ad5.png)
![image](https://user-images.githubusercontent.com/3182036/121193822-bc099800-c8a0-11eb-9d2a-34ab1329e2f7.png)
![image](https://user-images.githubusercontent.com/3182036/121193845-c0ce4c00-c8a0-11eb-96d0-ef604a4dfab0.png)
The first query is triggered at `Dataset.logicalPlan`, which eagerly executes the command.
The second query is triggered at `Dataset.collect`, which is the normal query execution.
From the web UI, it's hard to tell that these two queries are caused by eager command execution.
This PR proposes to move the eager command execution to `QueryExecution`, and turn the command plan into `CommandResult` to indicate that the command has already been executed. Now `sql("show tables").collect()` still triggers two queries, but the query plans are not identical. The second query becomes:
![image](https://user-images.githubusercontent.com/3182036/121194850-b3659180-c8a1-11eb-9abf-2980f84f089d.png)
In addition to the UI improvements, this PR also has other benefits:
1. It simplifies the code, as the caller side no longer needs to worry about eager command execution; `QueryExecution` takes care of it.
2. It helps https://github.com/apache/spark/pull/32442, where there can be more plan nodes above commands, and we need to replace commands with something like a local relation that produces unsafe rows.
### Why are the changes needed?
Explained above.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #32513 from beliefer/SPARK-35378.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refactor LocalDateTimeUDT as YearUDT in UserDefinedTypeSuite
### Why are the changes needed?
As we are going to support `java.time.LocalDateTime` as an external type of the TimestampWithoutTZ type in https://github.com/apache/spark/pull/32814, registering `java.time.LocalDateTime` as a UDT will cause test failures: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139469/testReport/
This PR is to unblock https://github.com/apache/spark/pull/32814.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #32824 from gengliangwang/UDTFollowUp.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that both key and value of state schema cannot accept long length (>65535 bytes) JSON.
To solve the problem explained below, the JSON-represented schema is divided into chunks of at most 65535 bytes, and each chunk is written with `DataOutputStream.writeUTF`.
As the solution changes the format of the schema, the version is also changed from `1` to `2`, but the old version schema is still accepted to ensure backward compatibility.
### Why are the changes needed?
In the current implementation, writing state schema fails if the length of schema exceeds 65535 bytes and `UTFDataFormatException` is thrown.
It's due to the limitation of `DataOutputStream.writeUTF`.
`writeUTF` first writes a 2-byte length field, so the maximum content length is limited to `2^16-1` = `65535` bytes.
https://docs.oracle.com/javase/8/docs/api/java/io/DataOutputStream.html#writeUTF-java.lang.String-
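A minimal sketch of the chunked-write idea (a hypothetical helper, not the exact Spark code; the real fix also bumps the schema format version and pairs this with a matching chunked read path):
```scala
import java.io.DataOutputStream

def writeSchemaJson(out: DataOutputStream, json: String): Unit = {
  // Conservative chunk size in characters so each chunk's UTF-8 encoding stays under
  // writeUTF's 65535-byte limit even if every character needs 3 bytes.
  val maxCharsPerChunk = 65535 / 3
  val chunks = json.grouped(maxCharsPerChunk).toSeq
  out.writeInt(chunks.length)   // record how many writeUTF blocks follow
  chunks.foreach(out.writeUTF)  // each chunk now fits within writeUTF's 2-byte length field
}
```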
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #32788 from sarutak/fix-UTFDataFormatException.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Fix Scala doc for removed parameters for `InvokeLike.invoke`.
### Why are the changes needed?
#32532 forgot to update the Scala doc after removing 2 parameters for `InvokeLike.invoke`. This fixes it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #32827 from sunchao/SPARK-35384-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch removes the usage of putting null into StateStore.
### Why are the changes needed?
According to the `get` method doc in the `StateStore` API, it returns a non-null row if the key exists. So basically we should avoid writing null to a `StateStore`: you cannot distinguish whether a returned null row means the key doesn't exist or the value is actually null. And given the defined behavior of `get`, it is quite easy to cause an NPE if the caller believes the key exists and doesn't expect to get a null.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added test.
Closes #32796 from viirya/fix-ss-joinstatemanager.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This patch introduces a new option, minOffsetsPerTrigger, to specify the minimum number of offsets to read per trigger, along with maxTriggerDelay to avoid waiting indefinitely for a trigger.
The new option allows skipping a trigger/batch when the number of records available in Kafka is low. This is a very useful feature in cases where we have a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day.
The 'maxTriggerDelay' option helps avoid an infinite delay in scheduling a trigger: the trigger will fire irrespective of the number of available records once maxTriggerDelay has elapsed since the last trigger. It is an optional parameter with a default value of 15 minutes, and it only applies if minOffsetsPerTrigger is set.
minOffsetsPerTrigger itself is of course optional, but once specified it takes precedence over maxOffsetsPerTrigger, which is honored only after minOffsetsPerTrigger is satisfied.
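The following sketch shows how these options would be set on a Kafka source (option names as described above; server, topic, and values are illustrative):
```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .option("minOffsetsPerTrigger", "1000000") // skip the trigger until at least this many offsets are available
  .option("maxTriggerDelay", "15m")          // ...but never wait longer than this since the last trigger
  .option("maxOffsetsPerTrigger", "5000000") // upper bound, honored once minOffsetsPerTrigger is satisfied
  .load()
```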
### Why are the changes needed?
There are many scenarios where there is a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. Tuning such jobs is difficult: decreasing the trigger processing time increases the number of batches, and hence cluster resource usage, and adds to small-file issues, while increasing the trigger processing time adds consumer lag. This patch tries to address this issue.
### How was this patch tested?
This patch was tested by adding test cases as well as manually on a cluster where the job was running for a full one day with a data burst happening once a day.
Here is a picture of the data burst and the resulting consumer lag:
<img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png">
This is how the job behaved at burst time running every 4.5 mins (which is the specified trigger time):
<img width="1154" alt="Burst Time" src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png">
This is the job behavior during non-burst time, where it skips 2 to 3 triggers and runs once every 9 to 13.5 minutes:
<img width="1154" alt="Non Burst Time" src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png">
Here are some more stats from the two runs, i.e. one normal run and the other with minOffsetsPerTrigger set:
| Run | Data Size | Number of Batch Runs | Number of Files |
| ------------- | ------------- |------------- |------------- |
| Normal Run | 54.2 GB | 320 | 21968 |
| Run with minOffsetsperTrigger | 54.2 GB | 120 | 12104 |
Closes #32653 from satishgopalani/SPARK-35312.
Authored-by: Satish Gopalani <satish.gopalani@pubmatic.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Cleanup unreachable code.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32791 from pan3793/cleanup.
Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Extend Catalyst's type system with a new type that conforms to the SQL standard (see SQL:2016, section 4.6.2): TimestampWithoutTZType, which represents the timestamp without time zone type.
### Why are the changes needed?
Spark SQL today supports the TIMESTAMP data type. However the semantics provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. Timestamps embedded in a SQL query or passed through JDBC are presumed to be in session local timezone and cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with calendars.
In many (more) other cases, such as when dealing with log files it is desirable that the provided timestamps not be altered.
SQL users expect that they can model either behavior and do so by using TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, so users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist in the standard.
In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for the standard semantics.
Using these two types will provide clarity.
This is a starting PR. See more details in https://issues.apache.org/jira/browse/SPARK-35662
### Does this PR introduce _any_ user-facing change?
Yes, a new data type for Timestamp without time zone type. It is still in development.
### How was this patch tested?
Unit test
Closes #32802 from gengliangwang/TimestampNTZType.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's a long-standing bug that we forgot to resolve `UnresolvedAlias` in `CollectMetrics`. It's a bit hard to trigger this bug before 3.2, as most likely people won't create `UnresolvedAlias` when calling `Dataset.observe`. However, things have changed after https://github.com/apache/spark/pull/30974
This PR proposes to handle `CollectMetrics` in the rule `ResolveAliases`.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
updated test
Closes #32803 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Optimizes the retrieval of approximate quantiles for an array of percentiles.
* Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
* Modifies the ApproximatePercentiles operator to call into the new method.
All formatting changes are the result of running ./dev/scalafmt
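For reference, the SQL-level entry point that maps to the ApproximatePercentile operator and benefits from the single-pass lookup (table and column names are hypothetical):
```scala
spark.sql(
  "SELECT percentile_approx(latency_ms, array(0.25, 0.5, 0.75), 10000) FROM events"
).show()
```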
### Why are the changes needed?
The existing implementation does repeated calls per input percentile resulting in redundant computation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests for the new method.
Closes #32700 from alkispoly-db/spark_35558_approx_quants_array.
Authored-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Override `getJDBCType` method in `MySQLDialect` so that `FloatType` is mapped to `FLOAT` instead of `REAL`
### Why are the changes needed?
MySQL treats `REAL` as a synonym to `DOUBLE` by default (see https://dev.mysql.com/doc/refman/8.0/en/numeric-types.html). Therefore, when creating a table with a column of `REAL` type, it will be created as `DOUBLE`. However, currently, `MySQLDialect` does not provide an implementation for `getJDBCType`, and will thus ultimately fall back to `JdbcUtils.getCommonJDBCType`, which maps `FloatType` to `REAL`. This change is needed so that we can properly map the `FloatType` to `FLOAT` for MySQL.
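A minimal sketch of the mapping idea, expressed as a custom dialect rather than the actual `MySQLDialect` patch (the object name is hypothetical):
```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, FloatType}

object MySQLFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case FloatType => Some(JdbcType("FLOAT", java.sql.Types.FLOAT)) // instead of the common REAL mapping
    case _         => None // fall back to JdbcUtils.getCommonJDBCType
  }
}

// JdbcDialects.registerDialect(MySQLFloatDialect)
```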
### Does this PR introduce _any_ user-facing change?
Prior to this PR, when writing a dataframe with a `FloatType` column to a MySQL table, it will create a `DOUBLE` column. After the PR, it will create a `FLOAT` column.
### How was this patch tested?
Added a test case in `JDBCSuite` that verifies the mapping.
Closes #32605 from mariosmeim-db/SPARK-35446.
Authored-by: Marios Meimaris <marios.meimaris@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A followup for 345d35ed1a. In this PR we support CURRENT_USER without trailing parentheses in default mode. In ANSI mode, CURRENT_USER can only be used without trailing parentheses, because it is a reserved keyword that cannot be used as a function name.
### Why are the changes needed?
1. make it the same as current_date/current_timestamp
2. better ANSI compliance
### Does this PR introduce _any_ user-facing change?
no, just a followup
### How was this patch tested?
new tests
Closes #32770 from yaooqinn/SPARK-21957-F.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is to fix the `UnsupportedOperationException` described in [PR#32705](https://github.com/apache/spark/pull/32705).
When AQE and DPP are enabled at the same time, the `BroadcastExchange` included in the DPP filter is not added through the `EnsureRequirements` rule. Therefore, when AQE re-optimizes the DPP filter, the `reOptimize` method has no way to add the `BroadcastExchange` back through `EnsureRequirements`, which eventually leads to the loss of the `BroadcastExchange` in the final physical plan. This PR adds a `BroadcastExchange` node in the `reOptimize` method if the current plan is a DPP filter.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
adding new ut
Closes #32741 from JkSelf/fixDPP+AQEbug.
Authored-by: Ke Jia <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a database-exists check in `SessionCatalog`.
### Why are the changes needed?
Currently, executing `drop database test` throws an unfriendly error message:
```
Error in query: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
at org.apache.spark.sql.hive.HiveExternalCatalog.dropDatabase(HiveExternalCatalog.scala:200)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropDatabase(ExternalCatalogWithListener.scala:53)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropDatabase(SessionCatalog.scala:273)
at org.apache.spark.sql.execution.command.DropDatabaseCommand.run(ddl.scala:111)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3707)
```
### Does this PR introduce _any_ user-facing change?
Yes, a cleaner error message.
### How was this patch tested?
Add test.
Closes #32768 from ulysses-you/SPARK-35629.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Sets `references` for `NamedLambdaVariable` and `LambdaFunction`.
| Expression | NamedLambdaVariable | LambdaFunction |
| --- | --- | --- |
| References before | None | All function references |
| References after | self.toAttribute | Function references minus arguments' references |
In `NestedColumnAliasing`, this means that `ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...)` now references both `attr` and `lv`, rather than just `attr`. As a result, it will not be included in the nested column references.
### Why are the changes needed?
Before, the lambda key was referenced outside of the lambda function.
#### Example 1
Before:
```
Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0]
+- 'Join Cross
:- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
: +- LocalRelation <empty>, [kvs#0]
+- LocalRelation <empty>, [keys#0]
```
After:
```
Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419]
+- Join Cross
:- LocalRelation <empty>, [kvs#417]
+- LocalRelation <empty>, [keys#418]
```
#### Example 2
Before:
```
Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#0, _extract_v1#0 AS _extract_v1#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0]
+- LocalRelation <empty>, [kvs#0, keys#0]
```
After:
```
Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#428, kvs#427]
+- GlobalLimit 5
+- LocalLimit 5
+- LocalRelation <empty>, [kvs#427, keys#428]
```
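A query of roughly this shape reproduces Example 1 above (column names and schemas are assumptions reconstructed from the plans; intended for spark-shell):
```scala
import spark.implicits._

case class V(v1: Int)
val left  = Seq.empty[Map[String, V]].toDF("kvs")
val right = Seq.empty[Seq[String]].toDF("keys")

left.crossJoin(right)
  .selectExpr("transform(keys, key -> kvs[key].v1) AS a")
  .explain(true)
```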
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Scala unit tests for the examples above
Closes #32773 from karenfeng/SPARK-35636.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds In/InSet predicate support for `UnwrapCastInBinaryComparison`.
The current implementation doesn't push down filters for `In`/`InSet` predicates that contain a `Cast`.
For instance:
```scala
spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
```
before this pr:
```
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
after this pr:
```
== Physical Plan ==
*(1) Filter id#95 IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct<id:int>
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes #32488 from cfmcgrady/SPARK-35316.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR removes the `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin`.
### Why are the changes needed?
We can always remove the outer join if it only has DISTINCT on the streamed side.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #32744 from wangyum/SPARK-34808-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`.
### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes #32694 from beliefer/SPARK-35059.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we do not have a suitable definition of the `user` concept in Spark. We only have an application-wide `sparkUser`, but do not support identifying or retrieving the user information from a session in STS or from a runtime query execution.
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
This PR adds `current_user()` as a SQL function and defines it without ambiguity (see the usage sketch after the list below).
1. For a normal single-threaded Spark application, clearly the `sparkUser` is always equivalent to `current_user()` .
2. For a multi-threaded Spark application, e.g. Spark thrift server, we use a `ThreadLocal` variable to store the client-side user(after authenticated) before running the query and retrieve it in the parser.
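Usage sketch of the function added above (the returned value depends on the authenticated user of the session):
```scala
spark.sql("SELECT current_user()").show()
```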
### Why are the changes needed?
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
### Does this PR introduce _any_ user-facing change?
yes, added `current_user()` as a SQL function
### How was this patch tested?
new tests in thrift server and sql/catalyst
Closes #32718 from yaooqinn/SPARK-21957.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add rule `ConvertToLocalRelation` into AQE Optimizer.
### Why are the changes needed?
Support propagating an empty local relation through Project and Filter, for an SQL case like:
```
Aggregate
Project
Join
ShuffleStage
ShuffleStage
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes #32724 from ulysses-you/SPARK-35585.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The condition check for FULL OUTER sort merge join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1368 ) makes an unnecessary trip when `leftIndex == leftMatches.size` or `rightIndex == rightMatches.size`. Though this does not affect correctness (`scanNextInBuffered()` returns false anyway), we can avoid it in the first place.
### Why are the changes needed?
Better readability for developers and avoid unnecessary execution.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests, such as `OuterJoinSuite.scala`.
Closes #32736 from c21/join-bug.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
A test case of AdaptiveQueryExecSuite becomes flaky since there are too many debug logs in RootLogger:
https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=true
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/
To fix it, I suggest supporting multiple loggers in the testing method `withLogAppender`, so that the `LogAppender` gets clean, targeted log output.
### Why are the changes needed?
Fix a flaky test case.
Also, reduce unnecessary memory cost in tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #32725 from gengliangwang/fixFlakyLogAppender.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Move `Set` command related test cases from `SQLQuerySuite` to a new test suite `SetCommandSuite`. There are 7 test cases in total.
### Why are the changes needed?
Code refactoring. `SQLQuerySuite` is becoming big.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes #32732 from gengliangwang/setsuite.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00` (Unix system time zero)
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.
For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```
Similarly, the following special date values are supported only in typed date literals:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.
For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
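To contrast the behaviors, only typed literals keep recognizing the special values after this change; plain string casts no longer do (a sketch of the intended behavior):
```scala
spark.sql("SELECT timestamp 'now'")          // still resolved to the current query start time
spark.sql("SELECT CAST('now' AS TIMESTAMP)") // no longer treated as a special value
```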
### Why are the changes needed?
In the current implementation, Spark supports the special date/timestamp values in any input string cast to dates/timestamps, which leads to the following problems:
- If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed.
- The special values play the role of distributed non-deterministic functions though users might think of the values as constants.
### Does this PR introduce _any_ user-facing change?
Yes but the probability should be small.
### How was this patch tested?
By running existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes #32714 from MaxGekk/remove-datetime-special-values.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries.
A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark.
### Why are the changes needed?
make it easy for people to debug janino, as I'm not a janino expert.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #32716 from cloud-fan/janino.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, the results of the following SQL queries are not redacted:
```
SET [KEY];
SET;
```
For example:
```
scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set").show()
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|javax.jdo.option....| 123456|
```
We should hide the sensitive information and redact the query output.
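After this change, re-running the same queries should show the value column redacted for keys matching Spark's existing redaction configuration (the exact placeholder text comes from Spark's redaction utilities):
```scala
spark.sql("set javax.jdo.option.ConnectionPassword").show() // value column is redacted instead of 123456
```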
### Why are the changes needed?
Security.
### Does this PR introduce _any_ user-facing change?
Yes, the sensitive information in the output of the Set commands is redacted.
### How was this patch tested?
Unit test
Closes #32712 from gengliangwang/redactSet.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Handle curried Products while serializing a TreeNode to JSON. While processing a [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may hit an assertion error for cases like a curried Product because of a size mismatch between the field names and the field values.
In such cases, fall back to reflection to get all the values for the constructor parameters.
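An illustration of the mismatch (a hypothetical class, not from Spark): a "curried" Product whose primary constructor takes more parameters than its product elements expose.
```scala
case class Curried(a: Int)(val b: Int)

val c = Curried(1)(2)
c.productArity                                     // 1: only the first parameter list contributes product elements
c.getClass.getConstructors.head.getParameterCount  // 2: the primary constructor still takes both parameters
// Pairing constructor parameter names with product values therefore fails the equal-size
// assertion, which is why the serializer now falls back to reflection.
```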
### Why are the changes needed?
Avoid throwing error while serializing TreeNode to JSON, try to output as much information as possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT case added.
Closes #32713 from ivoson/SPARK-35411-followup.
Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a new rule that removes the outer join if it only has DISTINCT on the streamed side. For example:
```scala
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
```
Before this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [a#2L]
+- Join LeftOuter, (a#2L = b#6L)
:- Project [id#0L AS a#2L]
: +- Range (0, 200, step=1, splits=Some(2))
+- Project [id#4L AS b#6L]
+- Range (0, 300, step=1, splits=Some(2))
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [id#0L AS a#2L]
+- Range (0, 200, step=1, splits=Some(2))
```
### Why are the changes needed?
Improve query performance. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) supports this feature:
![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #31908 from wangyum/SPARK-34808.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This is a minor change to update how `StateStoreRestoreExec` computes its number of output rows. Previously we only counted the input rows; the optionally restored rows were not counted.
### Why are the changes needed?
Currently the number of output rows of `StateStoreRestoreExec` only counts each input row, but it actually outputs the input rows plus the optionally restored rows. We should provide the correct number of output rows.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #32703 from viirya/fix-outputrows.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`.
### Why are the changes needed?
Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.
For example:
`SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2`
During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`, but after the optimizer rule `PullupCorrelatedPredicates`, the children field is used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression exclude the inner plan's output:
29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)
This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32687 from allisonwang-db/refactor-subquery-expr.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After SPARK-29291 and SPARK-33352, there are still some compilation warnings about `procedure syntax is deprecated` as follows:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:748: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `unregisterMergeResult`'s return type
[WARNING] [Warn] /spark/core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala:223: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testSimpleSpillingForAllCodecs`'s return type
[WARNING] [Warn] /spark/mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASBenchmark.scala:53: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `runBLASBenchmark`'s return type
[WARNING] [Warn] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala:110: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `assertEmptyRootPath`'s return type
[WARNING] [Warn] /spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:602: [deprecation | origin= | version=2.13.0] procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `executeCTASWithNonEmptyLocation`'s return type
```
So the main change of this PR is to clean up these compilation warnings.
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #32669 from LuciferYang/re-clean-procedure-syntax.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- EXCHANGE
- IN_SUBQUERY_EXEC
- UPDATE_FIELDS
Migrated `transformAllExpressions` call sites to use `transformAllExpressionsWithPruning`
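A sketch of the migrated call pattern (the rule body is elided; `optimizeUpdateFields` is a hypothetical helper standing in for the real rule):
```scala
import org.apache.spark.sql.catalyst.expressions.UpdateFields
import org.apache.spark.sql.catalyst.trees.TreePattern.UPDATE_FIELDS

plan.transformAllExpressionsWithPruning(_.containsPattern(UPDATE_FIELDS)) {
  case u: UpdateFields => optimizeUpdateFields(u)
}
```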
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Perf diff:
| Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline |
| --- | --- | --- | --- |
| OptimizeUpdateFields | 54646396 | 27444424 | 0.5 |
| ReplaceUpdateFieldsExpression | 24694303 | 2087517 | 0.08 |
Closes #32643 from sigmod/all_expressions.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
I just noticed that `AdaptiveQueryExecSuite.SPARK-34091: Batch shuffle fetch in AQE partition coalescing` takes more than 10 minutes to finish, which is unacceptable.
This PR sets the shuffle partitions to 10 in that test, so that the test can finish in about 5 seconds.
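A sketch of the change's shape (the test body is elided; `withSQLConf` is the standard SQL test helper):
```scala
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  // run the SPARK-34091 batch shuffle fetch assertions with only 10 shuffle partitions
}
```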
### Why are the changes needed?
speed up the test
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #32695 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes a build error with Scala 2.13 on GA.
#32301 seems to have introduced this error.
### Why are the changes needed?
To recover CI.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes #32696 from sarutak/followup-SPARK-35194.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.
### Why are the changes needed?
Improves readability for future maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32301 from karenfeng/refactor-nested-column-aliasing.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new method `isMaterialized` in `QueryStageExec`.
### Why are the changes needed?
Currently, we use `resultOption().get.isDefined` to check if a query stage has materialized. The code is not readable at a glance. It's better to use a new method like `isMaterialized` to define it.
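A minimal sketch of the proposed helper, based on the expression quoted above (the exact field access inside `QueryStageExec` may differ):
```scala
def isMaterialized: Boolean = resultOption().get.isDefined
```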
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass CI.
Closes #32689 from ulysses-you/SPARK-35552.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Various small code simplification/cleanup for OptimizeSkewedJoin
### Why are the changes needed?
code refactor
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #32685 from cloud-fan/skew-join.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Initial implementation of RocksDBCheckpointMetadata. It persists the metadata for RocksDBFileManager.
### Why are the changes needed?
The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the number of total keys.
The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - [Directory Structure and Format for Files stored in DFS](https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2).
### Does this PR introduce _any_ user-facing change?
No. Internal implementation only.
### How was this patch tested?
New UT added.
Closes #32272 from xuanyuanking/SPARK-35172.
Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Spark conv function is from MySQL and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it.
However, it seems Spark behaves differently in two cases:
- MySQL allows leading spaces but Spark does not.
- If the input string is way too long, Spark fails with ArrayIndexOutOfBoundsException.
This patch makes conv follow MySQL's behavior in those two cases (see the sketch below):
- conv allows leading spaces
- conv returns the max unsigned long when the input string is way too long
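Expected shape of the new behavior (the outputs in the comments follow from the description above and MySQL's semantics; they are not verified here):
```scala
spark.sql("SELECT conv('  100', 2, 10)").show()            // leading spaces accepted -> 4
spark.sql("SELECT conv(repeat('9', 100), 10, 16)").show()  // overly long input -> max unsigned long (FFFFFFFFFFFFFFFF)
```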
### Why are the changes needed?
Fixing conv to match the behavior of its (almost) only reference in another DBMS, MySQL.
### Does this PR introduce _any_ user-facing change?
Yes, as pointed out above
### How was this patch tested?
Add test
Closes #32684 from dgd-contributor/SPARK-33428.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new data source V2 API: `LocalScan`. It is a special Scan that will happen on Driver locally instead of Executors.
### Why are the changes needed?
The new API improves the flexibility of the DSV2 API. It allows developers to implement connectors for data sources of small data sizes.
For example, we can build a data source for Spark History applications from Spark History Server RESTFUL API. The result set is small and fetching all the results from the Spark driver is good enough. Making it a data source allows us to operate SQL queries with filters or table joins.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test
Closes #32678 from gengliangwang/LocalScan.
Lead-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>