ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gengliang Wang	d21ff1318f	[SPARK-35716][SQL] Support casting of timestamp without time zone to date type ### What changes were proposed in this pull request? Extend the Cast expression and support TimestampWithoutTZType in casting to DateType. ### Why are the changes needed? To conform to the ANSI SQL standard, which requires support for such casting. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit test Closes #32869 from gengliangwang/castToDate. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-10 23:37:02 +03:00
Emil Ejbyfeldt	e2e3fe7782	[SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values ### What changes were proposed in this pull request? Use the key/value LambdaFunction to convert the elements instead of using CatalystTypeConverters.createToScalaConverter. This is how it is done in MapObjects and that correctly handles Arrays with case classes. ### Why are the changes needed? Before these changes the added test cases would fail with the following: ``` [info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) * FAILED * (64 milliseconds) [info] Encoded/Decoded data does not match input data [info] [info] in: Map(1 -> IntAndString(1,a)) [info] out: Map(1 -> [1,a]) [info] types: scala.collection.immutable.Map$Map1 [info] [info] Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e] [info] Schema: value#823 [info] root [info] -- value: map (nullable = true) [info] \|-- key: integer [info] \|-- value: struct (valueContainsNull = true) [info] \| \|-- i: integer (nullable = false) [info] \| \|-- s: string (nullable = true) [info] [info] [info] fromRow Expressions: [info] catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178) [info] :- 
lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : :- null [info] : +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString) [info] : :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i) [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i [info] : : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s [info] : +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179) [info] +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627) ``` So using a map with case classes as keys or values on the interpreted path would incorrectly deserialize data from the Catalyst representation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the bug. ### How was this patch tested? 
Existing and new unit tests in the ExpressionEncoderSuite Closes #32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-10 09:37:27 -07:00
Gengliang Wang	4180692135	[SPARK-35711][SQL] Support casting of timestamp without time zone to timestamp type ### What changes were proposed in this pull request? Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType. ### Why are the changes needed? To conform to the ANSI SQL standard, which requires support for such casting. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit test Closes #32864 from gengliangwang/castToTimestamp. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-10 23:03:52 +08:00
Terry Kim	88f1d82a46	[SPARK-34524][SQL][FOLLOWUP] Remove unused checkAlterTablePartition in CheckAnalysis.scala ### What changes were proposed in this pull request? #31637 removed the usage of `CheckAnalysis.checkAlterTablePartition` but didn't remove the function. ### Why are the changes needed? To remove an unused function. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32855 from imback82/SPARK-34524-followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-10 12:43:09 +00:00
Fu Chen	5280f02747	[SPARK-35673][SQL] Fix user-defined hint and unrecognized hint in subquery ### What changes were proposed in this pull request? Use `UnresolvedHint.resolved = child.resolved` instead of `UnresolvedHint.resolved = false`, so that a plan containing an `UnresolvedHint` child can be processed by the rules in batch `Resolution`. For instance, before this PR, the following plan could not be processed by `ResolveReferences`. ``` !'Project [*] +- SubqueryAlias __auto_generated_subquery_name +- UnresolvedHint use_hash +- Project [42 AS 42#10] +- OneRowRelation ``` ### Why are the changes needed? To fix a bug with hints in subqueries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #32841 from cfmcgrady/SPARK-35673. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-10 15:32:10 +08:00
dgd-contributor	aa3de40773	[SPARK-35679][SQL] instantToMicros overflow ### Why are the changes needed? When Long.MinValue is converted to an instant, secs is floored in the function microsToInstant, which causes an overflow when it is multiplied by MICROS_PER_SECOND ``` def microsToInstant(micros: Long): Instant = { val secs = Math.floorDiv(micros, MICROS_PER_SECOND) // Unfolded Math.floorMod(us, MICROS_PER_SECOND) to reuse the result of // the above calculation of `secs` via `floorDiv`. val mos = micros - secs * MICROS_PER_SECOND <- it will overflow here Instant.ofEpochSecond(secs, mos * NANOS_PER_MICROS) } ``` This overflow is acceptable because it does not change the result. However, converting the instant back to a microsecond value raises an overflow error ``` def instantToMicros(instant: Instant): Long = { val us = Math.multiplyExact(instant.getEpochSecond, MICROS_PER_SECOND) <- it overflows here val result = Math.addExact(us, NANOSECONDS.toMicros(instant.getNano)) result } ``` Code to reproduce this error ``` instantToMicros(microsToInstant(Long.MinValue)) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test added Closes #32839 from dgd-contributor/SPARK-35679_instantToMicro. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-10 08:08:51 +03:00
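The overflow described in SPARK-35679 can be reproduced outside Spark. The following is a minimal sketch, not the patch itself: the object name and the `instantToMicrosNaive` / `instantToMicrosSafe` method names are hypothetical, the constants mirror the ones quoted in the commit message, and the guarded variant is only one possible way to avoid the overflow (it shifts the boundary second by one before multiplying).

```scala
import java.time.Instant
import java.util.concurrent.TimeUnit.NANOSECONDS

// Hypothetical stand-alone sketch; names are illustrative and not Spark's API.
object InstantMicrosSketch {
  val MICROS_PER_SECOND: Long = 1000000L
  val NANOS_PER_MICROS: Long = 1000L
  // Smallest epoch second that microsToInstant can produce.
  private val MinSeconds: Long = Math.floorDiv(Long.MinValue, MICROS_PER_SECOND)

  def microsToInstant(micros: Long): Instant = {
    val secs = Math.floorDiv(micros, MICROS_PER_SECOND)
    // For Long.MinValue the product wraps around, but the subtraction wraps
    // back, so `mos` still ends up in the range [0, MICROS_PER_SECOND).
    val mos = micros - secs * MICROS_PER_SECOND
    Instant.ofEpochSecond(secs, mos * NANOS_PER_MICROS)
  }

  // Same shape as the code quoted in the commit message: multiplyExact throws
  // ArithmeticException when the epoch second equals MinSeconds.
  def instantToMicrosNaive(instant: Instant): Long = {
    val us = Math.multiplyExact(instant.getEpochSecond, MICROS_PER_SECOND)
    Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))
  }

  // Assumed guard: at the boundary second, multiply (secs + 1) instead and
  // subtract one second's worth of microseconds from the fractional part.
  def instantToMicrosSafe(instant: Instant): Long = {
    val secs = instant.getEpochSecond
    val fractionMicros = NANOSECONDS.toMicros(instant.getNano)
    if (secs == MinSeconds) {
      val us = Math.multiplyExact(secs + 1, MICROS_PER_SECOND)
      Math.addExact(us, fractionMicros - MICROS_PER_SECOND)
    } else {
      Math.addExact(Math.multiplyExact(secs, MICROS_PER_SECOND), fractionMicros)
    }
  }
}
```

With these definitions, `InstantMicrosSketch.instantToMicrosNaive(InstantMicrosSketch.microsToInstant(Long.MinValue))` throws, while the guarded variant round-trips `Long.MinValue`.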
Linhong Liu	87d2ffbbcf	[MINOR][SQL] No need to normalize name for built-in functions ### What changes were proposed in this pull request? Add an `internalRegisterFunction` for the built-in function registry so that we can skip the unnecessary function normalization. ### Why are the changes needed? small refactor ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing ut Closes #32842 from linhongliu-db/function-refactor. Lead-authored-by: Linhong Liu <linhong.liu@databricks.com> Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-10 04:35:26 +00:00
Kousuke Saruta	7e99b65295	[SPARK-35194][SQL][FOLLOWUP] Change Seq to collections.Seq in NestedColumnAliasing to work with Scala 2.13 ### What changes were proposed in this pull request? This PR changes an occurrence of `Seq` to `collections.Seq` in `NestedColumnAliasing`. ### Why are the changes needed? In the current master, `NestedColumnAliasing` doesn't work with Scala 2.13 and the relevant tests fail. The following are examples. * `NestedColumnAliasingSuite` * Subclasses of `SchemaPruningSuite` * `ColumnPruningSuite` ``` NestedColumnAliasingSuite: [info] - Pushing a single nested field projection * FAILED * (14 milliseconds) [info] scala.MatchError: (none#211451,ArrayBuffer(name#211451.middle)) (of class scala.Tuple2) [info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.$anonfun$getAttributeToExtractValues$5(NestedColumnAliasing.scala:258) [info] at scala.collection.StrictOptimizedMapOps.flatMap(StrictOptimizedMapOps.scala:31) [info] at scala.collection.StrictOptimizedMapOps.flatMap$(StrictOptimizedMapOps.scala:30) [info] at scala.collection.immutable.HashMap.flatMap(HashMap.scala:39) [info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.getAttributeToExtractValues(NestedColumnAliasing.scala:258) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran tests mentioned above and all passed with Scala 2.13. Closes #32848 from sarutak/followup-SPARK-35194-2. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-10 02:14:40 +00:00
Kousuke Saruta	94b66f5e28	[MINOR][SQL] Modify the example of rand and randn ### What changes were proposed in this pull request? This PR fixes the examples of `rand` and `randn`. ### Why are the changes needed? SPARK-23643 (#20793) fixes an issue which is related to the seed and it changes the results of `rand` and `randn`. Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Built the doc and confirmed it. ![rand-doc](https://user-images.githubusercontent.com/4736016/121359059-145a9b80-c96e-11eb-84c2-2f2b313614f3.png) Closes #32844 from sarutak/rand-example. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-10 10:37:38 +09:00
Gengliang Wang	74b3df86f3	[SPARK-35698][SQL] Support casting of timestamp without time zone to strings ### What changes were proposed in this pull request? Extend the Cast expression and support TimestampWithoutTZType in casting to StringType. ### Why are the changes needed? To conform to the ANSI SQL standard, which requires support for such casting. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit test Closes #32846 from gengliangwang/tswtzToString. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-10 02:29:37 +08:00
allisonwang-db	f49bf1a072	[SPARK-34382][SQL] Support LATERAL subqueries ### What changes were proposed in this pull request? This PR adds support for lateral subqueries. A lateral subquery is a subquery preceded by the `LATERAL` keyword in the FROM clause of a query that can reference columns in the preceding FROM items. For example: ```sql SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c) ``` A new subquery expression`LateralSubquery` is used to represent a lateral subquery. It is similar to `ScalarSubquery` but can return multiple rows and columns. A new logical unary node `LateralJoin` is used to represent a lateral join. Here is the analyzed plan for the above query: ```scala Project [a, b, c, d] +- LateralJoin lateral-subquery [a], Inner : +- Project [c, d] : +- Filter (outer(a) = c) : +- Relation [c, d] +- Relation [a, b] ``` Similar to a correlated subquery, a lateral subquery can be viewed as a dependent (nested loop) join where the evaluation of the right subtree depends on the current value of the left subtree. The same technique to decorrelate a subquery is used to decorrelate a lateral join: ```scala Project [a, b, c, d] +- LateralJoin lateral-subquery [a && a = c], Inner // pull up correlated predicates as join conditions : +- Project [c, d] : +- Relation [c, d] +- Relation [a, b] ``` Then the lateral join can be rewritten into a normal join: ```scala Join Inner (a = c) :- Relation [a, b] +- Relation [c, d] ``` #### Follow-ups: 1. Similar to rewriting correlated scalar subqueries, rewriting lateral joins is also subject to the COUNT bug (See SPARK-15370 for more details). This is not handled in the current PR as it requires a sizeable amount of refactoring. It will be addressed in a subsequent PR (SPARK-35551). 2. Currently Spark does use outer query references to resolve star expressions in subqueries. This is not lateral subquery specific and can be handled in a separate PR (SPARK-35618) ### Why are the changes needed? 
To support an ANSI SQL feature. ### Does this PR introduce _any_ user-facing change? Yes. It allows users to use lateral subqueries in the FROM clause of a query. ### How was this patch tested? - Parser test: `PlanParserSuite.scala` - Analyzer test: `ResolveSubquerySuite.scala` - Optimizer test: `PullupCorrelatedPredicatesSuite.scala` - SQL test: `join-lateral.sql`, `postgreSQL/join.sql` Closes #32303 from allisonwang-db/spark-34382-lateral. Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-09 17:08:32 +00:00
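The SPARK-34382 message describes a lateral subquery as a dependent (nested loop) join that can be decorrelated into a normal join. The following sketch models that equivalence on plain Scala collections only. It does not use Spark, and the `T1` / `T2` case classes and both method names are hypothetical stand-ins for the `t1(a, b)` and `t2(c, d)` relations in the example query.

```scala
// Hypothetical in-memory model of the example query
//   SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c)
object LateralJoinSketch {
  final case class T1(a: Int, b: Int)
  final case class T2(c: Int, d: Int)

  // Dependent evaluation: the right side is re-evaluated for every left row
  // and may reference the current left row (here, l.a).
  def lateralJoin(t1: Seq[T1], t2: Seq[T2]): Seq[(Int, Int, Int, Int)] =
    for {
      l <- t1
      r <- t2.filter(_.c == l.a)
    } yield (l.a, l.b, r.c, r.d)

  // Decorrelated form: the correlated predicate a = c becomes the join
  // condition of an ordinary inner join, evaluated here via a lookup table.
  def innerJoin(t1: Seq[T1], t2: Seq[T2]): Seq[(Int, Int, Int, Int)] = {
    val rightByKey = t2.groupBy(_.c)
    t1.flatMap { l =>
      rightByKey.getOrElse(l.a, Seq.empty).map(r => (l.a, l.b, r.c, r.d))
    }
  }
}
```

Both methods return the same rows for any input, which is the property the rewrite from `LateralJoin` to `Join Inner (a = c)` relies on in this simple case. The COUNT bug mentioned in the follow-ups concerns aggregates and is not modelled here.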
Gengliang Wang	313dc2d4ed	[SPARK-35697][SQL][TESTS] Test TimestampWithoutTZType as ordered and atomic type ### What changes were proposed in this pull request? Add `TimestampWithoutTZType` to `DataTypeTestUtils.ordered`/`atomicTypes`, and implement values generation of those types in `LiteralGenerator`/`RandomDataGenerator`. In this way, the types will be tested automatically in: 1. ArithmeticExpressionSuite: - "function least" - "function greatest" 2. PredicateSuite - "BinaryComparison consistency check" - "AND, OR, EqualTo, EqualNullSafe consistency check" 3. ConditionalExpressionSuite - "if" 4. RandomDataGeneratorSuite - "Basic types" 5. CastSuite - "null cast" - "up-cast" - "SPARK-27671: cast from nested null type in struct" 6. OrderingSuite - "GenerateOrdering with TimestampWithoutTZType" 7. PredicateSuite - "IN with different types" 8. UnsafeRowSuite - "calling get(ordinal, datatype) on null columns" 9. SortSuite - "sorting on TimestampWithoutTZType ..." ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites. Closes #32843 from gengliangwang/atomicTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-09 15:19:25 +00:00
Chao Sun	7d8181b62f	[SPARK-35390][SQL] Handle type coercion when resolving V2 functions ### What changes were proposed in this pull request? Handle type coercion when resolving V2 function. In particular: - prior to evaluating function arguments, insert cast whenever the argument type doesn't match the expected input type. - use `BoundFunction.inputTypes()` to look up the magic method for scalar functions ### Why are the changes needed? For V2 functions, the actual argument types need not exactly match the declared input types, and Spark should handle type coercion whenever it is needed. ### Does this PR introduce _any_ user-facing change? Yes. Now V2 function resolution should be able to handle type coercion properly. ### How was this patch tested? Added a few new tests. Closes #32764 from sunchao/SPARK-35390. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-09 13:22:19 +00:00
beliefer	ebb4858f71	[SPARK-35058][SQL] Group exception messages in hive/client ### What changes were proposed in this pull request? This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/client`. ### Why are the changes needed? It will largely help with standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32763 from beliefer/SPARK-35058. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-09 08:23:09 +00:00
Gengliang Wang	84c5ca33f9	[SPARK-35664][SQL] Support java.time.LocalDateTime as an external type of TimestampWithoutTZ type ### What changes were proposed in this pull request? In the PR, I propose to extend Spark SQL API to accept `java.time.LocalDateTime` as an external type of the recently added Catalyst type `TimestampWithoutTZ`. The Java class `java.time.LocalDateTime` has semantics similar to the ANSI SQL timestamp without time zone type, and it is the most suitable external type for `TimestampWithoutTZType`. In more detail: * Added `TimestampWithoutTZConverter` which converts java.time.LocalDateTime instances to/from the internal representation of the Catalyst type `TimestampWithoutTZType` (to Long type). The `TimestampWithoutTZConverter` object uses new methods of DateTimeUtils: * localDateTimeToMicros() converts the input date time to the total length in microseconds. * microsToLocalDateTime() obtains a java.time.LocalDateTime * Support new type `TimestampWithoutTZType` in RowEncoder via the methods createDeserializerForLocalDateTime() and createSerializerForLocalDateTime(). * Extended the Literal API to construct literals from `java.time.LocalDateTime` instances. ### Why are the changes needed? To allow users to parallelize `java.time.LocalDateTime` collections and construct timestamp without time zone columns, and also to collect such columns back to the driver side. ### Does this PR introduce _any_ user-facing change? The PR extends existing functionality. So, users can parallelize instances of the java.time.LocalDateTime class and collect them back. ``` scala> val ds = Seq(java.time.LocalDateTime.parse("1970-01-01T00:00:00")).toDS ds: org.apache.spark.sql.Dataset[java.time.LocalDateTime] = [value: timestampwithouttz] scala> ds.collect() res0: Array[java.time.LocalDateTime] = Array(1970-01-01T00:00) ``` ### How was this patch tested? New unit tests Closes #32814 from gengliangwang/LocalDateTime. 
Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-09 14:59:46 +08:00
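The SPARK-35664 message names two conversion helpers, `localDateTimeToMicros()` and `microsToLocalDateTime()`, without showing them. The sketch below illustrates what such a pair could look like using only `java.time`. It is not the code from the PR: the object name is hypothetical, and treating the local date-time as if it were in UTC is an assumption made here so that no time zone shift is applied.

```scala
import java.time.{LocalDateTime, ZoneOffset}

// Hypothetical stand-alone sketch; not the DateTimeUtils implementation.
object LocalDateTimeMicrosSketch {
  private val MicrosPerSecond: Long = 1000000L
  private val NanosPerMicro: Long = 1000L

  // Microseconds since 1970-01-01T00:00:00, reading the value as UTC (assumption).
  def localDateTimeToMicros(ldt: LocalDateTime): Long = {
    val instant = ldt.toInstant(ZoneOffset.UTC)
    val wholeSeconds = Math.multiplyExact(instant.getEpochSecond, MicrosPerSecond)
    Math.addExact(wholeSeconds, instant.getNano / NanosPerMicro)
  }

  // Inverse conversion; floorDiv/floorMod keep the fraction non-negative
  // for values before the epoch.
  def microsToLocalDateTime(micros: Long): LocalDateTime = {
    val secs = Math.floorDiv(micros, MicrosPerSecond)
    val nanos = Math.floorMod(micros, MicrosPerSecond) * NanosPerMicro
    LocalDateTime.ofEpochSecond(secs, nanos.toInt, ZoneOffset.UTC)
  }
}
```

A round trip through both methods returns the original value for any `LocalDateTime` with microsecond precision, since sub-microsecond digits are truncated on the way in.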
Chao Sun	66e38f48fe	[SPARK-35384][SQL][FOLLOWUP] Fix Scala doc for removed method parameters ### What changes were proposed in this pull request? Fix Scala doc for removed parameters for `InvokeLike.invoke`. ### Why are the changes needed? #32532 forgot to update the Scala doc after removing 2 parameters for `InvokeLike.invoke`. This fixes it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #32827 from sunchao/SPARK-35384-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-08 15:52:10 -07:00
Satish Gopalani	2a331177ba	[SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger ### What changes were proposed in this pull request? This patch introduces a new option, minOffsetsPerTrigger, to specify the minimum number of offsets to read per trigger, and maxTriggerDelay to avoid waiting indefinitely for the trigger. This new option will allow skipping a trigger/batch when the number of records available in Kafka is low. This is a very useful feature in cases where we have a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. The 'maxTriggerDelay' option avoids an infinite delay in scheduling the trigger: the trigger fires irrespective of the records available once the time since the last trigger exceeds maxTriggerDelay. It is an optional parameter with a default value of 15 mins, and it is only applicable if minOffsetsPerTrigger is set. The minOffsetsPerTrigger option is also optional, but once specified it takes precedence over maxOffsetsPerTrigger, which is honored only after minOffsetsPerTrigger is satisfied. ### Why are the changes needed? There are many scenarios where there is a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. Tuning such jobs is difficult, as decreasing the trigger processing time increases the number of batches, and hence cluster resource usage, and adds to small-file issues. Increasing the trigger processing time adds consumer lag. This patch tries to address this issue. ### How was this patch tested? This patch was tested by adding test cases as well as manually on a cluster where the job ran for one full day with a data burst happening once a day. 
Here is the picture of the data burst and hence consumer lag: <img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png"> This is how the job behaved at burst time running every 4.5 mins (which is the specified trigger time): <img width="1154" alt="Burst Time" src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png"> This is job behavior during the non-burst time where it is skipping 2 to 3 triggers and running once every 9 to 13.5 mins <img width="1154" alt="Non Burst Time" src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png"> Here are some more stats from the two runs, i.e. one normal run and the other with minOffsetsPerTrigger set:

| Run | Data Size | Number of Batch Runs | Number of Files |
| ------------- | ------------- | ------------- | ------------- |
| Normal Run | 54.2 GB | 320 | 21968 |
| Run with minOffsetsPerTrigger | 54.2 GB | 120 | 12104 |

Closes #32653 from satishgopalani/SPARK-35312. Authored-by: Satish Gopalani <satish.gopalani@pubmatic.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-06-08 23:48:09 +09:00
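The interaction between minOffsetsPerTrigger, maxTriggerDelay and maxOffsetsPerTrigger described in SPARK-35312 can be summarised as a small decision function. This is a sketch of the rule as the commit message states it, not the Kafka source implementation: the object, method and parameter names are hypothetical, times are plain milliseconds, and firing exactly at the delay boundary (`>=`) is an assumption.

```scala
// Hypothetical model of the trigger decision; not Spark's Kafka source code.
object MinOffsetsTriggerSketch {
  // Default of 15 minutes, as stated in the commit message.
  val DefaultMaxTriggerDelayMs: Long = 15L * 60L * 1000L

  // A batch runs when enough offsets are available, or when the time since
  // the last trigger has reached the maximum delay.
  def shouldTrigger(
      availableOffsets: Long,
      minOffsetsPerTrigger: Long,
      millisSinceLastTrigger: Long,
      maxTriggerDelayMs: Long = DefaultMaxTriggerDelayMs): Boolean =
    availableOffsets >= minOffsetsPerTrigger ||
      millisSinceLastTrigger >= maxTriggerDelayMs

  // maxOffsetsPerTrigger only caps the batch once the minimum is satisfied
  // (or the delay has expired); None means the trigger is skipped.
  def batchSize(
      availableOffsets: Long,
      minOffsetsPerTrigger: Long,
      maxOffsetsPerTrigger: Long,
      millisSinceLastTrigger: Long,
      maxTriggerDelayMs: Long = DefaultMaxTriggerDelayMs): Option[Long] =
    if (shouldTrigger(availableOffsets, minOffsetsPerTrigger,
        millisSinceLastTrigger, maxTriggerDelayMs)) {
      Some(math.min(availableOffsets, maxOffsetsPerTrigger))
    } else {
      None
    }
}
```

Under this model a low-volume period produces `None` for several consecutive triggers until either the backlog reaches the minimum or the delay expires, which matches the skipped triggers reported for the non-burst period.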
Cheng Pan	eee02739ed	[SPARK-34290][SQL][FOLLOWUP] Cleanup truncate table not supported for V2Table error ### What changes were proposed in this pull request? Cleanup unreachable code. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32791 from pan3793/cleanup. Authored-by: Cheng Pan <379377944@qq.com> Signed-off-by: Kent Yao <yao@apache.org>	2021-06-08 13:24:11 +08:00
Gengliang Wang	33f26275f4	[SPARK-35663][SQL] Add Timestamp without time zone type ### What changes were proposed in this pull request? Extend Catalyst's type system with a new type that conforms to the SQL standard (see SQL:2016, section 4.6.2): TimestampWithoutTZType represents the timestamp without time zone type ### Why are the changes needed? Spark SQL today supports the TIMESTAMP data type. However, the semantics provided actually match TIMESTAMP WITH LOCAL TIME ZONE as defined by Oracle. Timestamps embedded in a SQL query or passed through JDBC are presumed to be in the session local time zone and cast to UTC before being processed. These are desirable semantics in many cases, such as when dealing with calendars. In many (more) other cases, such as when dealing with log files, it is desirable that the provided timestamps not be altered. SQL users expect that they can model either behavior and do so by using TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH LOCAL TIME ZONE for time zone sensitive data. Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, and their users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist in the standard. In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for the standard semantics. Using these two types will provide clarity. This is a starting PR. See more details in https://issues.apache.org/jira/browse/SPARK-35662 ### Does this PR introduce _any_ user-facing change? Yes, a new data type for Timestamp without time zone type. It is still in development. ### How was this patch tested? Unit test Closes #32802 from gengliangwang/TimestampNTZType. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-07 14:21:31 +00:00
Wenchen Fan	a70e66ecfa	[SPARK-35665][SQL] Resolve UnresolvedAlias in CollectMetrics ### What changes were proposed in this pull request? It's a long-standing bug that we forgot to resolve `UnresolvedAlias` in `CollectMetrics`. It's a bit hard to trigger this bug before 3.2, as most likely people won't create `UnresolvedAlias` when calling `Dataset.observe`. However, things have changed since https://github.com/apache/spark/pull/30974. This PR proposes to handle `CollectMetrics` in the rule `ResolveAliases`. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated test Closes #32803 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 21:05:11 +09:00
Alkis Polyzotis	6f8c62047c	[SPARK-35558] Optimizes for multi-quantile retrieval ### What changes were proposed in this pull request? Optimizes the retrieval of approximate quantiles for an array of percentiles. * Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation. * Modifies the ApproximatePercentiles operator to call into the new method. All formatting changes are the result of running ./dev/scalafmt ### Why are the changes needed? The existing implementation does repeated calls per input percentile resulting in redundant computation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests for the new method. Closes #32700 from alkispoly-db/spark_35558_approx_quants_array. Authored-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-06-05 14:25:33 -05:00
Yingyi Bu	7bc364beed	[SPARK-35621][SQL] Add rule id pruning to the TypeCoercion rule ### What changes were proposed in this pull request? - Added TreeNode.transformUpWithBeforeAndAfterRuleOnChildren(...); - Call transformUpWithBeforeAndAfterRuleOnChildren in TypeCoercionRule. ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### How was this patch tested? Existing tests. Performance diff:

| | Baseline | Experiment (wo. ruleId) | Experiment (wo. ruleId)/Baseline | Experiment (w. ruleId) | Experiment (w. ruleId)/Baseline |
| -- | -- | -- | -- | -- | -- |
| CombinedTypeCoercionRule | 665020354 | 567320034 | 0.85 | 330798240 | 0.50 |

Closes #32761 from sigmod/transform. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-05 14:49:16 +08:00
Kent Yao	dc3317fdf9	[SPARK-21957][SQL][FOLLOWUP] Support CURRENT_USER without trailing parentheses ### What changes were proposed in this pull request? A followup for `345d35ed1a`. In this PR we support CURRENT_USER without trailing parentheses in default mode. In ANSI mode, CURRENT_USER can only be used without trailing parentheses, because it is a reserved keyword that cannot be used as a function name ### Why are the changes needed? 1. make it the same as current_date/current_timestamp 2. better ANSI compliance ### Does this PR introduce _any_ user-facing change? no, just a followup ### How was this patch tested? new tests Closes #32770 from yaooqinn/SPARK-21957-F. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-04 13:32:56 +00:00
ulysses-you	c7fb0e18be	[SPARK-35629][SQL] Use better exception type if database doesn't exist on `drop database` ### What changes were proposed in this pull request? Add a database-exists check in `SessionCatalog` ### Why are the changes needed? Currently, executing `drop database test` throws an unfriendly error message. ``` Error in query: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112) at org.apache.spark.sql.hive.HiveExternalCatalog.dropDatabase(HiveExternalCatalog.scala:200) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropDatabase(ExternalCatalogWithListener.scala:53) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropDatabase(SessionCatalog.scala:273) at org.apache.spark.sql.execution.command.DropDatabaseCommand.run(ddl.scala:111) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3707) ``` ### Does this PR introduce _any_ user-facing change? Yes, a cleaner error message. ### How was this patch tested? Add test. Closes #32768 from ulysses-you/SPARK-35629. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-04 15:52:21 +08:00
Karen Feng	53a758b51b	[SPARK-35636][SQL] Lambda keys should not be referenced outside of the lambda function ### What changes were proposed in this pull request? Sets `references` for `NamedLambdaVariable` and `LambdaFunction`. \| Expression \| NamedLambdaVariable \| LambdaFunction \| \| --- \| --- \| --- \| \| References before \| None \| All function references \| \| References after \| self.toAttribute \| Function references minus arguments' references \| In `NestedColumnAliasing`, this means that `ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...)` now references both `attr` and `lv`, rather than just `attr`. As a result, it will not be included in the nested column references. ### Why are the changes needed? Before, lambda key was referenced outside of lambda function. #### Example 1 Before: ``` Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0] +- 'Join Cross :- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0] : +- LocalRelation <empty>, [kvs#0] +- LocalRelation <empty>, [keys#0] ``` After: ``` Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419] +- Join Cross :- LocalRelation <empty>, [kvs#417] +- LocalRelation <empty>, [keys#418] ``` #### Example 2 Before: ``` Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0] +- GlobalLimit 5 +- LocalLimit 5 +- Project [keys#0, _extract_v1#0 AS _extract_v1#0] +- GlobalLimit 5 +- LocalLimit 5 +- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0] +- LocalRelation <empty>, [kvs#0, keys#0] ``` After: ``` Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429] +- GlobalLimit 5 +- LocalLimit 5 +- Project [keys#428, kvs#427] +- GlobalLimit 5 +- LocalLimit 5 +- LocalRelation <empty>, [kvs#427, keys#428] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
Scala unit tests for the examples above Closes #32773 from karenfeng/SPARK-35636. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 15:44:32 +09:00
fornaix	878527d9fa	[SPARK-35612][SQL] Support LZ4 compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support LZ4 compression in the ORC data source. ### Why are the changes needed? Apache ORC supports LZ4 compression, but we cannot set LZ4 compression in the ORC data source BEFORE ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") java.lang.IllegalArgumentException: Codec [lz4] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none, zstd. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") ``` ```bash $ orc-tools meta /tmp/lz4 Processing data file file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc [length: 222] Structure for file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc File Version: 0.12 with ORC_517 Rows: 10 Compression: LZ4 Compression size: 262144 Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 File Statistics: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 Stripes: Stripe: offset: 3 data: 7 rows: 10 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 222 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Pass the newly added test case. Closes #32751 from fornaix/spark-35612. Authored-by: fornaix <foxnaix@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-03 14:07:26 -07:00
Liang-Chi Hsieh	0342dcb628	[SPARK-35580][SQL] Implement canonicalized method for HigherOrderFunction ### What changes were proposed in this pull request? This patch implements the `canonicalized` method for `HigherOrderFunction`. Basically it canonicalizes the name of all `NamedLambdaVariable`s and their `ExprId`. The name and `ExprId` of each `NamedLambdaVariable` are unique, but to compare `HigherOrderFunction`s for semantic equality we can canonicalize them. ### Why are the changes needed? The default `canonicalized` method does not work for `HigherOrderFunction`. As a result, subexpression elimination does not work for higher-order functions. Manually checked the generated code for: ```scala val df = Seq(Seq(1, 2, 3)).toDF("a") df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect() ``` The code for `transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false))`, generated by `GenerateUnsafeProjection`: Before: ```java /* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... /* 028 */ public UnsafeRow apply(InternalRow i) { ... /* 034 */ Object obj_0 = ((Expression) references[0]).eval(i); ... /* 062 */ Object obj_1 = ((Expression) references[1]).eval(i); ... /* 093 */ } ``` After: ```java /* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... /* 031 */ public UnsafeRow apply(InternalRow i) { ... /* 033 */ subExpr_0(i); ... 
/* 086 */ private void subExpr_0(InternalRow i) { /* 087 */ Object obj_0 = ((Expression) references[0]).eval(i); /* 088 */ boolean isNull_0 = obj_0 == null; /* 089 */ ArrayData value_0 = null; /* 090 */ if (!isNull_0) { /* 091 */ value_0 = (ArrayData) obj_0; /* 092 */ } /* 093 */ subExprIsNull_0 = isNull_0; /* 094 */ mutableStateArray_0[0] = value_0; /* 095 */ } /* 096 */ /* 097 */ } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test and manual check of the generated code. Closes #32735 from viirya/higher-func-canonicalize. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-03 09:16:47 -07:00
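The canonicalization idea in SPARK-35580 can be illustrated with a small self-contained sketch. The expression classes below (`LambdaVar`, `Lit`, `Add`, `Lambda`) and the positional naming scheme `arg_0`, `arg_1`, ... are hypothetical stand-ins, not Spark's `Expression` API, and nested lambdas are deliberately left unhandled.

```scala
object LambdaCanonicalization {
  // Hypothetical mini expression tree; these are not Spark's classes.
  sealed trait Expr
  final case class LambdaVar(name: String, id: Long) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Lambda(body: Expr, args: Seq[LambdaVar]) extends Expr

  // Replace each lambda argument's unique name and id with positional ones,
  // so two lambdas that differ only in generated names compare equal.
  def canonicalize(lambda: Lambda): Lambda = {
    val renamed: Map[LambdaVar, LambdaVar] = lambda.args.zipWithIndex.map {
      case (v, i) => v -> LambdaVar(s"arg_$i", i.toLong)
    }.toMap

    def rewrite(e: Expr): Expr = e match {
      case v: LambdaVar => renamed.getOrElse(v, v)
      case Add(l, r)    => Add(rewrite(l), rewrite(r))
      case other        => other // literals, and nested lambdas (not handled here)
    }

    Lambda(rewrite(lambda.body), lambda.args.map(renamed))
  }
}
```

With this sketch, the two `x => x + 1` lambdas from the example above, which carry different generated variable names, become equal after `canonicalize`, which is the property subexpression elimination relies on.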
Fu Chen	cfde117c6f	[SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate ### What changes were proposed in this pull request? This PR adds In/InSet predicate support to `UnwrapCastInBinaryComparison`. The current implementation doesn't push down filters for `In/InSet` predicates that contain a `Cast`. For instance: ```scala spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain ``` Before this PR: ``` == Physical Plan == (1) Filter cast(id#5 as bigint) IN (1,2,4) +- (1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> ``` After this PR: ``` == Physical Plan == (1) Filter id#95 IN (1,2,4) +- (1) ColumnarToRow +- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct<id:int> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #32488 from cfmcgrady/SPARK-35316. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-03 14:45:17 +00:00
Yuming Wang	8041aed296	[SPARK-34808][SQL][FOLLOWUP] Remove canPlanAsBroadcastHashJoin check in EliminateOuterJoin ### What changes were proposed in this pull request? This PR removes the `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin`. ### Why are the changes needed? We can always remove an outer join if it only has DISTINCT on the streamed side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #32744 from wangyum/SPARK-34808-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 14:14:37 +00:00
gengjiaan	9f7cdb89f7	[SPARK-35059][SQL] Group exception messages in hive/execution ### What changes were proposed in this pull request? This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests; all original tests pass, to make sure it doesn't break any existing behavior. Closes #32694 from beliefer/SPARK-35059. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 13:06:55 +00:00
Kent Yao	345d35ed1a	[SPARK-21957][SQL] Support current_user function ### What changes were proposed in this pull request? Currently, we do not have a suitable definition of the `user` concept in Spark. We only have an application-wide `sparkUser`, but do not support identifying or retrieving the user information from a session in STS or during runtime query execution. `current_user()` is very popular, is supported by plenty of other modern and old-school databases, and is ANSI compliant. This PR adds `current_user()` as a SQL function. And, they are the same. In this PR, we add these functions without ambiguity. 1. For a normal single-threaded Spark application, clearly the `sparkUser` is always equivalent to `current_user()`. 2. For a multi-threaded Spark application, e.g. Spark thrift server, we use a `ThreadLocal` variable to store the client-side user (after authentication) before running the query and retrieve it in the parser. ### Why are the changes needed? `current_user()` is very popular, is supported by plenty of other modern and old-school databases, and is ANSI compliant. ### Does this PR introduce _any_ user-facing change? Yes, it adds `current_user()` as a SQL function. ### How was this patch tested? New tests in thrift server and sql/catalyst. Closes #32718 from yaooqinn/SPARK-21957. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 13:04:40 +00:00
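The `ThreadLocal` approach described in SPARK-21957 can be sketched with the standard library alone. The object and method names below (`CurrentUserSketch`, `withUser`, `currentUser`) are hypothetical and chosen for illustration; Spark's actual classes and call sites may look different.

```scala
object CurrentUserSketch {
  // Per-thread slot for the authenticated client user; empty by default.
  private val sessionUser = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  // Run `body` with `user` bound on the current thread, then restore the old value.
  def withUser[T](user: String)(body: => T): T = {
    val previous = sessionUser.get()
    sessionUser.set(Some(user))
    try body finally sessionUser.set(previous)
  }

  // Fall back to the application-wide user when no session user is bound.
  def currentUser(appUser: String): String =
    sessionUser.get().getOrElse(appUser)
}
```

In a single-threaded application nothing is ever bound, so `currentUser` returns the application user. In a server that handles each client on its own thread, each thread sees only the user bound by its own `withUser` call.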
Yingyi Bu	3f6322f9aa	[SPARK-35077][SQL] Migrate to transformWithPruning for leftover optimizer rules ### What changes were proposed in this pull request? Migrate to transformWithPruning for the following rules: - SimplifyExtractValueOps - NormalizeFloatingNumbers - PushProjectionThroughUnion - PushDownPredicates - ExtractPythonUDFFromAggregate - ExtractPythonUDFFromJoinCondition - ExtractGroupingPythonUDFFromAggregate - ExtractPythonUDFs - CleanupDynamicPruningFilters ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### How was this patch tested? Existing tests. Performance diff: Rule name \| Baseline \| Experiment \| Experiment/Baseline -- \| -- \| -- \| -- SimplifyExtractValueOps \| 99367049 \| 3679579 \| 0.04 NormalizeFloatingNumbers \| 24717928 \| 20451094 \| 0.83 PushProjectionThroughUnion \| 14130245 \| 7913551 \| 0.56 PushDownPredicates \| 276333542 \| 261246842 \| 0.95 ExtractPythonUDFFromAggregate \| 6459451 \| 2683556 \| 0.42 ExtractPythonUDFFromJoinCondition \| 5695404 \| 2504573 \| 0.44 ExtractGroupingPythonUDFFromAggregate \| 5546701 \| 1858755 \| 0.34 ExtractPythonUDFs \| 58726458 \| 1598518 \| 0.03 CleanupDynamicPruningFilters \| 26606652 \| 15417936 \| 0.58 OptimizeSubqueries \| 3072287940 \| 2876462708 \| 0.94 Closes #32721 from sigmod/pushdown. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-02 11:46:33 +08:00
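To show why pruning reduces tree traversals, here is a minimal sketch of the general technique: each node caches the set of patterns found in its subtree, and a transform skips any subtree that cannot contain the pattern a rule needs. The node classes, the three pattern tags, and the signature of `transformWithPruning` below are assumptions made for illustration and are not Spark's `TreeNode` API. The sketch also rewrites bottom-up for brevity.

```scala
object TreePatternPruning {
  // Hypothetical pattern tags; Spark's TreePattern enum is far larger.
  object Pattern extends Enumeration {
    val LITERAL, PLUS, UDF = Value
  }

  sealed trait Node {
    def children: Seq[Node]
    def own: Pattern.Value
    // Cached once per node: every pattern that occurs anywhere in this subtree.
    lazy val patterns: Set[Pattern.Value] = Set(own) ++ children.flatMap(_.patterns)
  }
  final case class Lit(value: Int) extends Node {
    def children: Seq[Node] = Nil
    def own: Pattern.Value = Pattern.LITERAL
  }
  final case class Plus(left: Node, right: Node) extends Node {
    def children: Seq[Node] = Seq(left, right)
    def own: Pattern.Value = Pattern.PLUS
  }
  final case class Udf(name: String, arg: Node) extends Node {
    def children: Seq[Node] = Seq(arg)
    def own: Pattern.Value = Pattern.UDF
  }

  // Apply `rule` bottom-up, but never descend into a subtree that lacks `required`.
  def transformWithPruning(node: Node, required: Pattern.Value)(
      rule: PartialFunction[Node, Node]): Node = {
    if (!node.patterns.contains(required)) {
      node // pruned: the whole subtree is returned untouched
    } else {
      val rebuilt = node match {
        case Plus(l, r) =>
          Plus(transformWithPruning(l, required)(rule), transformWithPruning(r, required)(rule))
        case Udf(n, a) => Udf(n, transformWithPruning(a, required)(rule))
        case leaf      => leaf
      }
      rule.applyOrElse(rebuilt, (n: Node) => n)
    }
  }
}
```

A rule that only cares about `UDF` nodes never visits a subtree such as `Plus(Lit(1), Lit(2))`. That subtree comes back as the same object instance, which is consistent with the large gains reported for rules whose patterns are rare.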
Liang-Chi Hsieh	dbf0b50757	[SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions ### What changes were proposed in this pull request? This patch proposes to improve subexpression evaluation under whole-stage codegen for the cases of nested subexpressions. ### Why are the changes needed? In the cases of nested subexpressions, whole-stage codegen's subexpression elimination will do redundant subexpression evaluation. We should reduce it. For example, if we have two sub-exprs: 1. `simpleUDF($"id")` 2. `functions.length(simpleUDF($"id"))` We should only evaluate `simpleUDF($"id")` once, i.e. ```java subExpr1 = simpleUDF($"id"); subExpr2 = functions.length(subExpr1); ``` Snippets of generated code: Before: ```java /* 040 */ private int project_subExpr_1(long project_expr_0_0) { /* 041 */ boolean project_isNull_6 = false; /* 042 */ UTF8String project_value_6 = null; /* 043 */ if (!false) { /* 044 */ project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0)); /* 045 */ } /* 046 */ /* 047 */ Object project_arg_1 = null; /* 048 */ if (project_isNull_6) { /* 049 */ project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(null); /* 050 */ } else { /* 051 */ project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(project_value_6); /* 052 */ } /* 053 */ /* 054 */ UTF8String project_result_1 = null; /* 055 */ try { /* 056 */ project_result_1 = (UTF8String)((scala.Function1[]) references[3] /* converters */)[1].apply(((scala.Function1) references[4] /* udf */).apply(project_arg_1) ); /* 057 */ } catch (Throwable e) { /* 058 */ throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( /* 059 */ "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e); /* 060 */ } /* 061 */ /* 062 */ boolean project_isNull_5 = project_result_1 == null; /* 063 */ UTF8String project_value_5 = null; /* 064 */ if (!project_isNull_5) { /* 065 */ project_value_5 = project_result_1; /* 066 */ } /* 067 */ boolean project_isNull_4 = project_isNull_5; /* 068 */ int 
project_value_4 = -1; /* 069 */ /* 070 */ if (!project_isNull_5) { /* 071 */ project_value_4 = (project_value_5).numChars(); /* 072 */ } /* 073 */ project_subExprIsNull_1 = project_isNull_4; /* 074 */ return project_value_4; /* 075 */ } ... /* 149 */ private UTF8String project_subExpr_0(long project_expr_0_0) { /* 150 */ boolean project_isNull_2 = false; /* 151 */ UTF8String project_value_2 = null; /* 152 */ if (!false) { /* 153 */ project_value_2 = UTF8String.fromString(String.valueOf(project_expr_0_0)); /* 154 */ } /* 155 */ /* 156 */ Object project_arg_0 = null; /* 157 */ if (project_isNull_2) { /* 158 */ project_arg_0 = ((scala.Function1[]) references[1] /* converters */)[0].apply(null); /* 159 */ } else { /* 160 */ project_arg_0 = ((scala.Function1[]) references[1] /* converters */)[0].apply(project_value_2); /* 161 */ } /* 162 */ /* 163 */ UTF8String project_result_0 = null; /* 164 */ try { /* 165 */ project_result_0 = (UTF8String)((scala.Function1[]) references[1] /* converters */)[1].apply(((scala.Function1) references[2] /* udf */).apply(project_arg_0) ); /* 166 */ } catch (Throwable e) { /* 167 */ throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( /* 168 */ "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e); /* 169 */ } /* 170 */ /* 171 */ boolean project_isNull_1 = project_result_0 == null; /* 172 */ UTF8String project_value_1 = null; /* 173 */ if (!project_isNull_1) { /* 174 */ project_value_1 = project_result_0; /* 175 */ } /* 176 */ project_subExprIsNull_0 = project_isNull_1; /* 177 */ return project_value_1; /* 178 */ } ``` After: ```java /* 041 */ private void project_subExpr_1(long project_expr_0_0) { /* 042 */ boolean project_isNull_8 = project_subExprIsNull_0; /* 043 */ int project_value_8 = -1; /* 044 */ /* 045 */ if (!project_subExprIsNull_0) { /* 046 */ project_value_8 = (project_mutableStateArray_0[0]).numChars(); /* 047 */ } /* 048 */ project_subExprIsNull_1 = project_isNull_8; /* 049 */ project_subExprValue_0 = project_value_8; /* 050 */ } /* 056 */ ... 
/* 123 */ /* 124 */ private void project_subExpr_0(long project_expr_0_0) { /* 125 */ boolean project_isNull_6 = false; /* 126 */ UTF8String project_value_6 = null; /* 127 */ if (!false) { /* 128 */ project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0)); /* 129 */ } /* 130 */ /* 131 */ Object project_arg_1 = null; /* 132 */ if (project_isNull_6) { /* 133 */ project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(null); /* 134 */ } else { /* 135 */ project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(project_value_6); /* 136 */ } /* 137 */ /* 138 */ UTF8String project_result_1 = null; /* 139 */ try { /* 140 */ project_result_1 = (UTF8String)((scala.Function1[]) references[3] /* converters */)[1].apply(((scala.Function1) references[4] /* udf */).apply(project_arg_1) ); /* 141 */ } catch (Throwable e) { /* 142 */ throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( /* 143 */ "DataFrameSuite$$Lambda$6430/2004847941", "string", "string", e); /* 144 */ } /* 145 */ /* 146 */ boolean project_isNull_5 = project_result_1 == null; /* 147 */ UTF8String project_value_5 = null; /* 148 */ if (!project_isNull_5) { /* 149 */ project_value_5 = project_result_1; /* 150 */ } /* 151 */ project_subExprIsNull_0 = project_isNull_5; /* 152 */ project_mutableStateArray_0[0] = project_value_5; /* 153 */ } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32699 from viirya/improve-subexpr. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-01 19:13:12 -07:00
Gengliang Wang	9d0d4edb43	[SPARK-35595][TESTS] Support multiple loggers in testing method withLogAppender ### What changes were proposed in this pull request? A test case of AdaptiveQueryExecSuite becomes flaky since there are too many debug logs in RootLogger: https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=true https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/ To fix it, I suggest supporting multiple loggers in the testing method withLogAppender. So that the LogAppender gets clean target log outputs. ### Why are the changes needed? Fix a flaky test case. Also, reduce unnecessary memory cost in tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32725 from gengliangwang/fixFlakyLogAppender. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-02 10:05:29 +08:00
Max Gekk	a59063d544	[SPARK-35581][SQL] Support special datetime values in typed literals only ### What changes were proposed in this pull request? In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` - midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` Similarly, the following special date values are supported only in typed date literals: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? In the current implementation, Spark supports the special date/timestamp value in any input strings casted to dates/timestamps that leads to the following problems: - If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed. - The special values play the role of distributed non-deterministic functions though users might think of the values as constants. ### Does this PR introduce _any_ user-facing change? Yes but the probability should be small. ### How was this patch tested? 
By running existing test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql" $ build/sbt "test:testOnly *DateTimeUtilsSuite" ``` Closes #32714 from MaxGekk/remove-datetime-special-values. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-01 15:29:05 +03:00
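The reasoning in SPARK-35581 is that a special value such as `tomorrow` should be resolved once, against a single reference date, rather than wherever a string happens to be converted. A minimal sketch of that resolution step follows; `SpecialDates.resolve` is a hypothetical helper, it ignores the optional `[zoneId]` suffix, and it is not Spark's parser.

```scala
import java.time.LocalDate
import java.util.Locale

object SpecialDates {
  // Resolve a special date keyword against one fixed `today`,
  // e.g. the date captured once when the query starts.
  def resolve(value: String, today: LocalDate): Option[LocalDate] =
    value.trim.toLowerCase(Locale.ROOT) match {
      case "epoch"         => Some(LocalDate.of(1970, 1, 1))
      case "today" | "now" => Some(today)
      case "yesterday"     => Some(today.minusDays(1))
      case "tomorrow"      => Some(today.plusDays(1))
      case _               => None // not a special value; ordinary parsing applies
    }
}
```

Because `today` is passed in explicitly, every caller that shares the same reference date gets the same answer, which is the consistency that executor-side string casts could not guarantee.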
Yingyi Bu	1dd0ca23f6	[SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules ### What changes were proposed in this pull request? Added the following TreePattern enums: - AGGREGATE_EXPRESSION - ALIAS - GROUPING_ANALYTICS - GENERATOR - HIGH_ORDER_FUNCTION - LAMBDA_FUNCTION - NEW_INSTANCE - PIVOT - PYTHON_UDF - TIME_WINDOW - TIME_ZONE_AWARE_EXPRESSION - UP_CAST - COMMAND - EVENT_TIME_WATERMARK - UNRESOLVED_RELATION - WITH_WINDOW_DEFINITION - UNRESOLVED_ALIAS - UNRESOLVED_ATTRIBUTE - UNRESOLVED_DESERIALIZER - UNRESOLVED_ORDINAL - UNRESOLVED_FUNCTION - UNRESOLVED_HINT - UNRESOLVED_SUBQUERY_COLUMN_ALIAS - UNRESOLVED_FUNC Added tree pattern pruning to the following Analyzer rules: - ResolveBinaryArithmetic - WindowsSubstitution - ResolveAliases - ResolveGroupingAnalytics - ResolvePivot - ResolveOrdinalInOrderByAndGroupBy - LookupFunction - ResolveSubquery - ResolveSubqueryColumnAliases - ApplyCharTypePadding - UpdateOuterReferences - ResolveCreateNamedStruct - TimeWindowing - CleanupAliases - EliminateUnions - EliminateSubqueryAliases - HandleAnalysisOnlyCommand - ResolveNewInstances - ResolveUpCast - ResolveDeserializer - ResolveOutputRelation - ResolveEncodersInUDF - HandleNullInputsForUDF - ResolveGenerate - ExtractGenerator - GlobalAggregates - ResolveAggregateFunctions ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### How was this patch tested? Existing tests. 
Performance diff: Rule name \| Baseline \| Experiment \| Experiment/Baseline -- \| -- \| -- \| -- ResolveBinaryArithmetic \| 43264874 \| 34707117 \| 0.80 WindowsSubstitution \| 3322996 \| 2734192 \| 0.82 ResolveAliases \| 24859263 \| 21359941 \| 0.86 ResolveGroupingAnalytics \| 39249143 \| 25417569 \| 0.80 ResolvePivot \| 6393408 \| 2843314 \| 0.44 ResolveOrdinalInOrderByAndGroupBy \| 10750806 \| 3386715 \| 0.32 LookupFunction \| 22087384 \| 15481294 \| 0.70 ResolveSubquery \| 1129139340 \| 944402323 \| 0.84 ResolveSubqueryColumnAliases \| 5055038 \| 2808210 \| 0.56 ApplyCharTypePadding \| 76285576 \| 63785681 \| 0.84 UpdateOuterReferences \| 6548321 \| 3092539 \| 0.47 ResolveCreateNamedStruct \| 38111477 \| 17350249 \| 0.46 TimeWindowing \| 41694190 \| 3739134 \| 0.09 CleanupAliases \| 48683506 \| 39584921 \| 0.81 EliminateUnions \| 3405069 \| 2372506 \| 0.70 EliminateSubqueryAliases \| 9626649 \| 9518216 \| 0.99 HandleAnalysisOnlyCommand \| 2562123 \| 2661432 \| 1.04 ResolveNewInstances \| 16208966 \| 1982314 \| 0.12 ResolveUpCast \| 14067843 \| 1868615 \| 0.13 ResolveDeserializer \| 17991103 \| 2320308 \| 0.13 ResolveOutputRelation \| 5815277 \| 2088787 \| 0.36 ResolveEncodersInUDF \| 14182892 \| 1045113 \| 0.07 HandleNullInputsForUDF \| 19850838 \| 1329528 \| 0.07 ResolveGenerate \| 5587345 \| 1953192 \| 0.35 ExtractGenerator \| 120378046 \| 3386286 \| 0.03 GlobalAggregates \| 16510455 \| 13553155 \| 0.82 ResolveAggregateFunctions \| 1041848509 \| 828049280 \| 0.79 Closes #32686 from sigmod/analyzer. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-01 11:39:42 +08:00
Wenchen Fan	bb2a0747d2	[SPARK-35578][SQL][TEST] Add a test case for a bug in janino ### What changes were proposed in this pull request? This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries. A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark. ### Why are the changes needed? make it easy for people to debug janino, as I'm not a janino expert. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #32716 from cloud-fan/janino. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:51:05 +09:00
Gengliang Wang	8e11f5f007	[SPARK-35576][SQL] Redact the sensitive info in the result of Set command ### What changes were proposed in this pull request? Currently, the results of the following SQL queries are not redacted: ``` SET [KEY]; SET; ``` For example: ``` scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show() +--------------------+------+ \| key\| value\| +--------------------+------+ \|javax.jdo.option....\|123456\| +--------------------+------+ scala> spark.sql("set javax.jdo.option.ConnectionPassword").show() +--------------------+------+ \| key\| value\| +--------------------+------+ \|javax.jdo.option....\|123456\| +--------------------+------+ scala> spark.sql("set").show() +--------------------+--------------------+ \| key\| value\| +--------------------+--------------------+ \|javax.jdo.option....\| 123456\| ``` We should hide the sensitive information and redact the query output. ### Why are the changes needed? Security. ### Does this PR introduce _any_ user-facing change? Yes, the sensitive information in the output of Set commands is redacted. ### How was this patch tested? Unit test Closes #32712 from gengliangwang/redactSet. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-31 14:50:18 -07:00
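The kind of redaction SPARK-35576 asks for can be sketched as a key-based filter over the `(key, value)` rows that a `SET` command returns. The regular expression and the placeholder text below are assumptions chosen for illustration; they are not taken from this commit, and Spark's configured redaction pattern may differ.

```scala
import scala.util.matching.Regex

object RedactSet {
  // Hypothetical pattern for sensitive keys (case-insensitive).
  val sensitiveKey: Regex = "(?i)secret|password|token".r
  // Hypothetical placeholder shown instead of the real value.
  val replacement: String = "*********(redacted)"

  // Replace the value of every row whose key looks sensitive.
  def redact(rows: Seq[(String, String)]): Seq[(String, String)] =
    rows.map { case (key, value) =>
      if (sensitiveKey.findFirstIn(key).isDefined) (key, replacement) else (key, value)
    }
}
```

Applied to the example above, the row for `javax.jdo.option.ConnectionPassword` would show the placeholder instead of `123456`, while unrelated keys pass through unchanged.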
Tengfei Huang	1603775934	[SPARK-35411][SQL][FOLLOWUP] Handle Currying Product while serializing TreeNode to JSON ### What changes were proposed in this pull request? Handle Currying Product while serializing TreeNode to JSON. While processing [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may get an assertion error for cases like Currying Product because the number of field names does not match the number of field values. Fall back to using reflection to get all the values for the constructor parameters when we meet such cases. ### Why are the changes needed? Avoid throwing an error while serializing TreeNode to JSON, and try to output as much information as possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT case added. Closes #32713 from ivoson/SPARK-35411-followup. Authored-by: Tengfei Huang <tengfei.h@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-31 22:15:26 +08:00
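The size mismatch behind SPARK-35411 comes from how Scala treats curried case classes: only the first parameter list contributes to `Product`, while the constructor takes every parameter. The sketch below shows this with a hypothetical case class `Curried`; the two helper functions are illustrative and are not Spark code.

```scala
object CurriedProduct {
  // Hypothetical case class with a second (curried) parameter list.
  final case class Curried(a: Int)(val b: String)

  // Number of fields exposed through the Product interface.
  def productFieldCount(p: Product): Int = p.productArity

  // Largest number of constructor parameters, found via Java reflection.
  def constructorParamCount(p: Product): Int =
    p.getClass.getConstructors.map(_.getParameterCount).max
}
```

For `Curried(1)("x")`, `productArity` reports a single field even though the constructor also takes `b`. Code that zips constructor parameter names with product values therefore sees different lengths, which is why a reflection-based fallback is useful.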
Yuming Wang	6cd6c438f2	[SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side ### What changes were proposed in this pull request? This PR adds a new rule that removes an outer join if it only has DISTINCT on the streamed side. For example: ```scala spark.range(200L).selectExpr("id AS a").createTempView("t1") spark.range(300L).selectExpr("id AS b").createTempView("t2") spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true) ``` Before this PR: ``` == Optimized Logical Plan == Aggregate [a#2L], [a#2L] +- Project [a#2L] +- Join LeftOuter, (a#2L = b#6L) :- Project [id#0L AS a#2L] : +- Range (0, 200, step=1, splits=Some(2)) +- Project [id#4L AS b#6L] +- Range (0, 300, step=1, splits=Some(2)) ``` After this PR: ``` == Optimized Logical Plan == Aggregate [a#2L], [a#2L] +- Project [id#0L AS a#2L] +- Range (0, 200, step=1, splits=Some(2)) ``` ### Why are the changes needed? Improve query performance. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) supports this feature: ![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #31908 from wangyum/SPARK-34808. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-05-31 18:14:15 +08:00
allisonwang-db	806da9d6fa	[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions ### What changes were proposed in this pull request? This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`. ### Why are the changes needed? Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up. For example: `SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2` During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`, but after the optimizer rule `PullupCorrelatedPredicates`, the children field will be used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression excludes the inner plan's output: `29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)` This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32687 from allisonwang-db/refactor-subquery-expr. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-31 04:57:24 +00:00
Yingyi Bu	5c8a141d03	[SPARK-35538][SQL] Migrate transformAllExpressions call sites to use transformAllExpressionsWithPruning ### What changes were proposed in this pull request? Added the following TreePattern enums: - EXCHANGE - IN_SUBQUERY_EXEC - UPDATE_FIELDS Migrated `transformAllExpressions` call sites to use `transformAllExpressionsWithPruning` ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Perf diff: Rule name \| Total Time (baseline) \| Total Time (experiment) \| experiment/baseline OptimizeUpdateFields \| 54646396 \| 27444424 \| 0.5 ReplaceUpdateFieldsExpression \| 24694303 \| 2087517 \| 0.08 Closes #32643 from sigmod/all_expressions. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2021-05-28 15:36:25 -07:00
Kousuke Saruta	b763db3efd	[SPARK-35194][SQL][FOLLOWUP] Recover build error with Scala 2.13 on GA ### What changes were proposed in this pull request? This PR fixes a build error with Scala 2.13 on GA. #32301 seems to have introduced this error. ### Why are the changes needed? To recover CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA Closes #32696 from sarutak/followup-SPARK-35194. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-05-29 00:11:16 +09:00
Karen Feng	e8631660ec	[SPARK-35194][SQL] Refactor nested column aliasing for readability ### What changes were proposed in this pull request? Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability. ### Why are the changes needed? Improves readability for future maintenance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32301 from karenfeng/refactor-nested-column-aliasing. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-28 13:18:44 +00:00
dgd-contributor	52a1f8c000	[SPARK-33428][SQL] Match the behavior of conv function to MySQL's ### What changes were proposed in this pull request? Spark's conv function comes from MySQL, so it is better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it. However, Spark behaves differently in two cases: - MySQL allows leading spaces but Spark does not. - If the input string is way too long, Spark fails with ArrayIndexOutOfBoundsException. This patch makes conv follow the MySQL behavior in those two cases: - conv allows leading spaces - conv returns the max unsigned long when the input string is way too long ### Why are the changes needed? To match the behavior of the conv function to that of MySQL, the (almost) only reference implementation in another DBMS. ### Does this PR introduce _any_ user-facing change? Yes, as pointed out above. ### How was this patch tested? Added tests. Closes #32684 from dgd-contributor/SPARK-33428. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-27 12:12:39 +00:00
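The two behaviors targeted by SPARK-33428 (skip leading spaces, saturate at the maximum unsigned 64-bit value) can be modeled with a short sketch. `parseUnsignedSaturating` is a hypothetical helper built on `BigInt` for clarity; it does not handle a leading minus sign and is not how Spark's `conv` is implemented.

```scala
object ConvSketch {
  // 2^64 - 1, the largest unsigned 64-bit value.
  val MaxUnsignedLong: BigInt = (BigInt(1) << 64) - 1

  // Parse `input` in the given radix: skip leading spaces, stop at the first
  // character that is not a digit in that radix, and clamp oversized values.
  def parseUnsignedSaturating(input: String, radix: Int): BigInt = {
    val digits = input
      .dropWhile(_ == ' ')
      .takeWhile(c => Character.digit(c, radix) >= 0)
    if (digits.isEmpty) BigInt(0)
    else BigInt(digits, radix).min(MaxUnsignedLong)
  }
}
```

Under these assumptions, `"  15"` in base 10 parses to 15 despite the leading spaces, and a hexadecimal string of twenty `F` characters clamps to 18446744073709551615 instead of failing.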
gengjiaan	3e190807bc	[SPARK-35057][SQL] Group exception messages in hive/thriftserver ### What changes were proposed in this pull request? This PR groups exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests; all original tests pass, to make sure it doesn't break any existing behavior. Closes #32646 from beliefer/SPARK-35057. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-27 07:31:14 +00:00
ulysses-you	dc7b5a99f0	[SPARK-35282][SQL] Support AQE side shuffled hash join formula using rule ### What changes were proposed in this pull request? The main code change is: * Change rule `DemoteBroadcastHashJoin` to `DynamicJoinSelection` and add shuffle hash join selection code. * Specify a join strategy hint `SHUFFLE_HASH` if AQE thinks a join can be converted to SHJ. * Skip the `preferSortMerge` config check on the AQE side if a join can be converted to SHJ. ### Why are the changes needed? Use AQE runtime statistics to decide if we can use shuffled hash join instead of sort merge join. Currently, the formula of shuffled hash join selection does not work due to the dynamic shuffle partition number. Add a new config spark.sql.adaptive.shuffledHashJoinLocalMapThreshold to decide if a join can be converted to shuffled hash join safely. ### Does this PR introduce _any_ user-facing change? Yes, add a new config. ### How was this patch tested? Add test. Closes #32550 from ulysses-you/SPARK-35282-2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-26 14:16:04 +00:00
Linhong Liu	af1dba7ca5	[SPARK-35440][SQL] Add function type to `ExpressionInfo` for UDF ### What changes were proposed in this pull request? Add the function type, such as "scala_udf", "python_udf", "java_udf", "hive", "built-in" to the `ExpressionInfo` for UDF. ### Why are the changes needed? Make the `ExpressionInfo` of UDF more meaningful ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing and newly added UT Closes #32587 from linhongliu-db/udf-language. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-26 04:40:53 +00:00
ulysses-you	631077db08	[SPARK-35455][SQL] Unify empty relation optimization between normal and AQE optimizer ### What changes were proposed in this pull request? * remove `EliminateUnnecessaryJoin`, using `AQEPropagateEmptyRelation` instead. * eliminate join, aggregate, limit, repartition, sort, generate which is beneficial. ### Why are the changes needed? Make `EliminateUnnecessaryJoin` available in more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #32602 from ulysses-you/SPARK-35455. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-25 08:59:59 +00:00
tanel.kiis@gmail.com	548e37b00b	[SPARK-33122][SQL][FOLLOWUP] Extend RemoveRedundantAggregates optimizer rule to apply to more cases ### What changes were proposed in this pull request? Addressed dongjoon-hyun's comments on the previous PR #30018. Extended the `RemoveRedundantAggregates` rule to remove redundant aggregations in even more queries. For example in ``` dataset .dropDuplicates() .groupBy('a) .agg(max('b)) ``` the `dropDuplicates` is not needed, because the result of `max` does not depend on duplicate values. ### Why are the changes needed? Improve performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #31914 from tanelk/SPARK-33122_redundant_aggs_followup. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-05-25 10:04:37 +09:00

5436 commits