ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gengliang Wang	ac1c6aa45c	[SPARK-35987][SQL] The ANSI flags of Sum and Avg should be kept after being copied ### What changes were proposed in this pull request? Make the ANSI flag part of the parameter list of the expressions `Sum` and `Average`, instead of fetching it from the session's SQLConf. ### Why are the changes needed? For Views, it is important to show consistent results even if the ANSI configuration is different in the running session. This is why many expressions like 'Add'/'Divide' make the ANSI flag part of their case class parameter list. We should make it consistent for the expressions `Sum` and `Average`. ### Does this PR introduce _any_ user-facing change? Yes, the `Sum` and `Average` inside a View always behave the same, independent of the ANSI mode SQL configuration in the current session. ### How was this patch tested? Existing UT Closes #33186 from gengliangwang/sumAndAvg. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `51103cdcdd`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-05 12:34:39 +08:00
Wenchen Fan	ec84982191	[SPARK-35940][SQL] Refactor EquivalentExpressions to make it more efficient ### What changes were proposed in this pull request? This PR uses 2 ideas to make `EquivalentExpressions` more efficient: 1. do not keep all the equivalent expressions, we only need a count 2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs` This PR also fixes several small bugs (exposed by the refactoring), please see PR comments. ### Why are the changes needed? code cleanup and small perf improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33142 from cloud-fan/codegen. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit `e6ce220690`) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-07-03 08:28:59 -07:00
Wenchen Fan	c1d8178817	[SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing ### What changes were proposed in this pull request? By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions. However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless. This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config. ### Why are the changes needed? AQE is default on now, we should make the perf better in the default case. ### Does this PR introduce _any_ user-facing change? yes, a new config. ### How was this patch tested? new tests Closes #33172 from cloud-fan/aqe2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `0c9c8ff569`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-02 16:07:46 +08:00
Karen Feng	1fda011d71	[SPARK-35955][SQL] Check for overflow in Average in ANSI mode ### What changes were proposed in this pull request? Fixes decimal overflow issues for decimal average in ANSI mode, so that overflows throw an exception rather than returning null. ### Why are the changes needed? Query: ``` scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> spark.conf.set("spark.sql.ansi.enabled", true) scala> val df = Seq( \| (BigDecimal("10000000000000000000"), 1), \| (BigDecimal("10000000000000000000"), 1), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2), \| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum") df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int] scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(mean("decNum")) df2: org.apache.spark.sql.DataFrame = [avg(decNum): decimal(38,22)] scala> df2.show(40,false) ``` Before: ``` +-----------+ \|avg(decNum)\| +-----------+ \|null \| +-----------+ ``` After: ``` 21/07/01 19:48:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24) java.lang.ArithmeticException: Overflow in sum of decimals. 
at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInSumOfDecimalError(QueryExecutionErrors.scala:162) at org.apache.spark.sql.errors.QueryExecutionErrors.overflowInSumOfDecimalError(QueryExecutionErrors.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:499) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:502) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #33177 from karenfeng/SPARK-35955. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-02 12:41:24 +08:00
Rahul Mahadev	47485a3c2d	[SPARK-35897][SS] Support user defined initial state with flatMapGroupsWithState in Structured Streaming ### What changes were proposed in this pull request? This PR aims to add support for specifying a user defined initial state for arbitrary structured streaming stateful processing using [flat]MapGroupsWithState operator. ### Why are the changes needed? Users can load previous state of their stateful processing as an initial state instead of redoing the entire processing once again. ### Does this PR introduce _any_ user-facing change? Yes this PR introduces new API ``` def mapGroupsWithState[S: Encoder, U: Encoder]( timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S])( func: (K, Iterator[V], GroupState[S]) => U): Dataset[U] def flatMapGroupsWithState[S: Encoder, U: Encoder]( outputMode: OutputMode, timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S])( func: (K, Iterator[V], GroupState[S]) => Iterator[U]) ``` ### How was this patch tested? Through unit tests in FlatMapGroupsWithStateSuite Closes #33093 from rahulsmahadev/flatMapGroupsWithState. Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-02 11:53:17 +08:00
Anton Okolnychyi	fceabe2372	[SPARK-35779][SQL] Dynamic filtering for Data Source V2 ### What changes were proposed in this pull request? This PR implemented the proposal per [design doc](https://docs.google.com/document/d/1RfFn2e9o_1uHJ8jFGsSakp-BZMizX1uRrJSybMe2a6M) for SPARK-35779. ### Why are the changes needed? Spark supports dynamic partition filtering that enables reusing parts of the query to skip unnecessary partitions in the larger table during joins. This optimization has proven to be beneficial for star-schema queries which are common in the industry. Unfortunately, dynamic pruning is currently limited to partition pruning during joins and is only supported for built-in v1 sources. As more and more Spark users migrate to Data Source V2, it is important to generalize dynamic filtering and expose it to all v2 connectors. Please, see the design doc for more information on this effort. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a new optional mix-in interface for `Scan` in Data Source V2. ### How was this patch tested? This PR comes with tests. Closes #32921 from aokolnychyi/dynamic-filtering-wip. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-07-01 17:00:12 -07:00
Gengliang Wang	a643076d4e	[SPARK-35975][SQL] New configuration `spark.sql.timestampType` for the default timestamp type ### What changes were proposed in this pull request? Add a new configuration `spark.sql.timestampType`, which configures the default timestamp type of Spark SQL, including SQL DDL and Cast clause. Setting the configuration as `TIMESTAMP_NTZ` will use `TIMESTAMP WITHOUT TIME ZONE` as the default type while putting it as `TIMESTAMP_LTZ` will use `TIMESTAMP WITH LOCAL TIME ZONE`. The default value of the new configuration is TIMESTAMP_LTZ, which is consistent with previous Spark releases. ### Why are the changes needed? A new configuration for switching the default timestamp type as timestamp without time zone. ### Does this PR introduce _any_ user-facing change? No, it's a new feature. ### How was this patch tested? Unit test Closes #33176 from gengliangwang/newTsTypeConf. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-07-01 23:25:18 +03:00
SaurabhChawla	ca1217667c	[SPARK-35756][SQL] unionByName supports struct having same col names but different sequence ### What changes were proposed in this pull request? unionByName does not support structs that have the same col names but in a different sequence ``` val df1 = Seq((1, Struct1(1, 2))).toDF("a", "b") val df2 = Seq((1, Struct2(1, 2))).toDF("a", "b") val unionDF = df1.unionByName(df2) ``` It gives the exception `org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table; 'Union false, false :- LocalRelation [_1#38, _2#39] +- LocalRelation _1#45, _2#46` In this case the col names are the same, so unionByName should check within the struct and, if the col names match, it should not throw this exception but perform the union. After the fix we get the result ``` val unionDF = df1.unionByName(df2) scala> unionDF.show +---+------+ \| a\| b\| +---+------+ \| 1\|{1, 2}\| \| 1\|{2, 1}\| +---+------+ ``` ### Why are the changes needed? unionByName performs the union based on column names. For structs, the scenario where all the column names are the same but the sequence is different was missing, so this PR adds that functionality. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test and also tested through spark shell Closes #32972 from SaurabhChawla100/SPARK-35756. Authored-by: SaurabhChawla <s.saurabhtim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 17:37:09 +00:00
Gengliang Wang	3acc4b973b	[SPARK-35971][SQL] Rename the type name of TimestampNTZType as "timestamp_ntz" ### What changes were proposed in this pull request? Rename the type name string of TimestampNTZType from "timestamp without time zone" to "timestamp_ntz". ### Why are the changes needed? This is to make the column header shorter and simpler. Snowflake and Flink use a similar approach: https://docs.snowflake.com/en/sql-reference/data-types-datetime.html https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/concepts/timezone/ ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit tests Closes #33173 from gengliangwang/reviseTypeName. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-01 20:50:19 +08:00
Linhong Liu	3c683434fa	[SPARK-35686][SQL] Not allow using auto-generated alias when creating view ### What changes were proposed in this pull request? As described in #32831, Spark has compatibility issues when querying a view created by an older version. The root cause is that Spark changed the auto-generated alias name. To avoid this in the future, we could ask the user to specify explicit column names when creating a view. ### Why are the changes needed? Avoid compatibility issues when querying a view ### Does this PR introduce _any_ user-facing change? Yes. Users will get an error when running the query below after this change ``` CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t ``` ### How was this patch tested? not yet Closes #32832 from linhongliu-db/SPARK-35686-no-auto-alias. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 12:47:38 +00:00
Linhong Liu	0c34b96541	[SPARK-35685][SQL] Prompt recreating the view when there is an incompatible schema issue ### What changes were proposed in this pull request? If the user creates a view in 2.4 and reads it in 3.1/3.2, there will be an incompatible schema issue. So this PR adds the view DDL to the error message to prompt the user to recreate the view and fix the incompatibility. For example: ```sql -- create view in 2.4 CREATE TABLE IF NOT EXISTS t USING parquet AS SELECT '1' as a, '20210420' as b CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t -- select view in master SELECT * FROM v ``` Then we will get the error below: ``` cannot resolve '`to_date(spark_catalog.default.t.b, 'yyyyMMdd')`' given input columns: [a, to_date(b, yyyyMMdd)]; ``` ### Why are the changes needed? Improve the error message ### Does this PR introduce _any_ user-facing change? Yes, the error message will change ### How was this patch tested? newly added test case Closes #32831 from linhongliu-db/SPARK-35685-view-compatible. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 09:45:14 +00:00
allisonwang-db	f281736fbd	[SPARK-35618][SQL] Resolve star expressions in subqueries using outer query plans ### What changes were proposed in this pull request? This PR supports resolving star expressions in subqueries using outer query plans. ### Why are the changes needed? Currently, Spark can only resolve star expressions using the inner query plan when resolving subqueries. Instead, it should also be able to resolve star expressions using the outer query plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests Closes #32787 from allisonwang-db/spark-35618-resolve-star-in-subquery. Lead-authored-by: allisonwang-db <allison.wang@databricks.com> Co-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 09:22:55 +00:00
Gengliang Wang	f2492772ba	[SPARK-35963][SQL] Rename TimestampWithoutTZType to TimestampNTZType ### What changes were proposed in this pull request? Rename TimestampWithoutTZType to TimestampNTZType ### Why are the changes needed? The type name of `TimestampWithoutTZType` is verbose. Rename it as `TimestampNTZType` so that 1. it is easier to read and type. 2. As we have the function to_timestamp_ntz, this makes the names consistent. 3. We will introduce a new SQL configuration `spark.sql.timestampType` for the default timestamp type. The configuration values can be "TIMESTAMP_NTZ" or "TIMESTAMP_LTZ" for simplicity. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Run `git grep -i WithoutTZ` and there is no result. And CI tests. Closes #33167 from gengliangwang/rename. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 08:49:15 +00:00
gengjiaan	5d74ace648	[SPARK-35065][SQL] Group exception messages in spark/sql (core) ### What changes were proposed in this pull request? This PR groups all exception messages in `sql/core/src/main/scala/org/apache/spark/sql`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32958 from beliefer/SPARK-35065. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-01 02:38:06 +00:00
ulysses-you	d46c1e38ec	[SPARK-35725][SQL] Support optimize skewed partitions in RebalancePartitions ### What changes were proposed in this pull request? * Add a new rule `OptimizeSkewInRebalancePartitions` in AQE `queryStageOptimizerRules` * Add a new config `spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled` to decide whether to enable the new rule The new rule `OptimizeSkewInRebalancePartitions` only handles two shuffle origins, `REBALANCE_PARTITIONS_BY_NONE` and `REBALANCE_PARTITIONS_BY_COL`, for the data skew issue. It re-uses the existing config `ADVISORY_PARTITION_SIZE_IN_BYTES` to decide what the partition size should be. ### Why are the changes needed? Currently, we don't support expanding partitions dynamically in AQE, which is not friendly to some data skew jobs. Let's say we have a simple query: ``` SELECT /*+ REBALANCE(col) */ * FROM table ``` The column `col` is skewed, so some shuffle partitions would handle much more data than others. If we haven't introduced an extra shuffle, we can optimize this case by expanding partitions in AQE. ### Does this PR introduce _any_ user-facing change? Yes, a new config ### How was this patch tested? Add test Closes #32883 from ulysses-you/expand-partition. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-30 18:04:50 +00:00
Gengliang Wang	733e85f1f4	[SPARK-35953][SQL] Support extracting date fields from timestamp without time zone ### What changes were proposed in this pull request? Support extracting date fields from timestamp without time zone, which includes: - year - month - day - year of week - week - day of week - quarter - day of month - day of year ### Why are the changes needed? Support basic operations for the new timestamp type. ### Does this PR introduce _any_ user-facing change? No, the timestamp without time zone type is not released yet. ### How was this patch tested? Unit tests Closes #33156 from gengliangwang/dateField. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-07-01 00:44:48 +08:00
Angerszhuuuu	2febd5c3f0	[SPARK-35735][SQL] Take into account day-time interval fields in cast ### What changes were proposed in this pull request? Take day-time interval fields into account in cast. ### Why are the changes needed? To conform to the SQL standard. ### Does this PR introduce _any_ user-facing change? A user can use `cast(str, DayTimeInterval(DAY, HOUR))`, for instance. ### How was this patch tested? Added UT. Closes #32943 from AngersZhuuuu/SPARK-35735. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-30 16:05:04 +03:00
Gengliang Wang	e88aa49287	[SPARK-35932][SQL] Support extracting hour/minute/second from timestamp without time zone ### What changes were proposed in this pull request? Support extracting hour/minute/second fields from timestamp without time zone values. In detail, the following syntaxes are supported: - extract [hour \| minute \| second] from timestampWithoutTZ - date_part('[hour \| minute \| second]', timestampWithoutTZ) - hour(timestampWithoutTZ) - minute(timestampWithoutTZ) - second(timestampWithoutTZ) ### Why are the changes needed? Support basic operations for the new timestamp type. ### Does this PR introduce _any_ user-facing change? No, the timestamp without time zone type is not released yet. ### How was this patch tested? Unit test Closes #33136 from gengliangwang/field. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-30 19:36:47 +08:00
Karen Feng	e3bd817d65	[SPARK-34920][CORE][SQL] Add error classes with SQLSTATE ### What changes were proposed in this pull request? Unifies exceptions thrown from Spark under a single base trait `SparkError`, which unifies: - Error classes - Parametrized error messages - SQLSTATE, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Add-error-IDs-td31126.html. ### Why are the changes needed? - Adding error classes creates a consistent label for exceptions, even as error messages change - Creating a single, centralized source-of-truth for parametrized error messages improves auditing for error message quality - Adding SQLSTATE helps ODBC/JDBC users receive standardized error codes ### Does this PR introduce _any_ user-facing change? Yes, changes ODBC experience by: - Adding error classes to error messages - Adding SQLSTATE to TStatus ### How was this patch tested? Unit tests, as well as local tests with PyODBC. Closes #32850 from karenfeng/SPARK-34920. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-30 09:22:02 +00:00
Gengliang Wang	ad4b6796f6	[SPARK-35937][SQL] Extracting date field from timestamp should work in ANSI mode ### What changes were proposed in this pull request? Add a new ANSI type coercion rule: when getting a date field from a Timestamp column, cast the column as Date type. This is Spark's current hack to make the implementation simple. In the default type coercion rules, the implicit cast rule does the work. However, the ANSI implicit cast rule doesn't allow converting Timestamp type to Date type, so we need this additional rule to make sure the date field extraction from Timestamp columns works. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No, the new type coercion rules are not released yet. ### How was this patch tested? Unit test Closes #33138 from gengliangwang/fixGetDateField. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-30 13:53:51 +08:00
Liang-Chi Hsieh	064230de97	[SPARK-35829][SQL] Clean up evaluates subexpressions and add more flexibility to evaluate particular subexpression ### What changes were proposed in this pull request? This patch refactors the evaluation of subexpressions. There are two changes: 1. Clean up subexpression code after evaluation to avoid duplicate evaluation. 2. Evaluate all children subexpressions when evaluating a subexpression. ### Why are the changes needed? Currently `subexpressionEliminationForWholeStageCodegen` returns the generated code of subexpressions. The caller simply puts the code into its code block. We need more flexible evaluation here. For example, for the Filter operator's subexpression evaluation, we may need to evaluate a particular subexpression for one predicate. The current approach cannot satisfy this requirement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32980 from viirya/subexpr-eval. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-29 22:14:37 -07:00
Yuming Wang	4a17e7a5ae	[SPARK-35906][SQL] Remove order by if the maximum number of rows less than or equal to 1 ### What changes were proposed in this pull request? This PR removes order by if the maximum number of rows less than or equal to 1. For example: ```scala spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost") ``` Before this pr: ``` == Optimized Logical Plan == Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B) +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) ``` After this pr: ``` == Optimized Logical Plan == Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1) +- Project, Statistics(sizeInBytes=20.0 B) +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5) ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #33100 from wangyum/SPARK-35906. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-29 11:04:54 -07:00
Gengliang Wang	78e6263cce	[SPARK-35927][SQL] Remove type collection AllTimestampTypes ### What changes were proposed in this pull request? Replace the type collection `AllTimestampTypes` with the new data type `AnyTimestampType` ### Why are the changes needed? As discussed in https://github.com/apache/spark/pull/33115#discussion_r659866760, it is more convenient to have a new data type "AnyTimestampType" instead of using type collection `AllTimestampTypes`: 1. simplify the pattern match 2. In the default type coercion rules, when implicit casting a type to a TypeCollection type, Spark chooses the first convertible data type as the result. If we are going to make the default timestamp type configurable, having AnyTimestampType is better ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #33129 from gengliangwang/allTimestampTypes. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-29 16:57:08 +08:00
Gengliang Wang	7635114d53	[SPARK-35916][SQL] Support subtraction among Date/Timestamp/TimestampWithoutTZ ### What changes were proposed in this pull request? Support the following operations: - TimestampWithoutTZ - Date - Date - TimestampWithoutTZ - TimestampWithoutTZ - Timestamp - Timestamp - TimestampWithoutTZ - TimestampWithoutTZ - TimestampWithoutTZ For subtraction between `TimestampWithoutTZ` and `Timestamp`, the `Timestamp` column is cast as TimestampWithoutTZType. ### Why are the changes needed? Support basic subtraction among Date/Timestamp/TimestampWithoutTZ. ### Does this PR introduce _any_ user-facing change? No, the timestamp without time zone type is not released yet. ### How was this patch tested? Unit tests Closes #33115 from gengliangwang/subtractTimestampWithoutTz. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-29 14:45:09 +08:00
Anton Okolnychyi	8a21d2dcfe	[SPARK-35899][SQL][FOLLOWUP] Utility to convert connector expressions to Catalyst ### What changes were proposed in this pull request? This PR addresses post-review comments on PR #33096: - removes `private[sql]` modifier - removes the option to pass a resolver to simplify the API ### Why are the changes needed? These changes are needed to simplify the utility API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33120 from aokolnychyi/spark-35899-follow-up. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-28 22:22:07 -07:00
Kousuke Saruta	880bbd6aaa	[SPARK-35876][SQL] ArraysZip should retain field names to avoid being re-written by analyzer/optimizer ### What changes were proposed in this pull request? This PR fixes an issue that field names of structs generated by `arrays_zip` function could be unexpectedly re-written by analyzer/optimizer. Here is an example. ``` val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped") df.printSchema root \|-- zipped: array (nullable = true) \| \|-- element: struct (containsNull = false) \| \| \|-- a1: integer (nullable = true) // OK. a1 is expected name \| \| \|-- b1: integer (nullable = true) // OK. b1 is expected name df.explain == Physical Plan == *(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12] // Not OK. field names are re-written as _1 and _2 respectively df.write.parquet("/tmp/test.parquet") val df2 = spark.read.parquet("/tmp/test.parquet") df2.printSchema root \|-- zipped: array (nullable = true) \| \|-- element: struct (containsNull = true) \| \| \|-- _1: integer (nullable = true) // Not OK. a1 is expected but got _1 \| \| \|-- _2: integer (nullable = true) // Not OK. b1 is expected but got _2 ``` This issue happens when aliases are eliminated by `AliasHelper.replaceAliasButKeepName` or `AliasHelper.trimNonTopLevelAliases` called via analyzer/optimizer `b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L883)` `b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L3759)` I investigated functions which can be affected this issue but I found only `arrays_zip` so far. To fix this issue, this PR changes the definition of `ArraysZip` to retain field names to avoid being re-written by analyzer/optimizer. ### Why are the changes needed? This is apparently a bug. ### Does this PR introduce _any_ user-facing change? No. 
After this change, the field names are no longer re-written but it should be expected behavior for users. ### How was this patch tested? New tests. Closes #33106 from sarutak/arrays-zip-retain-names. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-29 12:28:41 +09:00
Terry Kim	620fde4767	[SPARK-34302][SQL] Migrate ALTER TABLE ... CHANGE COLUMN command to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate the following `ALTER TABLE ... CHANGE COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce _any_ user-facing change? After this PR, the above `ALTER TABLE ... CHANGE COLUMN` commands will have a consistent resolution behavior. ### How was this patch tested? Updated existing tests. Closes #33113 from imback82/alter_change_column. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-29 02:53:05 +00:00
PengLei	8fbbd2e6d7	[SPARK-33898][SQL] Support SHOW CREATE TABLE In V2 ### What changes were proposed in this pull request? 1. Implement V2 execution node `ShowCreateTableExec` similar to V1 `ShowCreateTableCommand` 2. `SHOW CREATE TABLE XXX AS SERDE` is not supported ### Why are the changes needed? [SPARK-33898](https://issues.apache.org/jira/browse/SPARK-33898) ### Does this PR introduce _any_ user-facing change? Yes. Users can now execute the `SHOW CREATE TABLE` command on V2 tables ### How was this patch tested? Add two UT tests 1. ./dev/scalastyle 2. run test DataSourceV2SQLSuite Closes #32931 from Peng-Lei/SPARK-33898. Lead-authored-by: PengLei <18066542445@189.cn> Co-authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Kent Yao <yao@apache.org>	2021-06-29 10:14:46 +08:00
PengLei	356aef48b8	[SPARK-35728][SPARK-35778][SQL][TESTS] Check multiply/divide of day-time and year-month interval of any fields by a numeric ### What changes were proposed in this pull request? [SPARK-35728](https://issues.apache.org/jira/browse/SPARK-35728): Add test case to check multiply/divide of day-time intervals of any fields by numeric [SPARK-35778](https://issues.apache.org/jira/browse/SPARK-35778): Add test case to check multiply/divide of year-month intervals of any fields by numeric ### Why are the changes needed? Improve test coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut tests Lead-authored-by: Lei Peng <peng.8leigmail.com> Co-authored-by: AngersZhuuuu <angers.zhugmail.com> Closes #33080 from Peng-Lei/SPARK-35728-35778. Lead-authored-by: PengLei <peng.8lei@gmail.com> Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-28 13:35:54 +03:00
Yuming Wang	108635af17	Revert "[SPARK-35904][SQL] Collapse above RebalancePartitions" This reverts commit `def29e50`	2021-06-28 16:23:23 +08:00
dgd-contributor	1c81ad2029	[SPARK-35064][SQL] Group error in spark-catalyst ### What changes were proposed in this pull request? This PR groups exception messages in sql/catalyst/src/main/scala/org/apache/spark/sql (except catalyst) ### Why are the changes needed? It will largely help with standardization of error messages and their maintenance. ### Does this PR introduce any user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32916 from dgd-contributor/SPARK-35064_catalyst_group_error. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-28 07:21:24 +00:00
Liang-Chi Hsieh	b89cd8d75a	[SPARK-35886][SQL] PromotePrecision should not overwrite genCode ### What changes were proposed in this pull request? This patch fixes `PromotePrecision`, which overwrites `genCode`, the place where subexpression elimination should happen. ### Why are the changes needed? `PromotePrecision` overwrites `genCode` where subexpression elimination should happen. So if it is the topmost expression of a subexpression, it is never replaced. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test. Closes #33103 from viirya/fix-precision. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-26 23:19:58 -07:00
Yuming Wang	def29e5075	[SPARK-35904][SQL] Collapse above RebalancePartitions ### What changes were proposed in this pull request? 1. Make `RebalancePartitions` extend `RepartitionOperation`. 2. Make `CollapseRepartition` support `RebalancePartitions`. ### Why are the changes needed? `CollapseRepartition` can optimize `RebalancePartitions` if possible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #33099 from wangyum/SPARK-35904. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-26 21:19:58 -07:00
Gengliang Wang	645fb59652	[SPARK-35895][SQL] Support subtracting Intervals from TimestampWithoutTZ ### What changes were proposed in this pull request? Support the following operation: - TimestampWithoutTZ - Year-Month interval The following operations are actually supported in https://github.com/apache/spark/pull/33076/. This PR adds end-to-end tests for them: - TimestampWithoutTZ - Calendar interval - TimestampWithoutTZ - Daytime interval ### Why are the changes needed? Support subtracting all 3 interval types from a timestamp without time zone ### Does this PR introduce _any_ user-facing change? No, the timestamp without time zone type is not released yet. ### How was this patch tested? Unit tests Closes #33086 from gengliangwang/subtract. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-26 13:19:00 +03:00
Anton Okolnychyi	63cd1314d2	[SPARK-35899][SQL] Utility to convert connector expressions to Catalyst ### What changes were proposed in this pull request? This PR adds a utility to convert public connector expressions to Catalyst expressions. Notable differences: - Switched to `QueryCompilationErrors` from an explicit `AnalysisException`. - Decoupled the resolving logic for v2 references into separate methods to use in other places. ### Why are the changes needed? These changes are needed as more and more places require this logic and it is better to implement it in a single place. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33096 from aokolnychyi/spark-35899. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-25 18:04:07 -07:00
Gengliang Wang	9814cf8853	[SPARK-35889][SQL] Support adding TimestampWithoutTZ with Interval types ### What changes were proposed in this pull request? Support the following operations: - TimestampWithoutTZ + Calendar interval - TimestampWithoutTZ + Year-Month interval - TimestampWithoutTZ + Daytime interval ### Why are the changes needed? Support basic '+' operator for timestamp without time zone type. ### Does this PR introduce _any_ user-facing change? No, the timestamp without time zone type is not released yet. ### How was this patch tested? Unit tests Closes #33076 from gengliangwang/addForNewTS. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-25 19:58:42 +08:00
Terry Kim	f1ad34558c	[SPARK-35883][SQL] Migrate ALTER TABLE RENAME COLUMN command to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate the following `ALTER TABLE ... RENAME COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce _any_ user-facing change? After this PR, the above `ALTER TABLE ... RENAME COLUMN` commands will have a consistent resolution behavior. ### How was this patch tested? Updated existing tests. Closes #33066 from imback82/alter_rename. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-25 05:53:56 +00:00
Kousuke Saruta	156b9b5d14	[SPARK-35736][SPARK-35774][SQL][FOLLOWUP] Prohibit to specify the same units for FROM and TO with unit-to-unit interval syntax ### What changes were proposed in this pull request? This PR changes the behavior of unit-to-unit interval syntax to prohibit specifying the same unit for both FROM and TO. ### Why are the changes needed? For ANSI compliance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33057 from sarutak/prohibit-unit-pattern. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 23:13:31 +03:00
Adam Binford	14b1836313	[SPARK-35290][SQL] Append new nested struct fields rather than sort for unionByName with null filling ### What changes were proposed in this pull request? This PR changes the unionByName with null filling logic to append new nested struct fields from the right side of the union to the schema versus sorting fields alphabetically. It removes the need to use UpdateField expressions, and just directly projects new nested structs from each side of the union with the correct schema. This changes the union'd schema from being alphabetically sorted previously to now "left dominant", where the fields from the left side of the union are included and then the missing ones from the right are added in the same order found originally. ### Why are the changes needed? Certain nested structs would cause unionByName with null filling to error out due to part of the logic for rewriting the expression tree to sort the structs. ### Does this PR introduce _any_ user-facing change? Yes, nested struct fields will be in a different order after unionByName with null filling than before, though shouldn't cause much effective difference. ### How was this patch tested? Updated existing tests based on the new StructField ordering and added a new test for the case that was broken originally. Closes #33040 from Kimahriman/union-by-name-struct-order. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-24 09:21:30 -07:00
Terry Kim	5b4816cfc8	[SPARK-34320][SQL] Migrate ALTER TABLE DROP COLUMNS commands to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate the following `ALTER TABLE ... DROP COLUMNS` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce _any_ user-facing change? After this PR, the above `ALTER TABLE ... DROP COLUMNS` commands will have a consistent resolution behavior. ### How was this patch tested? Updated existing tests. Closes #32854 from imback82/alter_alternative. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 14:59:25 +00:00
Angerszhuuuu	de35675c61	[SPARK-35871][SQL] Literal.create(value, dataType) should support fields ### What changes were proposed in this pull request? Currently, Literal.create(data, dataType) produces incorrect results when converting a Period to YearMonthIntervalType or a Duration to DayTimeIntervalType. If the data is a Period/Duration, it creates a converter for the default YearMonthIntervalType/DayTimeIntervalType, so the result is not correct. This PR fixes this bug. ### Why are the changes needed? Fix a bug when using Literal.create() ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33056 from AngersZhuuuu/SPARK-35871. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 17:36:48 +03:00
Max Gekk	d40a1a2552	Revert "[SPARK-35728][SQL][TESTS] Check multiply/divide of day-time intervals of any fields by numeric" ### What changes were proposed in this pull request? Revert `8a1995f936` ### Why are the changes needed? The merged test doesn't check different interval fields, actually. Need to apply this https://github.com/apache/spark/pull/33056 first of all. ### Does this PR introduce _any_ user-facing change? No. This is tests. ### How was this patch tested? By existing GAs. Closes #33060 from MaxGekk/revert-Peng-Lei-SPARK-35728. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 14:36:07 +03:00
Max Gekk	345d3db83d	Revert "[SPARK-35778][SQL][TESTS] Check multiply/divide of year month interval of any fields by numeric" ### What changes were proposed in this pull request? Revert `3904c0edba` ### Why are the changes needed? The merged test doesn't check different interval fields, actually. Need to apply this https://github.com/apache/spark/pull/33056 first of all. ### Does this PR introduce _any_ user-facing change? No. This is tests. ### How was this patch tested? By existing GAs. Closes #33059 from MaxGekk/revert-Peng-Lei-SPARK-35778. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 14:34:42 +03:00
PengLei	3904c0edba	[SPARK-35778][SQL][TESTS] Check multiply/divide of year month interval of any fields by numeric ### What changes were proposed in this pull request? Check multiply/divide of year-month intervals of any fields by numeric. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Expanded existed test cases. Closes #33051 from Peng-Lei/SPARK-35778. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 12:25:06 +03:00
PengLei	8a1995f936	[SPARK-35728][SQL][TESTS] Check multiply/divide of day-time intervals of any fields by numeric ### What changes were proposed in this pull request? 1. The testcase is just cover the DayTimeIntervalType() / numeric 2. Add testcase for following intervals / numeric: INTERVAL DAY INTERVAL DAY TO HOUR INTERVAL DAY TO MINUTE INTERVAL HOUR INTERVAL HOUR TO MINUTE INTERVAL HOUR TO SECOND INTERVAL MINUTE INTERVAL MINUTE TO SECOND INTERVAL SECOND ### Why are the changes needed? Add testcase coverage. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existed testcase Closes #33014 from Peng-Lei/SPARK-35728. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 12:11:47 +03:00
ulysses-you	1295e8876c	[SPARK-35786][SQL] Add a new operator to distingush if AQE can optimize safely ### What changes were proposed in this pull request? * Add a new repartition operator `RebalanceRepartition`. * Support a new hint `REBALANCE` After this patch, users can run this query: ```sql SELECT /*+ REBALANCE(c) */ * FROM t ``` ### Why are the changes needed? Add a new hint to distinguish whether we can optimize safely. This new hint lets AQE optimize with `CustomShuffleReaderExec` safely. Currently, AQE can only coalesce shuffle partitions but cannot expand shuffle partitions due to the semantics of output partitioning. Let's say we have a query: ```sql SELECT /*+ REPARTITION(col) */ * FROM t ``` AQE cannot expand the shuffle partitions even if `col` is skewed because expanding shuffle partitions will break the hashed output partitioning of `RepartitionByExpression`. But if the query uses `REPARTITION_BY_AQE`, AQE can optimize it without considering the semantics of output partitioning. ### Does this PR introduce _any_ user-facing change? Yes, a new hint. ### How was this patch tested? Add test. Closes #32932 from ulysses-you/SPARK-35786. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 09:04:38 +00:00
dgd-contributor	5b9c5c126f	[SPARK-35841][SQL] Casting string to decimal type doesn't work if the sum of the digits is greater than 38 ### What changes were proposed in this pull request? Since Spark 3.1.1, NULL is returned when casting a string with many decimal places to a decimal type. If the sum of the digits before and after the decimal point is less than 39, a value is returned. From 39 digits, however, NULL is returned. This worked until Spark 3.0.X. Code to reproduce: A string with 2 digits before the decimal point and 37 digits after the decimal point returns null ``` val data = Seq( "28.9259999999999983799625624669715762138", "28.925999999999998379962562466971576213", "2.9259999999999983799625624669715762138" ) val df = data.toDF("num") df.withColumn("numConverted", col("num").cast("decimal(38, 5)")).show() ``` Before this pull request, the result is +----------------------+---------------+ \| num \|numConverted\| +----------------------+---------------+ \|28.92599999999999...\| null\| \|28.92599999999999...\| 28.92600\| \|2.925999999999998...\| 2.92600\| +----------------------+---------------+ The correct result should be +----------------------+---------------+ \| num \|numConverted\| +----------------------+---------------+ \|28.92599999999999...\| 28.92600\| \|28.92599999999999...\| 28.92600\| \|2.925999999999998...\| 2.92600\| +----------------------+---------------+ The problem has occurred since https://issues.apache.org/jira/browse/SPARK-32706 because the fast-fail check compares the total precision length, when it should only check the length of the whole-number part of the input value ### Why are the changes needed? correctness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test added Closes #33011 from dgd-contributor/SPARK-35841_castStringToDecimalTypeError. 
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 16:44:58 +08:00
ulysses-you	ff9ba89dcb	[SPARK-35282][SQL][FOLLOWUP] Simplify condition code of shuffled hash join ### What changes were proposed in this pull request? Simplify the condition code which is introduced by [SPARK-35282](https://issues.apache.org/jira/browse/SPARK-35282). ### Why are the changes needed? Reduce the code size and make code more readable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #33046 from ulysses-you/simplify-shj. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 08:42:24 +00:00
Angerszhuuuu	5e77ca8071	[SPARK-35768][SQL] Take into account year-month interval fields in cast ### What changes were proposed in this pull request? Support take into account year-month interval field in cast ##### Rule cast to target YearMonthIntervalType \| string \| demo \| strict target type \| months \| \|---\|---\|---\|---\| \| [+\\|-]y-m \| 1-1 \| YearMonthIntervalType(YEAR. MONTH) \| 13 \| \| [+\\|-]y\| 1 \| YearMonthIntervalType(YEAR. YEAR) \| 12 \| \| [+\\|-]m \| 1 \| YearMonthIntervalType(MONTH. MONTH) \| 1 \| \| INTERVAL [+\\|-]'[+\\|-]y-m' YEAR TO MONTH \| interval '1-1' year to month \| YearMonthIntervalType(YEAR. MONTH) \| 13 \| \| INTERVAL [+\\|-]'[+\\|-]m' MONTH \| interval '1' month \| YearMonthIntervalType(MONTH. MONTH) \| 1 \| \| INTERVAL [+\\|-]'[+\\|-]y' YEAR \| interval '1' year \| YearMonthIntervalType(YEAR.YEAR) \| 12 \| ### Why are the changes needed? Support take into account year-month interval field in cast ### Does this PR introduce _any_ user-facing change? user can use `cast(str, YearMonthInterval(YEAR, YEAR))` etc ### How was this patch tested? Added UT Closes #32940 from AngersZhuuuu/SPARK-35768. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 07:48:47 +00:00
PengLei	61bd036cb9	[SPARK-35852][SQL] Use DateAdd instead of TimeAdd for DateType +/- INTERVAL DAY ### What changes were proposed in this pull request? We use `DateAdd` to impl `DateType` `+`/`-` `INTERVAL DAY` ### Why are the changes needed? To improve the impl of `DateType` `+`/`-` `INTERVAL DAY` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut test Closes #33033 from Peng-Lei/SPARK-35852. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 08:47:29 +03:00

5537 commits