Commit graph

4402 commits

Author SHA1 Message Date
Wenchen Fan 8643e5d9c5 [SPARK-31171][SQL][FOLLOWUP] update document
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/27936 to update the documentation.

### Why are the changes needed?

Correct the documentation.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27950 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:29:31 +09:00
Kent Yao 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
### What changes were proposed in this pull request?

The pattern `''` means a literal `'`:

```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
Commit 0946a9514f missed this case and this PR adds it back.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #27949 from yaooqinn/SPARK-31150-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:27:06 +09:00
yi.wu 8bfaa62f2f [SPARK-31175][SQL] Avoid creating reverse comparator for each compare in InterpretedOrdering
### What changes were proposed in this pull request?

Prepend `-` to the compare result instead of creating a new reverse comparator for each comparison when sorting in DESC order in InterpretedOrdering.
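
A minimal sketch of the idea, in plain Scala rather than Spark's actual `InterpretedOrdering` code: for a DESC sort order, negate the ascending comparator's result instead of materializing a reversed comparator on every comparison.

```scala
// Sketch only; not the real InterpretedOrdering implementation.
val ascending: Ordering[Int] = Ordering.Int

// Allocates a fresh reversed Ordering object if written inside a hot compare loop:
def compareDescAllocating(a: Int, b: Int): Int = ascending.reverse.compare(a, b)

// The cheaper alternative described above: prepend `-` to the ascending result.
def compareDesc(a: Int, b: Int): Int = -ascending.compare(a, b)
```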

### Why are the changes needed?

Currently, we create a new reverse comparator for each comparison in InterpretedOrdering, which can generate lots of small, short-lived objects and hurt the JVM when there is plenty of data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27938 from Ngone51/reverse_comparator.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 23:56:48 +08:00
Kent Yao 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern character
### What changes were proposed in this pull request?

In SimpleDateFormat, 'u' meant the day number of the week; it was changed to mean year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' with 'e' internally and using DateTimeFormatter to parse the pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week, e.g.

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all silently go to `.... eeee`. Users may be confused by their meanings, so we should mark them as illegal pattern characters to keep the behavior the same as before.

This PR moves the method `convertIncompatiblePattern` from `DateTimeUtils` to the `DateTimeFormatterHelper` object, since it is quite specific to the `DateTimeFormatterHelper` class, and adds the 'e' and 'c' character checks to this method.

Besides, `convertIncompatiblePattern` has a bug that loses the trailing `'` if the pattern ends with one; this PR fixes that too, e.g.

```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S

spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?

Avoid ambiguity in the pattern characters.
Bug fix for the trailing `'`.

### Does this PR introduce any user-facing change?

No, these are not exposed yet.

### How was this patch tested?

add ut

Closes #27939 from yaooqinn/SPARK-31176.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 20:19:50 +08:00
Kent Yao f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
### What changes were proposed in this pull request?

```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>
```
We currently only support datetime values as the extract source for the `extract` expression, but its alternative function `date_part` supports both datetime and interval values.

This PR adds interval value support for the `extract` expression as the extract source.

### Why are the changes needed?

For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.

### Does this PR introduce any user-facing change?

Yes, in the `extract(abc from xyz)` expression, `xyz` can now be an interval value.
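
A hedged illustration, run in spark-shell where `spark` is the active SparkSession; the field names `days`/`hours` are assumed to be accepted for interval sources:

```scala
// Extract fields from an interval value instead of a datetime value.
spark.sql("SELECT extract(days FROM interval '3 11:22:33' day to second)").show()
spark.sql("SELECT extract(hours FROM interval '3 11:22:33' day to second)").show()
```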

### How was this patch tested?

add unit tests

Closes #27876 from yaooqinn/SPARK-31119.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 12:29:39 +08:00
Wenchen Fan dc5ebc2d5b
[SPARK-31171][SQL] size(null) should return null under ansi mode
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.
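
A hedged sketch of the resulting behavior, assuming a spark-shell session where `spark` is available and the expected outputs follow this description:

```scala
// Default (ANSI off): size(null) keeps returning -1, per spark.sql.legacy.sizeOfNull.
spark.sql("SELECT size(CAST(NULL AS array<int>))").show()   // -1

// With ANSI mode on, size(null) returns NULL regardless of the legacy config.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(CAST(NULL AS array<int>))").show()   // NULL
```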

### Why are the changes needed?

In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:48:54 -07:00
Kent Yao 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamps whose second fraction has 0 to 6 digits, but fails on 7 or more:
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use the Java 8 time API to parse and format timestamp values. When we create the `DateTimeFormatter`, we first use `appendPattern` to build it, where the 'S..S' part is parsed with a fixed length (= `'S..S'.length`). This fits the formatting side but is too strict for the parsing side, because trailing zeros are very likely to be truncated.
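
A hedged sketch of the underlying java.time behavior being targeted, not Spark's actual builder code: a formatter built with `appendFraction` accepts a variable-length fraction, unlike a fixed-width 'S..S' pattern.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

// Seconds fraction may have 0 to 9 digits, preceded by '.' when present.
val formatter = new DateTimeFormatterBuilder()
  .appendPattern("uuuu-MM-dd HH:mm:ss")
  .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
  .toFormatter()

LocalDateTime.parse("2019-10-06 10:11:12.123", formatter)  // 3 fraction digits
LocalDateTime.parse("2019-10-06 10:11:12", formatter)      // no fraction at all
```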

### Why are the changes needed?

Improve timestamp parsing and be more compatible with 2.4.x.

### Does this PR introduce any user-facing change?

No, the related changes are newly added.
### How was this patch tested?

add uts

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 21:53:46 +08:00
Wenchen Fan d7b97a1d0d [SPARK-31166][SQL] UNION map<null, null> and other maps should not fail
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.

The reason is that the `TypeCoercion` rules and `Cast` think it's illegal to cast a null-type map key to other types, as it makes the key nullable, but it's actually legal. This PR fixes that.
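
A hedged example of the kind of query this fixes, in spark-shell:

```scala
// map() now has type map<null, null>; unioning it with a typed map requires casting the
// null-typed key/value to the other side's types, which this change allows.
spark.sql("SELECT map() AS m UNION ALL SELECT map('a', 'b') AS m").show()
```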

### Why are the changes needed?

To avoid breaking queries.

### Does this PR introduce any user-facing change?

Yes, now some queries that work in 2.x can work in 3.0 as well.

### How was this patch tested?

new test

Closes #27926 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 12:01:29 +08:00
HyukjinKwon 6704103499
[SPARK-31146][SQL] Leverage the helper method for aliasing in built-in SQL expressions
### What changes were proposed in this pull request?

This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable.

- `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default.
- When the naming differs in its implementation, it has to be overridden for that expression specifically, e.g. `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`, `FormatString` and `XPathDouble`.

This PR fixes the aliases of the functions below:

| class                    | alias            |
|--------------------------|------------------|
|`Rand`                    |`random`          |
|`Ceil`                    |`ceiling`         |
|`Remainder`               |`mod`             |
|`Pow`                     |`pow`             |
|`Signum`                  |`sign`            |
|`Chr`                     |`char`            |
|`Length`                  |`char_length`     |
|`Length`                  |`character_length`|
|`FormatString`            |`printf`          |
|`Substring`               |`substr`          |
|`Upper`                   |`ucase`           |
|`XPathDouble`             |`xpath_number`    |
|`DayOfMonth`              |`day`             |
|`CurrentTimestamp`        |`now`             |
|`Size`                    |`cardinality`     |
|`Sha1`                    |`sha`             |
|`CallMethodViaReflection` |`java_method`     |

Note: the `EqualTo`, `=` and `==` aliases were excluded because this helper method cannot be leveraged for them; fixing that would require changing the parser.

Note: this PR also excludes some instances, such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive`, that need an explicit name override, to keep the scope of this PR smaller.
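
A hedged illustration of the effect in spark-shell; the exact rendering of the literal argument in the column name may differ:

```scala
// The output column name now reflects the alias the query used,
// e.g. "ceiling(...)" rather than "ceil(...)".
spark.sql("SELECT ceiling(1.2)").columns.foreach(println)
```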

### Why are the changes needed?

To respect expression name.

### Does this PR introduce any user-facing change?

Yes, it will change the output column name.

### How was this patch tested?

Manually tested, and unit tests were added.

Closes #27901 from HyukjinKwon/31146.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 11:22:34 -07:00
Wenchen Fan 50a29672e0 [SPARK-30958][SQL] do not set default era for DateTimeFormatter
### What changes were proposed in this pull request?

It's not needed at all as now we replace "y" with "u" if there is no "G". So the era is either explicitly specified (e.g. "yyyy G") or can be inferred from the year (e.g. "uuuu").

### Why are the changes needed?

By default we use "uuuu" as the year pattern, which indicates the era already. If we set a default era, it can get conflicted and fail the parsing.

### Does this PR introduce any user-facing change?

Yes, now Spark can parse date/timestamp values with a negative year via the "yyyy" pattern, which will be converted to "uuuu" under the hood.

### How was this patch tested?

new tests

Closes #27707 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-16 16:48:31 +09:00
Wenchen Fan b27b3c91f1 [SPARK-31090][SPARK-25457] Revert "IntegralDivide returns data type of the operands"
### What changes were proposed in this pull request?

This reverts commit 47d6e80a2e.

### Why are the changes needed?

There is no standard requiring that `div` must return the type of the operand, and always returning long type looks fine. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `div` back to be the same as 2.4.
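
A hedged check of the restored behavior, in spark-shell:

```scala
// After the revert, integral division yields a bigint (long) column again, as in 2.4.
spark.sql("SELECT 5 div 2").printSchema()
```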

### How was this patch tested?

N/A

Closes #27835 from cloud-fan/revert2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-13 10:47:36 +09:00
Kent Yao 7b4b29e8d9
[SPARK-31131][SQL] Remove the unnecessary config spark.sql.legacy.timeParser.enabled
### What changes were proposed in this pull request?

`spark.sql.legacy.timeParser.enabled` should be removed from SQLConf and the migration guide;
`spark.sql.legacy.timeParserPolicy` is the right one.

### Why are the changes needed?

Fix the documentation.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Pass the jenkins

Closes #27889 from yaooqinn/SPARK-31131.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-12 09:24:49 -07:00
Wenchen Fan 77c49cb702 [SPARK-31124][SQL] change the default value of minPartitionNum in AQE
### What changes were proposed in this pull request?

AQE has a perf regression when using the default settings: if we coalesce the shuffle partitions into one or few partitions, we may leave many CPU cores idle and the perf is worse than with AQE off (which leverages all CPU cores).

Technically, this is not a bad thing. If there are many queries running at the same time, it's better to coalesce shuffle partitions into fewer partitions. However, the default settings of AQE should try to avoid any perf regression as much as possible.

This PR changes the default value of minPartitionNum when coalescing shuffle partitions, to be `SparkContext.defaultParallelism`, so that AQE can leverage all the CPU cores.
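
A hedged sketch of the related settings; the full config key is assumed from the `coalesceShufflePartitions` naming introduced by the SPARK-31037 commit later in this list:

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
// If left unset, minPartitionNum now falls back to SparkContext.defaultParallelism,
// so coalesced shuffle partitions still keep all CPU cores busy. Explicit override:
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")
```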

### Why are the changes needed?

avoid AQE perf regression

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #27879 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-12 21:28:24 +08:00
yi.wu feb9b9e771 [SPARK-31010][SQL][FOLLOW-UP] Give an example for typed Scala UDF in error message
### What changes were proposed in this pull request?

Add an example of a typed Scala UDF to the error message.
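
A hedged sketch of the migration the message points to; the error text itself is not reproduced here:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Typed Scala UDF: input and return types come from the function signature,
// which is what the error message now recommends over the untyped udf(AnyRef, DataType) form.
val plusOne = udf((x: Int) => x + 1)
spark.range(3).select(plusOne(col("id").cast("int"))).show()
```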

### Why are the changes needed?

Help users know how to migrate to typed Scala UDFs.

### Does this PR introduce any user-facing change?

No, it's a new error message in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27884 from Ngone51/spark_31010_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-12 21:16:02 +09:00
Wenchen Fan 8efb71013d
[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066

### Why are the changes needed?

There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-11 09:55:24 -07:00
Wenchen Fan 5be0d04f16 [SPARK-31117][SQL][TEST] reduce the test time of DateTimeUtilsSuite
### What changes were proposed in this pull request?

`DateTimeUtilsSuite.daysToMicros and microsToDays` takes 30 seconds, which is too long for a UT.

This PR changes the test to check random data, to reduce testing time. Now this test takes 1 second.

### Why are the changes needed?

Make the test faster.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27873 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 23:47:13 +08:00
Maxim Gekk 3d3e366aa8 [SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
### What changes were proposed in this pull request?
In the PR, I propose to change the conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before the cutover day `1582-10-15` of the Gregorian calendar. I propose to construct a local date-time from the microseconds/days since the epoch, take each date-time component (`year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`), and construct java.sql.Timestamp/Date from the extracted components.

### Why are the changes needed?
This rebases the underlying time/date offset so that collected java.sql.Timestamp/Date values have the same local date-time components as the original values in the Gregorian calendar.

Here is the example which demonstrates the issue:
```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

### Does this PR introduce any user-facing change?
Yes, after the changes:
```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 20:53:56 +08:00
Liang-Chi Hsieh 15557a7d05 [SPARK-31071][SQL] Allow annotating non-null fields when encoding Java Beans
### What changes were proposed in this pull request?

When encoding Java Beans to a Spark DataFrame, respect `javax.annotation.Nonnull` and produce non-nullable fields.

### Why are the changes needed?

When encoding Java Beans to a Spark DataFrame, non-primitive types are encoded as nullable fields. Although it works for most cases, it can be an issue in a few situations, e.g. the one described in the JIRA ticket when saving a DataFrame to the Avro format with a non-null field.

We should allow Spark users more flexibility when creating Spark DataFrame from Java Beans. Currently, Spark users cannot create DataFrame with non-nullable fields in the schema from beans with non-nullable properties.

Although it is possible to project top-level columns with SQL expressions like `AssertNotNull` to make it non-null, for nested fields it is more tricky to do it similarly.

### Does this PR introduce any user-facing change?

Yes. After this change, Spark users can use `javax.annotation.Nonnull` to annotate non-null fields in Java Beans when encoding beans to Spark DataFrame.

### How was this patch tested?

Added unit test.

Closes #27851 from viirya/SPARK-31071.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 18:27:48 +08:00
Yuanjian Li 3493162c78 [SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime
### What changes were proposed in this pull request?
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is the de-facto calendar worldwide, as well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology). The switch was completed in SPARK-26651.
But after the switch, some patterns are not compatible between Java 8 and Java 7, so Spark needs its own definition of the patterns rather than depending on the Java API.
In this PR, we achieve this by documenting the patterns and shadowing the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030)

### Why are the changes needed?
For backward compatibility.

### Does this PR introduce any user-facing change?
No.
After we define our own datetime parsing and formatting patterns, the behavior is the same as in older Spark versions.

### How was this patch tested?
Existing and newly added UTs.
Locally document test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 14:11:13 +08:00
Kent Yao 3bd6ebff81 [SPARK-30189][SQL] Interval from year-month/date-time string should handle whitespaces
### What changes were proposed in this pull request?

Currently, we parse intervals either from multi-unit strings or from date-time/year-month pattern strings; the former handles all whitespace, while the latter does not handle it, not even spaces.

### Why are the changes needed?

behavior consistency

### Does this PR introduce any user-facing change?
Yes, an interval in the date-time/year-month form like
```
select interval '\n-\t10\t 12:34:46.789\t' day to second
-- !query 126 schema
struct<INTERVAL '-10 days -12 hours -34 minutes -46.789 seconds':interval>
-- !query 126 output
-10 days -12 hours -34 minutes -46.789 seconds
```
is valid now.

### How was this patch tested?

add ut.

Closes #26815 from yaooqinn/SPARK-30189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 22:08:58 +08:00
Eric Wu 15df2a3f40 [SPARK-31079][SQL] Logging QueryExecutionMetering in RuleExecutor logger
### What changes were proposed in this pull request?
RuleExecutor already supports metering for analyzer/optimizer rules. By providing such information in `PlanChangeLogger`, users can get more information when debugging rule changes.

This PR enhances `PlanChangeLogger` to display RuleExecutor metrics. This could easily be done by calling the existing APIs `resetMetrics` and `dumpTimeSpent`, but there might be conflicts if the user is also collecting the total metrics of a SQL job. Thus, I introduced `QueryExecutionMetrics`, as a snapshot of `QueryExecutionMetering`, to better support this feature.

Information added to `PlanChangeLogger`
```
=== Metrics of Executed Rules ===
Total number of runs: 554
Total time: 0.107756568 seconds
Total number of effective runs: 11
Total time of effective runs: 0.047615486 seconds
```

### Why are the changes needed?
Provide better plan change debugging user experience

### Does this PR introduce any user-facing change?
It only adds more debugging info to `planChangeLog`; the default log level is TRACE.

### How was this patch tested?
Update existing tests to verify the new logs

Closes #27846 from Eric5553/ExplainRuleExecMetrics.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 19:08:59 +08:00
HyukjinKwon 815c7929c2
[SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
### What changes were proposed in this pull request?

This PR proposes two things:

1. Convert `null` to `string` type during schema inference of `schema_of_json`, as the JSON datasource does. This is a bug fix as well, because the `null` string is not a proper DDL-formatted string and the SQL parser is unable to recognise it as a type string. We should match it to the JSON datasource and return a string type so `schema_of_json` returns a proper DDL-formatted string.

2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference.

### Why are the changes needed?

To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option.

### Does this PR introduce any user-facing change?
Yes, it does.

```scala
import collection.JavaConverters._
import org.apache.spark.sql.functions._

spark.range(1).select(schema_of_json(lit("""{"id": ""}"""))).show()
spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false)
```

**Before:**

```
struct<id:null>
struct<drop:struct<drop:null>,id:string>
```

**After:**

```
struct<id:string>
struct<id:string>
```

### How was this patch tested?

Manually tested, and unit tests were added.

Closes #27854 from HyukjinKwon/SPARK-31065.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-10 00:33:32 -07:00
Yuchen Huo a22994333a [SPARK-30902][SQL][FOLLOW-UP] Allow ReplaceTableAsStatement to have none provider
### What changes were proposed in this pull request?

This is a follow-up to https://github.com/apache/spark/pull/27650, which allowed a None provider for CREATE TABLE. Here we are doing the same thing for ReplaceTable.

### Why are the changes needed?

Although the ASTBuilder currently doesn't seem to allow `replace` without a `USING` clause, this would allow `DataFrameWriterV2` to use the statements instead of the commands directly.
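
A hedged sketch of the kind of call this unblocks; the table identifier is hypothetical and assumes a configured v2 catalog:

```scala
// DataFrameWriterV2.replace() issues a replace-table-as-select without an explicit USING clause.
spark.range(10).writeTo("my_catalog.db.tbl").replace()
```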

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #27838 from yuchenhuo/SPARK-30902.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 11:37:31 +08:00
Wenchen Fan 1aa184763a
[SPARK-31053][SQL] mark connector APIs as Evolving
### What changes were proposed in this pull request?

The newly added catalog APIs are marked as Experimental but other DS v2 APIs are marked as Evolving.

This PR makes it consistent and mark all Connector APIs as Evolving.

### Why are the changes needed?

For consistency.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27811 from cloud-fan/tag.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-08 11:41:09 -07:00
beliefer f8a3730fd7 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691, https://github.com/apache/spark/pull/27730 and https://github.com/apache/spark/pull/27770
I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.redaction.options.regex | 2.2.2 | SPARK-23850 | 6a55d8b03053e616dcacb79cd2c29a06d219dc32#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.redaction.string.regex | 2.3.0 | SPARK-22791 | 28315714ddef3ddcc192375e98dd5207cf4ecc98#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.concatBinaryAsString | 2.3.0 | SPARK-22771 | f2b3525c17d660cf6f082bbafea8632615b4f58e#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.eltOutputAsString | 2.3.0 | SPARK-22937 | bf853018cabcd3b3abf84bfe534d2981020b4a71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.validatePartitionColumns | 3.0.0 | SPARK-26263 | 5a140b7844936cf2b65f08853b8cfd8c499d4f13#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.epochBacklogQueueSize | 3.0.0 | SPARK-24063 | c4bbfd177b4e7cb46f47b39df9fd71d2d9a12c6d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorQueueSize | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorPollIntervalMs | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.useV1SourceList | 3.0.0 | SPARK-28747 | cb06209fc908bac6ce6a8f20653865489773cbc3#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.disabledV2Writers | 2.3.1 | SPARK-23196 | 588b9694c1967ff45774431441e84081ee6eb515#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.disabledV2MicroBatchReaders | 2.4.0 | SPARK-23362 | 0a73aa31f41c83503d5d99eff3c9d7b406014ab3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.partitionOverwriteMode | 2.3.0 | SPARK-20236 | b96248862589bae1ddcdb14ce4c802789a001306#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.storeAssignmentPolicy | 3.0.0 | SPARK-28730 | 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.ansi.enabled | 3.0.0 | SPARK-30125 | d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.sortBeforeRepartition | 2.1.4 | SPARK-23207 and SPARK-22905 and SPARK-24564 and SPARK-25114 | 4d2d3d47e00e78893b1ecd5a9a9070adc5243ac9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.nestedSchemaPruning.enabled | 2.4.1 | SPARK-4502 | dfcff38394929970fee454c69864d0e10d59f8d4#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.serializer.nestedSchemaPruning.enabled | 3.0.0 | SPARK-26837 | 0f2c0b53e8fb18c86c67b5dd679c006db93f94a5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.expression.nestedPruning.enabled | 3.0.0 | SPARK-27707 | 127bc899ae78d73332a87f0972b5db3c9936c1f1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.topKSortFallbackThreshold | 2.4.0 | SPARK-24193 | 8a837bf4f3f2758f7825d2362cf9de209026651a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.parser.columnPruning.enabled | 2.4.0 | SPARK-24244 and SPARK-24368 | 64fad0b519cf35b8c0a0dec18dd3df9488a5ed25#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.repl.eagerEval.enabled | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.maxNumRows | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.truncate | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.codegen.aggregate.fastHashMap.capacityBit | 2.4.0 | SPARK-24978 | 6193a202aab0271b4532ee4b740318290f2c44a1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.sizeOfNull | 2.4.0 | SPARK-24605 | d08f53dc61f662f5291f71bcbe1a7b9f531a34d2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setopsPrecedence.enabled | 2.4.0 | SPARK-24966 | 73dd6cf9b558f9d752e1f3c13584344257ad7863#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.exponentLiteralAsDecimal.enabled | 3.0.0 | SPARK-29956 | 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowNegativeScaleOfDecimal | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createHiveTableByDefault.enabled | 3.0.0 | SPARK-30098 | 58be82ad4b98fc17e821e916e69e77a6aa36209d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.integralDivide.returnBigint | 3.0.0 | SPARK-25457 | 47d6e80a2e64823fabb596503fb6a6cc6f51f713#diff-9a6b543db706f1a90f790783d6930a13 | Exists in branch-3.0 branch, but the pom.xml file corresponding to the commit log is 2.5.0-SNAPSHOT
spark.sql.legacy.bucketedTableScan.outputOrdering | 3.0.0 | SPARK-28595 | 469423f33887a966aaa33eb75f5e7974a0a97beb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.parser.havingWithoutGroupByAsWhere | 2.4.1 | SPARK-25708 | 3dba5d41f1a66ae5eb08404d103284110c45a351#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.json.allowEmptyString.enabled | 3.0.0 | SPARK-25040 | d3de7568f32e298442f07b0a28b2c906de72c797#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createEmptyCollectionUsingStringType | 3.0.0 | SPARK-30790 | 8ab6ae3ede96adb093347470a5cbbf17fe8c04e9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowUntypedScalaUDF | 3.0.0 | SPARK-26580 | bc30a07ce262840c99a752db4fbd3a423f652017#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.truncateTable.ignorePermissionAcl.enabled | 2.4.6 | SPARK-30312 | 830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue | 3.0.0 | SPARK-26085 | ab2eafb3cdc7631452650c6cac03a92629255347#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.debug.maxToStringFields | 3.0.0 | SPARK-26066 | 81550b38e43fb20f89f529d2127575c71a54a538#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maxPlanStringLength | 3.0.0 | SPARK-26103 | 812ad5546148d2194ab0e4230ee85b8f6a5be2fb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setCommandRejectsSparkCoreConfs | 3.0.0 | SPARK-26060 | 1ab3d3e474ce2e36d58aea8ad09fb61f0c73e5c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.datetime.java8API.enabled | 3.0.0 | SPARK-27008 | 52671d631d2a64ed1cfa0c6e01168908faf92df8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.binaryFile.maxLength | 3.0.0 | SPARK-27588 | 618d6bff71073c8c93501ab7392c3cc579730f0b#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.typeCoercion.datetimeToString.enabled | 3.0.0 | SPARK-27638 | 83d289eef492de8c7f3e5145f9bd75431608b500#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultCatalog | 3.0.0 | SPARK-29753 | 942753a44beeae5f0142ceefa307e90cbc1234c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.catalog.$SESSION_CATALOG_NAME | 3.0.0 | SPARK-29412 | 9407fba0375675d6ee6461253f3b8230e8d67509#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.doLooseUpcast | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.ctePrecedencePolicy | 3.0.0 | SPARK-30829 | 00943be81afbca6be13e1e72b24536cd98a788d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParserPolicy | 3.1.0 | SPARK-30668 | 7db0af578585ecaeee9fd23f8189292289b52a97#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.followThreeValuedLogicInArrayExists | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maven.additionalRemoteRepositories | 3.0.0 | SPARK-29175 | 3d7359ad4202067b26a199657b6a3e1f38be0e4d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.fromDayTimeString.enabled | 3.0.0 | SPARK-29864 and SPARK-29920 | e933539cdd557297daf97ff5e532a3f098896979#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.notReserveProperties | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.addSingleFileInAddFile | 3.0.0 | SPARK-30234 | 8a8d1fbb10af6da481f26831cd519ef46ccbce6c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.mssqlserver.numericMapping.enabled | 2.4.5 | SPARK-28152 | 69de7f31c37a7e0298e66cc814afc1b0aa948bbb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.filterPushdown.enabled | 3.0.0 | SPARK-30323 | 4e50f0291f032b4a5c0b46ed01fdef14e4cbb050#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.addPartitionInBatch.size | 3.0.0 | SPARK-29938 | 5ccbb38a71890b114c707279e7395d1f6284ebfd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParser.enabled | 3.0.0 | SPARK-30668 | 92f57237871400ab9d499e1174af22a867c01988#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowDuplicatedMapKeys | 3.0.0 | SPARK-25829 | 33329caa81827a245b84158b13234b88a4746e56#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27829 from beliefer/add-version-to-sql-config-part-four.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-08 12:33:02 +09:00
iRakson cba17e07e9 [SPARK-30899][SQL] CreateArray/CreateMap's data type should not depend on SQLConf.get
### What changes were proposed in this pull request?
Introduced a new parameter `emptyCollection` for the `CreateMap` and `CreateArray` functions to remove the dependency on SQLConf.get.

### Why are the changes needed?
This avoids the issue where the configuration changes between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27657 from iRakson/SPARK-30899.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 16:45:06 +08:00
Takeshi Yamamuro 71c73d58f6 [SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID
### What changes were proposed in this pull request?

This PR intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur when computing grouping IDs;
e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)

For example, the query below generates wrong grouping IDs in the master;
```

scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")

// numCols = 32
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|0            |3     | // Wrong Grouping ID
+-------------+------+

// numCols = 31
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|1            |3     |
+-------------+------+
```
To fix this issue, this PR changes the code to use long values for `GROUPING_ID` instead of int values.
### Why are the changes needed?

To support more cases in `GROUPING_ID`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #26918 from maropu/FixGroupingIdIssue.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-06 16:57:03 +09:00
Maxim Gekk 59f1e76b82 [SPARK-31020][SPARK-31023][SPARK-31025][SPARK-31044][SQL] Support foldable args by from_csv/json and schema_of_csv/json
### What changes were proposed in this pull request?
In the PR, I propose:

1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` to checking foldable property of the `schema` expression.
2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` to checking foldable property of the `schema` expression.
3. To change checking of the input parameter in the `SchemaOfCsv` expression, and allow foldable `child` expression.
4. To change checking of the input parameter in the `SchemaOfJson` expression, and allow foldable `child` expression.

### Why are the changes needed?
This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
```
and only string literals are acceptable as CSV examples by `schema_of_csv`/`schema_of_json`:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7;
'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 as string))), None)]
+- OneRowRelation
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7;
'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)]
+- OneRowRelation
```

### Does this PR introduce any user-facing change?
Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
```
After change the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
struct<_c0:double,_c1:int>
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
struct<id:bigint,price:double>
```

### How was this patch tested?
Added new test to `CsvFunctionsSuite` and to `JsonFunctionsSuite`.

Closes #27804 from MaxGekk/foldable-arg-csv-json-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 12:29:35 +08:00
Dongjoon Hyun afb84e9d37 [SPARK-30886][SQL] Deprecate two-parameter TRIM/LTRIM/RTRIM functions
### What changes were proposed in this pull request?

This PR aims to show a deprecation warning on two-parameter TRIM/LTRIM/RTRIM function usages based on the community decision.
- https://lists.apache.org/thread.html/r48b6c2596ab06206b7b7fd4bbafd4099dccd4e2cf9801aaa9034c418%40%3Cdev.spark.apache.org%3E

### Why are the changes needed?

For backward compatibility, SPARK-28093 is reverted. However, from Apache Spark 3.0.0, we should give a safe guideline to use SQL syntax instead of the esoteric function signatures.
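
A hedged example of the recommended direction, in spark-shell; only the SQL-standard syntax is shown, so the two-parameter argument order is not restated here:

```scala
// SQL-standard trim syntax, recommended over the two-parameter function form.
spark.sql("SELECT trim(BOTH 'x' FROM 'xxxhixxx')").show()      // hi
spark.sql("SELECT trim(LEADING 'x' FROM 'xxxhixxx')").show()   // hixxx
spark.sql("SELECT trim(TRAILING 'x' FROM 'xxxhixxx')").show()  // xxxhi
```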

### Does this PR introduce any user-facing change?

Yes. This shows a directional warning.

### How was this patch tested?

Pass the Jenkins with a newly added test case.

Closes #27643 from dongjoon-hyun/SPARK-30886.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-05 20:09:39 -08:00
maryannxue d705d36c0c [SPARK-31045][SQL] Add config for AQE logging level
### What changes were proposed in this pull request?
This PR adds an internal config for changing the logging level of adaptive execution query plan evolution.
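
A hedged sketch; the config key is an assumption based on this description:

```scala
// Raise the AQE plan-evolution logging to a level visible with typical log4j defaults.
spark.conf.set("spark.sql.adaptive.logLevel", "INFO")  // key assumed
```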

### Why are the changes needed?
To make AQE debugging easier.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #27798 from maryannxue/aqe-log-level.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 11:41:45 +08:00
beliefer d9254b26f1 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691 and https://github.com/apache/spark/pull/27730
I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.execution.useObjectHashAggregateExec | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.jsonGenerator.ignoreNullFields | 3.0.0 | SPARK-29444 | 78b0cbe265c4e8cc3d4d8bf5d734f2998c04d376#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.fileSink.log.deletion | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.compactInterval | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.cleanupDelay | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.deletion | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.compactInterval | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.cleanupDelay | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.schema.forceNullable | 3.0.0 | SPARK-28651 | 5bb69945e4aaf519cd10a5c5083332f618039af0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.fileSource.cleaner.numThreads | 3.0.0 | SPARK-29876 | abf759a91e01497586b8bb6b7a314dd28fd6cff1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.schemaInference | 2.0.0 | SPARK-15458 | 1fb7b3a0a2e3a5c5f784aab662df93fcc1449c36#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.pollingDelay | 2.0.0 | SPARK-16002 | afa14b71b28d788c53816bd2616ccff0c3967f40#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.stopTimeout | 3.0.0 | SPARK-30143 | 4c37a8a3f4a489b52f1919d2db84f6e32c6a05cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.noDataProgressEventInterval | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.noDataMicroBatches.enabled | 2.4.1 | SPARK-24157 | 535bf1cc9e6b54df7059ac3109b8cba30057d040#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.metricsEnabled | 2.0.2 | SPARK-17731 | 881e0eb05782ea74cf92a62954466b14ea9e05b6#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.numRecentProgressUpdates | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.checkpointFileManagerClass | 2.4.0 | SPARK-23966 | cbb41a0c5b01579c85f06ef42cc0585fbef216c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.checkpoint.escapedPathCheck.enabled | 3.0.0 | SPARK-26824 | 77b99af57330cf2e5016a6acc69642d54041b041#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.parallelFileListingInStatsComputation.enabled | 2.4.1 | SPARK-24626 | f11f44548903bbab7ab764574d6bed326cf4cd8d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultSizeInBytes | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.statistics.fallBackToHdfs | 2.0.0 | SPARK-15960 | 5c53442cc098dd618ba1430962727c74b2de2e68#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.statistics.ndv.maxError | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.statistics.histogram.enabled | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.histogram.numBins | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.percentile.accuracy | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.size.autoUpdate.enabled | 2.3.0 | SPARK-21127 | d5202259d9aa9ad95d572af253bf4a722b7b437a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.planStats.enabled | 3.0.0 | SPARK-24690 | 3f3a18fff116a02ff7996d45a1061f48a2de3102#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.joinReorder.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.threshold | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.card.weight | 2.2.0 | SPARK-19915 | c083b6b7dec337d680b54dabeaa40e7a0f69ae69#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.star.filter | 2.2.0 | SPARK-20233 | fbe4216e1e83d243a7f0521b76bfb20c25278281#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starSchemaDetection | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starJoinFTRatio | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.session.timeZone | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.windowExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.windowExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.quotedRegexColumnNames | 2.3.0 | SPARK-12139 | 2cbfc975ba937a4eb761de7a6473b7747941f386#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.rangeExchange.sampleSizePerPartition | 2.3.0 | SPARK-22160 | 323806e68f91f3c7521327186a37ddd1436267d0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.sparkr.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.fallback.enabled | 2.4.0 | SPARK-23380 | d6632d185e147fcbe6724545488ad80dce20277e#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.fallback.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.maxRecordsPerBatch | 2.3.0 | SPARK-13534 | d03aebbe6508ba441dc87f9546f27aeb27553d77#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.pandas.udf.buffer.size | 3.1.0 | SPARK-27870 | 692e3ddb4e517638156f7427ade8b62fb37634a7#diff-9a6b543db706f1a90f790783d6930a13 | Exists in master, not branch-3.0
spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName | 2.4.1 | SPARK-24324 | 3f203050ac764516e68fb43628bba0df5963e44d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.pandas.convertToArrowArraySafely | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.replaceExceptWithFilter | 2.3.0 | SPARK-22181 | 01f6ba0e7a12ef818d56e7d5b1bd889b79f2b57c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.decimalOperations.allowPrecisionLoss | 2.3.1 | SPARK-22036 | 8a98274823a4671cee85081dd19f40146e736325#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.literal.pickMinimumPrecision | 2.3.3 | SPARK-25454 | 26d893a4f64de18222942568f7735114447a6ab7#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27770 from beliefer/add-version-to-sql-config-part-three.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 11:19:08 +09:00
HyukjinKwon fc12165f48 [SPARK-31036][SQL] Use stringArgs in Expression.toString to respect hidden parameters
### What changes were proposed in this pull request?

This PR proposes to respect hidden parameters by using `stringArgs` in `Expression.toString`. This way, we can show the strings properly in some cases such as `NonSQLExpression`.

### Why are the changes needed?

To respect "hidden" arguments in the string representation.

### Does this PR introduce any user-facing change?

Yes, for example, on the top of https://github.com/apache/spark/pull/27657,

```scala
val identify = udf((input: Seq[Int]) => input)
spark.range(10).select(identify(array("id"))).show()
```

shows hidden parameter `useStringTypeWhenEmpty`.

```
+---------------------+
|UDF(array(id, false))|
+---------------------+
|                  [0]|
|                  [1]|
...
```

whereas:

```scala
spark.range(10).select(array("id")).show()
```

```
+---------+
|array(id)|
+---------+
|      [0]|
|      [1]|
...
```

### How was this patch tested?

Manually tested as below:

```scala
val identify = udf((input: Boolean) => input)
spark.range(10).select(identify(exists(array(col("id")), _ % 2 === 0))).show()
```

Before:

```
+-------------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false), true))|
+-------------------------------------------------------------------------------------+
|                                                                                 true|
|                                                                                false|
|                                                                                 true|
...
```

After:

```
+-------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false)))|
+-------------------------------------------------------------------------------+
|                                                                           true|
|                                                                          false|
|                                                                           true|
...
```

Closes #27788 from HyukjinKwon/arguments-str-repr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 10:33:20 +09:00
DB Tsai fe126a6a05 [SPARK-31058][SQL][TEST-HIVE1.2] Consolidate the implementation of quoteIfNeeded
### What changes were proposed in this pull request?
There are two implementations of quoteIfNeeded. One is in `org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote` and the other is in `OrcFiltersBase.quoteAttributeNameIfNeeded`. This PR consolidates them into one.
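
A minimal sketch of what such a `quoteIfNeeded` helper does, not the exact consolidated implementation:

```scala
// Quote an identifier part with backticks only when it actually needs quoting.
def quoteIfNeeded(part: String): String =
  if (part.matches("[a-zA-Z0-9_]+") && !part.matches("\\d+")) {
    part
  } else {
    s"`${part.replace("`", "``")}`"
  }

quoteIfNeeded("col_1")        // col_1
quoteIfNeeded("a.b")          // `a.b`
quoteIfNeeded("weird`name")   // `weird``name`
```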

### Why are the changes needed?
Simplify the codebase.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27814 from dbtsai/SPARK-31058.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-03-06 00:13:57 +00:00
Wenchen Fan ba86524b25 [SPARK-31037][SQL] refine AQE config names
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. Skew join is a well-known term in the database area, so we can just say `skewJoin`
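A usage sketch with the refined names (the exact config keys below are assumptions pieced together from the rename list above, not authoritative):

```scala
// Assumed config keys, derived from the rename list above; treat them as illustrative.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalesceShufflePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalesceShufflePartitions.minPartitionNum", "1")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```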

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 00:46:34 +08:00
Maxim Gekk 1fd9a91c66 [SPARK-31005][SQL] Support time zone ids in casting strings to timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of the input string. After the changes, the function accepts zone ids in the following formats:
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starts with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).

### Why are the changes needed?
- To use `stringToTimestamp` as a substitute for the removed `stringToTime`, see https://github.com/apache/spark/pull/27710#discussion_r385020173
- Improve the UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets, which can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?
Yes. After the changes, casting strings to timestamps allows time zone id at the end of the strings:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?
- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Run `CastSuite` and `AnsiCastSuite`.

Closes #27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 20:49:43 +08:00
Wenchen Fan 807ea413b4 [SPARK-31019][SQL] make it clear that people can deduplicate map keys
### What changes were proposed in this pull request?

rename the config and make it non-internal.

### Why are the changes needed?

Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate them. However, we must provide a way to get users out of this situation, instead of just refusing to run the query. This exit strategy should always be there, while a legacy config indicates that it may be removed someday.
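For illustration, a query that hits the fail behavior described above (a sketch of the default behavior, not a test from this PR; the exact error message is omitted):

```scala
// Illustrative only: duplicated map keys now fail the query by default, and the
// (renamed) config described in this PR is the documented way out of that situation.
spark.sql("SELECT map(1, 'a', 1, 'b') AS m").show()
// expected: a runtime error about a duplicated map key instead of a silently chosen value
```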

### Does this PR introduce any user-facing change?

no, just rename a config which was added in 3.0

### How was this patch tested?

add more tests for the fail behavior.

Closes #27772 from cloud-fan/map.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-05 20:43:52 +09:00
Kent Yao f45ae7f2c5 [SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone
### What changes were proposed in this pull request?

The `spark.sql.session.timeZone` config can accept any string value, including invalid time zone ids, which will then fail other queries that rely on the time zone. We should check the value in the set phase and fail fast if the zone id is invalid.
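For illustration, a sketch of the fail-fast behavior (the exact exception type and message are assumptions):

```scala
// Illustrative only: with the check added here, an invalid zone id is rejected at set time.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") // accepted
spark.conf.set("spark.sql.session.timeZone", "Not/A_Zone")          // now fails fast
```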

### Why are the changes needed?

improve configuration
### Does this PR introduce any user-facing change?

yes, it will fail fast if the value is an invalid time zone id
### How was this patch tested?

add ut

Closes #27792 from yaooqinn/SPARK-31038.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 19:38:20 +08:00
Terry Kim 66b4fd040e [SPARK-31024][SQL] Allow specifying session catalog name spark_catalog in qualified column names for v1 tables
### What changes were proposed in this pull request?

Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i`.

This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl
```

This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.

### Why are the changes needed?

Fixing an inconsistent behavior.

### Does this PR introduce any user-facing change?

Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```

### How was this patch tested?

Added new tests.

Closes #27776 from imback82/spark_catalog_col_name_resolution.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:33:59 +08:00
Yuanjian Li 7db0af5785 [SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter
### What changes were proposed in this pull request?
This is a follow-up to #27441. For the cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of silently changing the result. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid silent result change for new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect that the legacy formatter can parse the string while the new formatter returns null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 15:29:39 +08:00
Terry Kim b30278107f [SPARK-30885][SQL][FOLLOW-UP] Fix issues where some V1 commands allow tables that are not fully qualified
### What changes were proposed in this pull request?

There are a few V1 commands, such as `REFRESH TABLE`, that still allow `spark_catalog.t` because they run the commands with parsed table names without trying to load them in the catalog. This PR addresses this issue.

The PR also addresses the issue brought up in https://github.com/apache/spark/pull/27642#discussion_r382402104.

### Why are the changes needed?

To fix a bug where `spark_catalog.t` is allowed for some V1 commands.

### Does this PR introduce any user-facing change?

Yes, a bug is fixed and `REFRESH TABLE spark_catalog.t` is not allowed.

### How was this patch tested?

Added new test.

Closes #27718 from imback82/fix_TempViewOrV1Table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 18:09:48 +08:00
Wenchen Fan e4c61e35da [SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser
### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser had a special behavior: when it failed to parse a timestamp/date, it fell back to another parsing path to support some legacy formats. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switched the API used for datetime operations, we can't be exactly the same as before. Here we add back support for the common legacy formats (examples from Spark 2.4):
1. the fields can have one or two digits
```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22]                      |
+-------------------------------------------+
```
2. the separator between data and time can be "T" as well
```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12]                        |
+---------------------------------------------+
```
3. the second fraction can be of arbitrary length
```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123]                                 |
+----------------------------------------------------------+
```
4. a date string can end with any characters after "T" or a space
```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22]                            |
+----------------------------------------+
```
5. remove "GMT" from the string before parsing
```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123]                         |
+--------------------------------------------------+
```
### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

yes, to make 3.0 support parsing most of the csv/json values that were supported before.

### How was this patch tested?

new tests

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 18:27:44 +09:00
Takeshi Yamamuro 4a1d273a4a [SPARK-30997][SQL] Fix an analysis failure in generators with aggregate functions
### What changes were proposed in this pull request?

We have supported generators in SQL aggregate expressions by SPARK-28782.
But a generator (explode) query with aggregate functions in the DataFrame API failed as follows:

```
// SPARK-28782: Generator support in aggregate expressions
scala> spark.range(3).toDF("id").createOrReplaceTempView("t")
scala> sql("select explode(array(min(id), max(id))) from t").show()
+---+
|col|
+---+
|  0|
|  2|
+---+

// A failure case handled in this pr
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
org.apache.spark.sql.AnalysisException:
The query operator `Generate` contains one or more unsupported
expression types Aggregate, Window or Generate.
Invalid expressions: [min(`id`), max(`id`)];;
Project [col#46L]
+- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L]
   +- Range (0, 3, step=1, splits=Some(4))

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129)
```

The root cause is that `ExtractGenerator` wrongly replaces a project with aggregate functions before `GlobalAggregates` replaces it with an aggregate, as follows:

```
scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn")
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#72L), max(id#72L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Project [explode(array(min(id#72L), max(id#72L))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                         +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Result of Batch Resolution ===
!'Project [explode(array(min('id), max('id))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                   +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                             +- Range (0, 3, step=1, splits=Some(4))

// the analysis failed here...
```

To avoid this case in `ExtractGenerator`, this PR adds a condition to ignore generators having aggregate functions.
A correct sequence of rules is as follows:

```
20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [explode(array(min(id#27L), max(id#27L))) AS List()]   'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                         +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]   'Project [explode(_gen_input_0#31) AS List()]
!+- Range (0, 3, step=1, splits=Some(4))                           +- Aggregate [array(min(id#27L), max(id#27L)) AS _gen_input_0#31]
!                                                                     +- Range (0, 3, step=1, splits=Some(4))

```

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27749 from maropu/ExplodeInAggregate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-03 12:25:12 -08:00
Terry Kim c263c15408 [SPARK-31015][SQL] Star(*) expression fails when used with qualified column names for v2 tables
### What changes were proposed in this pull request?

For a v2 table created with `CREATE TABLE testcat.ns1.ns2.tbl (id bigint, name string) USING foo`, the following works as expected
```
SELECT testcat.ns1.ns2.tbl.id FROM testcat.ns1.ns2.tbl
```
, but a query with a qualified column name and star (*)
```
SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl
[info]   org.apache.spark.sql.AnalysisException: cannot resolve 'testcat.ns1.ns2.tbl.*' given input columns 'id, name';
```
fails to resolve. This PR proposes to fix this issue.

### Why are the changes needed?

To fix the bug described above.

### Does this PR introduce any user-facing change?

Yes, now `SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl` works as expected.

### How was this patch tested?

Added new test.

Closes #27766 from imback82/fix_star_expression.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 00:55:26 +08:00
Takeshi Yamamuro 313e62c376 [SPARK-30998][SQL] ClassCastException when a generator having nested inner generators
### What changes were proposed in this pull request?

The query below failed on the master branch:

```
scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show()
20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    ...
```

This PR modifies the `hasNestedGenerator` check in `ExtractGenerator` to correctly catch nested inner generators.

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27750 from maropu/HandleNestedGenerators.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-03 19:00:33 +09:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()` then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.
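A minimal ScalaTest-style sketch of the difference (the suite base class and conditions are illustrative, not taken from the patch):

```scala
import org.scalatest.funsuite.AnyFunSuite  // ScalaTest 3.1+ naming assumed

class ExampleSuite extends AnyFunSuite {
  test("assume cancels, assert fails") {
    // A false assume cancels/skips the test, e.g. when a prerequisite is missing ...
    assume(sys.props.contains("some.prerequisite"))
    // ... whereas a false assert marks the test as failed.
    assert(1 + 1 == 2)
  }
}
```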

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
Jungtaek Lim (HeartSaVioR) f24a46011c [SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable
### What changes were proposed in this pull request?

This patch fixes a bug in UnsafeRow, which misses handling UDTs specifically in `isFixedLength` and `isMutable`. These methods don't check the underlying SQL type of a UDT, always treating it as variable-length and non-mutable.

It doesn't cause any issue if the UDT is used to represent a complicated type, but when the UDT represents a type that maps to a fixed-length SQL type, it opens the chance of correctness issues, as this information sometimes decides how the value should be handled.

We got a report from the user mailing list suspecting that mapGroupsWithState handles UDTs incorrectly, but after some investigation the issue turned out to come from GenerateUnsafeRowJoiner in the shuffle phase.

0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)

Here, updating the position should not happen on a fixed-length column, but due to this bug, the value of a UDT whose SQL type is fixed-length would be modified, which actually corrupts the value.

### Why are the changes needed?

Misclassifying the length type of a UDT can corrupt the value when the row is given as input to GenerateUnsafeRowJoiner, which causes a correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #27747 from HeartSaVioR/SPARK-30993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 22:33:11 +08:00
Jiaan Geng a429ac83e4 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691
I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.compression.codec | 2.3.0 | SPARK-21839 | d8f45408635d4fccac557cb1e877dfe9267fb326#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.filterPushdown | 1.4.0 | SPARK-2883 | 65d71bd9fbfe6fe1b741c80fed72d6ae3d22b028#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.orc.mergeSchema | 3.0.0 | SPARK-11412 | 73183b3c8c2022846587f08e8dea5c387ed3b8d5#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.verifyPartitionPath | 1.4.0 | Spark-5068 | 1f39a61118184e136f38381a9f3ba0b2d5d589d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.metastorePartitionPruning | 1.5.0 | SPARK-9386 | ce89ff477aea6def68265ed218f6105680755c9a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.manageFilesourcePartitions | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.filesourcePartitionFileCacheSize | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.caseSensitiveInferenceMode | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.optimizer.metadataOnly | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.columnNameOfCorruptRecord | 1.2.0 | SPARK-3339 | 1c7f0ab302de9f82b1bd6da852d133823bc67c66#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.scheduler.pool | 1.1.1 | SPARK-3025 | 496f62d9a98067256d8a51fd1e7a485ff6492fa8#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftServer.incrementalCollect | 2.0.3 | SPARK-18857 | c94288b57b5ce2232e58e35cada558d8d5b8ec6e#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.thriftserver.ui.retainedStatements | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.ui.retainedSessions | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.default | 1.3.0 | SPARK-5658 | a21090ebe1ef7a709709300712de7d928a923244#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.convertCTAS | 2.0.0 | SPARK-15646 | 5a835b99f9852b0c2a35f9c75a51d493474994ea#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.hive.gatherFastStats | 2.0.1 | SPARK-17063 | 3d283f6c9d9daef53fa4e90b0ead2a94710a37a7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.partitionColumnTypeInference.enabled | 1.5.0 | SPARK-7939 | 03ef6be9ce61a13dcd9d8c71298fb4be39119411#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.bucketing.enabled | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.bucketing.maxBuckets | 2.4.0 | SPARK-23997 | de46df549acee7fda56bb0871f444d2f3b49e582#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.crossJoin.enabled | 2.0.0 | SPARK-15425 | 4462da7071462084c5b55cc414c7faa0e1396a18#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.orderByOrdinal | 2.0.0 | SPARK-12789 | 2c5b18fb0fdeabd378dd97e91f72d1eac4e21cc7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByOrdinal | 2.0.0 | SPARK-13957 | 05f652d6c2bbd764a1dd5a45301811e14519486f#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByAliases | 2.2.0 | SPARK-14471 | af3a1411a28796d4d9a100eefb093b1d91532754#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.outputCommitterClass | 1.4.0 | SPARK-7567 | a385f4b8dd22e0e056569cffc4fa63047cb7c8f2#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.commitProtocolClass | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.parallelPartitionDiscovery.threshold | 1.5.0 | SPARK-8125 | a1064df0ee3daf496800be84293345a10e1497d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.parallelPartitionDiscovery.parallelism | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.ignoreDataLocality | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.selfJoinAutoResolveAmbiguity | 1.4.0 | SPARK-6231 | e61083ccab7764d1929248490a3d2e83987241e0#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.analyzer.failAmbiguousSelfJoin | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.retainGroupColumns | 1.4.0 | SPARK-7462 | 9c35f02b35fda80d6558573466735e79b3dd9124#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.pivotMaxValues | 1.6.0 | SPARK-8992 | 5940fc71d2a245cc6e50edb455c3dd3dbb8de43a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.runSQLOnFiles | 1.6.0 | SPARK-11197 | f8c6bec65784de89b47e96a367d3f9790c1b3115#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.codegen.wholeStage | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.useIdInClassName | 2.3.1 | SPARK-23032 | 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.maxFields | 2.0.0 | SPARK-14224 and SPARK-14223 and SPARK-14310 | 5a4b11a901703464b9261dea0642d80cf8d4856c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.factoryMode | 2.4.0 | SPARK-23711 | a40ffc656d62372da85e0fa932b67207839e7fde#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.fallback | 2.0.0 | SPARK-15759 | f0fa0a8946fb4bdf0f4697a8e389f49e98422871#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.logging.maxLines | 2.3.0 | SPARK-20871 | 2a53fbfce72b3faef020e39a1e8628d68bc95beb#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.hugeMethodLimit | 2.3.0 | SPARK-21871 | 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.methodSplitThreshold | 3.0.0 | SPARK-25850 | e017cb39642a5039abd8ce8127ad41712901bdbc#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.splitConsumeFuncByOperator | 2.3.1 | SPARK-21717 | c79e771f8952e6773c3a84cc617145216feddbcf#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.ignoreCorruptFiles | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.ignoreMissingFiles | 2.3.0 | SPARK-22366 | 8e9863531bebbd4d83eafcbc2b359b8bd0ac5734#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxRecordsPerFile | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.exchange.reuse | 2.0.0 | SPARK-13523 | 3dc9ae2e158e5b51df6f799767946fe1d190156b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.execution.reuseSubquery | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.providerClass | 2.3.0 | SPARK-20883 and SPARK-20376 | fa757ee1d41396ad8734a3f2dd045bb09bc82a2e#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.minDeltasForSnapshot | 2.0.0 | SPARK-13809 | 8c826880f5eaa3221c4e9e7d3fece54e821a0b98#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion | 2.4.0 | SPARK-22187 | b3d88ac02940eff4c867d3acb79fe5ff9d724e83#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.checkpointLocation | 2.0.0 | SPARK-13985 | caea15214571d9b12dcf1553e5c1cc8b83a8ba5b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.forceDeleteTempCheckpointLocation | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.minBatchesToRetain | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.maxBatchesToRetainInMemory | 2.4.0 | SPARK-24717 | 8b7d4f842fdc90b8d1c37080bdd9b5e1d070f5c0#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.aggregation.stateFormatVersion | 2.4.0 | SPARK-24763 | 6c5cb85856235efd464b109558896f81ae2c4c75#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stopActiveRunOnRestart | 3.0.0 | SPARK-29568 | 363af16c72abe19fc5cc5b5bdf9d8dc34975f2ba#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.join.stateFormatVersion | 3.0.0 | SPARK-26154 | c941362cb94b24bdf48d4928a1a4dff1b13a1484#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.unsupportedOperationCheck | 2.0.0 | SPARK-14473 | 775cf17eaaae1a38efe47b282b1d6bbdb99bd759#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.variable.substitute | 2.0.0 | SPARK-14769 | 334c293ec0bcc2195d502c574ca40dbc4769d666#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.aggregate.map.twolevel.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.map.vectorized.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.splitAggregateFunc.enabled | 3.0.0 | SPARK-21870 | cb0cddffe9452937033e0e6b1fc0e600d2c787ad#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.view.maxNestedViewDepth | 2.2.0 | SPARK-19877 | ee36bc1c9043ead3c3ba4fba7e68c6c47ad7ae7a#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.commitProtocolClass | 2.1.0 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.multipleWatermarkPolicy | 2.4.0 | SPARK-24730 | 6078b891da8fe7fc36579699473168ae7443284c#diff-9a6b543db706f1a90f790783d6930a13

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #27730 from beliefer/add-version-to-sql-config-part-two.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:13:58 +09:00
iRakson 92a5ae2ae4 [SPARK-30234][SQL][FOLLOWUP] Rename spark.sql.legacy.addDirectory.recursive.enabled to spark.sql.legacy.addSingleFileInAddFile
### What changes were proposed in this pull request?
Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile`

### Why are the changes needed?
To follow the naming convention

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27725 from iRakson/SPARK-30234_CONFIG.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-01 10:55:41 +09:00
iRakson a40a2f8338 [SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType'
### What changes were proposed in this pull request?
Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`.

### Why are the changes needed?
Better readability of configuration.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27719 from iRakson/SPARK-27619_FOLLOWUP.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 22:57:50 +08:00
Wenchen Fan f21894e5fa [SPARK-30902][SQL] Default table provider should be decided by catalog implementations
### What changes were proposed in this pull request?

When a `CREATE TABLE` SQL statement does not specify the provider, leave it to the catalog implementation to decide.

### Why are the changes needed?

It's super weird if we set the default provider to parquet when creating a table in a JDBC catalog.

### Does this PR introduce any user-facing change?

Yes, a v2 catalog will not see a "provider" property in the table properties if it's not specified in the `CREATE TABLE` SQL statement. V2 catalogs are new in 3.0.

### How was this patch tested?

new tests

Closes #27650 from cloud-fan/create_table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 15:14:23 +09:00
Liang-Chi Hsieh ba032acf95 [SPARK-30955][SQL] Exclude Generate output when aliasing in nested column pruning
### What changes were proposed in this pull request?

When aliasing in nested column pruning in Project on top of Generate, we should exclude Generate outputs.

### Why are the changes needed?

Right now we would prune nested columns in Project on top of Generate. It is possible that referred nested columns are from Generate's outputs, not from its child. To address that case, we should exclude Generate outputs when aliasing in nested column pruning.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27702 from viirya/fix-nested-pruning.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:29:46 +09:00
Kent Yao 2d2706cb86 [SPARK-30956][SQL][TESTS] Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
### What changes were proposed in this pull request?

In this PR, I addressed the comment from https://github.com/apache/spark/pull/27672#discussion_r383719562 to use `intercept` instead of a `try-catch` block to assert failures in IntervalUtilsSuite.
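A minimal sketch of the pattern (the exception-producing call below is illustrative, not taken from IntervalUtilsSuite):

```scala
import org.scalatest.Assertions._

// intercept asserts both that the block throws and the exception type,
// replacing a manual try-catch plus fail() dance.
val e = intercept[ArithmeticException] {
  Math.addExact(Long.MaxValue, 1L)
}
assert(e.getMessage.contains("overflow"))
```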

### Why are the changes needed?

improve tests
### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Nah

Closes #27700 from yaooqinn/intervaltest.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-27 23:12:35 +09:00
beliefer 1515d45b8d [SPARK-27924][SQL][FOLLOW-UP] Improve ANSI SQL Boolean-Predicate
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/25074 and improves the implementation.

### Why are the changes needed?
Improve code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #27699 from beliefer/improve-boolean-test.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 13:42:02 +08:00
beliefer 825d3dc11b [SPARK-30841][SQL][DOC] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
Add version information to the configuration of Spark SQL.
Note: Because SQLConf has a lot of configuration items, I split the items into two PRs. Another PR will follow this one.

I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.analyzer.maxIterations | 3.0.0 | SPARK-30138 | c2f29d5ea58eb4565cc5602937d6d0bb75558513#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.excludedRules | 2.4.0 | SPARK-24802 | 434319e73f8cb6e080671bdde42a72228bd814ef#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.maxIterations | 2.0.0 | SPARK-14677 | f4be0946af219379fb2476e6f80b2e50463adeb2#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetConversionThreshold | 2.0.0 | SPARK-14796 | 3647120a5a879edf3a96a5fd68fb7aa849ad57ef#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetSwitchThreshold | 3.0.0 | SPARK-26205 | 0c23a39384b7ae5fb4aeb4f7f6fe72007b84bbd2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.planChangeLog.level | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.rules | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.batches | 3.0.0 | SPARK-27088 | 074533334d01afdd7862a1ac6c5a7a672bcce3f8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.enabled | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.useStats | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly | 3.0.0 | SPARK-30528 | 59a13c9b7bc3b3aa5b5bc30a60344f849c0f8012#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.partitionPruning | 1.2.0 | SPARK-2961 | 248067adbe90f93c7d5e23aa61b3072dfdf48a8a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryTableScanStatistics.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.enableVectorizedReader | 2.3.1 | SPARK-23312 | e5e9f9a430c827669ecfe9d5c13cc555fc89c980#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.columnVector.offheap.enabled | 2.3.0 | SPARK-20101 | 572af5027e45ca96e0d283a8bf7c84dcf476f9bc#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.join.preferSortMergeJoin | 2.0.0 | SPARK-13977 | 9c23c818ca0175c8f2a4a66eac261ec251d27c97#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.sort.enableRadixSort | 2.0.0 | SPARK-14724 | e2b5647ab92eb478b3f7b36a0ce6faf83e24c0e5#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.limit.scaleUpFactor | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.hive.advancedPartitionPredicatePushdown.enabled | 2.3.0 | SPARK-20331 | d8cada8d1d3fce979a4bc1f9879593206722a3b9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.enabled | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |
spark.sql.adaptive.forceApply | 3.0.0 | SPARK-30719 | b29cb1a82b1a1facf1dd040025db93d998dad4cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.reducePostShufflePartitions | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.minNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.localShuffleReader.enabled | 3.0.0 | SPARK-29893 | 6e581cf164c3a2930966b270ac1406dc1195c942#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.enabled | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.skewedPartitionFactor | 3.0.0 | SPARK-30812 | 5b36cdbbfef147e93b35eaa4f8e0bea9690b6d06#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin | 3.0.0 | SPARK-9853 and SPARK-29002 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 and b2f06608b785f577999318c00f2c315f39d90889#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.subexpressionElimination.enabled | 1.6.0 | SPARK-10371 | f38509a763816f43a224653fe65e4645894c9fc4#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.caseSensitive | 1.4.0 | SPARK-4699 | 21bd7222e55b9cf684c072141998a0623a69f514#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.constraintPropagation.enabled | 2.2.0 | SPARK-19846 | e011004bedca47be998a0c14fe22a6f9bb5090cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.escapedStringLiterals | 2.2.1 | SPARK-20399 | 3d1908fd58fd9b1970cbffebdb731bfe4c776ad9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.fileCompressionFactor | 2.3.1 | SPARK-22790 | 0fc5533e53ad03eb67590ddd231f40c2713150c3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.respectSummaryFiles | 1.5.0 | SPARK-8838 | 6175d6cfe795fbd88e3ee713fac375038a3993a8#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96TimestampConversion | 2.3.0 | SPARK-12297 | acf7ef3154e094875fa89f30a78ab111b267db91#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.outputTimestampType | 2.3.0 | SPARK-10365 | 21a7bfd5c324e6c82152229f1394f26afeae771c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown.date | 2.4.0 | SPARK-23727 | b02e76cbffe9e589b7a4e60f91250ca12a4420b2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.filterPushdown.timestamp | 2.4.0 | SPARK-24718 | 43e4e851b642bbee535d22e1b9e72ec6b99f6ed4#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.decimal | 2.4.0 | SPARK-24549 | 9549a2814951f9ba969955d78ac4bd2240f85989#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.string.startsWith | 2.4.0 | SPARK-24638 | 03545ce6de08bd0ad685c5f59b73bc22dfc40887#diff-9a6b543db706f1a90f790783d6930a13 | 
spark.sql.parquet.pushdown.inFilterThreshold | 2.4.0 | SPARK-17091 | e1de34113e057707dfc5ff54a8109b3ec7c16dfb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.output.committer.class | 1.5.0 | SPARK-8139 | 111d6b9b8a584b962b6ae80c7aa8c45845ce0099#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.enableVectorizedReader | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.parquet.recordLevelFilter.enabled | 2.3.0 | SPARK-17310 | 673c67046598d33b9ecf864024ca7a937c1998d6#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #27691 from beliefer/add-version-to-sql-config-part-one.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:58:44 +09:00
iRakson c913b9d8b5 [SPARK-27619][SQL] MapType should be prohibited in hash expressions
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` cannot be used on elements of `MapType`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour.

When `spark.sql.legacy.useHashOnMapType` is set to false:

```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```

When `spark.sql.legacy.useHashOnMapType` is set to true:

```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]

```

### Why are the changes needed?

As discussed in the JIRA, Spark SQL's map hash codes depend on the insertion order, which is not consistent with normal Scala behaviour and might confuse users.
Code snippet from the JIRA:
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```

Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236.

### Does this PR introduce any user-facing change?
Yes. Now users cannot use hash functions on elements of `MapType`. To restore the previous behaviour, set `spark.sql.legacy.useHashOnMapType` to true.

### How was this patch tested?
UT added.

Closes #27580 from iRakson/SPARK-27619.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 01:48:12 +08:00
Terry Kim 73305475c1 [SPARK-30782][SQL] Column resolution doesn't respect current catalog/namespace for v2 tables
### What changes were proposed in this pull request?

This PR proposes to fix an issue where qualified columns are not matched for v2 tables if the current catalog/namespace is used.

For v1 tables, you can currently perform the following:
```SQL
SELECT default.t.id FROM t;
```

For v2 tables, the following fails:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;

org.apache.spark.sql.AnalysisException: cannot resolve '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7;
```

### Why are the changes needed?

It is a bug since qualified column names cannot be matched if the current catalog/namespace is used.

### Does this PR introduce any user-facing change?

Yes, now the following works:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;
```

### How was this patch tested?

Added new tests

Closes #27532 from imback82/qualifed_col_respect_current.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 00:21:38 +08:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
Wenchen Fan 8f247e5d36 [SPARK-30918][SQL] improve the splitting of skewed partitions
### What changes were proposed in this pull request?

Use the average size of the non-skewed partitions as the target size when splitting skewed partitions, instead of ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD

### Why are the changes needed?

The goal of skew join optimization is to make the data distribution more even. So it makes more sense to use the average size of the non-skewed partitions as the target size.
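Purely illustrative arithmetic of the idea (the partition sizes are made up):

```scala
// Made-up partition sizes in bytes; the last one is skewed.
val sizes      = Seq(64L, 60L, 68L, 1024L)
val nonSkewed  = sizes.filter(_ <= 128L)
val targetSize = nonSkewed.sum / nonSkewed.length     // 64 bytes: average of non-skewed sizes
val numSplits  = math.ceil(1024.0 / targetSize).toInt // the skewed partition is split ~16 ways
```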

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27669 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-25 14:10:29 -08:00
Maxim Gekk ffc0935e64 [SPARK-30869][SQL] Convert dates to/from timestamps in microseconds precision
### What changes were proposed in this pull request?
In the PR, I propose to replace:

1. `millisToDays()` by `microsToDays()`, which accepts microseconds since the epoch and returns days since the epoch in the specified time zone, the internal representation of Catalyst's DateType.
2. `daysToMillis()` by `daysToMicros()`, which accepts days since the epoch in some time zone and returns the number of microseconds since the epoch, the internal representation of Catalyst's TimestampType.
3. `fromMillis()` by `millisToMicros()`
4. `toMillis()` by `microsToMillis()`

### Why are the changes needed?
Spark stores timestamps with microsecond precision, so there is no actual need to convert dates to milliseconds and then to microseconds. As examples, look at the DateTimeUtils functions `monthsBetween()` and `truncTimestamp()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites UnivocityParserSuite, DateExpressionsSuite, ComputeCurrentTimeSuite, DateTimeUtilsSuite, DateFunctionsSuite, JsonSuite, StreamSuite.

Closes #27618 from MaxGekk/replace-millis-by-micros.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 23:05:28 +08:00
Kent Yao 761209c1f2 [SPARK-30919][SQL] Make interval multiply and divide's overflow behavior consistent with other operations
### What changes were proposed in this pull request?

The current behavior of interval multiply and divide follows the ANSI SQL standard on overflow; it is compatible with other operations when `spark.sql.ansi.enabled` is true, but not when `spark.sql.ansi.enabled` is false.

When `spark.sql.ansi.enabled` is false, since the factor is a double value, we should use Java's rounding or truncation behavior for casting doubles to integrals. When dividing by zero, the result is `null`. We also follow the natural rules for intervals as defined in the Gregorian calendar, so we do not add the month fraction to days, but we do add the day fraction to microseconds.

### Why are the changes needed?

Make interval multiply and divide's overflow behavior consistent with other interval operations

### Does this PR introduce any user-facing change?

no, these are new features in 3.0

### How was this patch tested?

add uts

Closes #27672 from yaooqinn/SPARK-30919.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 22:19:24 +08:00
Josh Rosen f152d2a0a8 [SPARK-30944][BUILD] Update URL for Google Cloud Storage mirror of Maven Central
### What changes were proposed in this pull request?

This PR is a followup to #27307: per https://travis-ci.community/t/maven-builds-that-use-the-gcs-maven-central-mirror-should-update-their-paths/5926, the Google Cloud Storage mirror of Maven Central has updated its URLs: the new paths are updated more frequently. The new paths are listed on https://storage-download.googleapis.com/maven-central/index.html

This patch updates our build files to use these new URLs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing build + tests.

Closes #27688 from JoshRosen/update-gcs-mirror-url.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-25 17:04:13 +09:00
Peter Toth 1a4e2423b2 [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure
### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.

### Why are the changes needed?
To fix a bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-24 13:46:21 -08:00
beliefer 621e37e2ab [SPARK-28880][SQL] Support ANSI nested bracketed comments
### What changes were proposed in this pull request?
Spark SQL already supports single-line comments and bracketed comments. This PR adds support for nested bracketed comments.
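For illustration, a comment that should parse after this change (a sketch, not one of the added UTs):

```scala
// Illustrative only: an ANSI nested bracketed comment.
spark.sql("""
  /* outer comment /* nested inner comment */ still inside the outer comment */
  SELECT 1 AS nested_comment_ok
""").show()
```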

Some mainstream databases support this syntax:
**PostgreSQL:**
https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Expressions/Comments.htm?zoom_highlight=comments

Note: Because Spark SQL has no UTs for single-line comments and bracketed comments, I add some UTs for them.

### Why are the changes needed?
Nested bracketed comments are part of the ANSI standard.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27495 from beliefer/nested-brancket-comments.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-24 00:28:46 -08:00
Peter Toth a372f76cbd [SPARK-30898][SQL] The behavior of MakeDecimal should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `nullOnOverflow` parameter to `MakeDecimal` so as to avoid its value depending on `SQLConf.get` and changing during planning.

### Why are the changes needed?
This avoids the issue where the configuration changes between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27656 from peter-toth/SPARK-30898.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:48:48 +09:00
Peter Toth 612f63f39e [SPARK-30897][SQL] The behavior of ArrayExists should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `followThreeValuedLogic` parameter to `ArrayExists` so as to avoid its value depending on `SQLConf.get` and change during planning.

### Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27655 from peter-toth/SPARK-30897.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:47:08 +09:00
Burak Yavuz 4ff2718d54 [SPARK-30924][SQL][3.0] Add additional checks to Merge Into
### What changes were proposed in this pull request?

Merge Into is currently missing additional validation around:

 1. The lack of any WHEN statements
 2. The first WHEN MATCHED statement needs to have a condition if there are two WHEN MATCHED statements.
 3. Single use of UPDATE/DELETE

This PR introduces these validations.
(1) is required, because otherwise the MERGE statement is useless.
(2) is required, because otherwise the second WHEN MATCHED condition becomes dead code
(3) is up for debate, but the idea there is that a single expression should be sufficient to specify when you would like to update or delete your records. We restrict it for now to reduce surface area and ambiguity.
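
As an illustration (hypothetical table names), a statement that check (2) would now reject, because the unconditioned first WHEN MATCHED clause makes the second one dead code:
```scala
// Expected to fail the new validation (sketch, run in spark-shell against DSv2 tables):
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN MATCHED THEN DELETE
""")
```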

### Why are the changes needed?

To ease DataSource developers when building implementations for MERGE

### Does this PR introduce any user-facing change?

Adds additional validation checks

### How was this patch tested?

Unit tests

Closes #27677 from brkyvz/mergeChecks.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 15:16:37 +08:00
jiake f4696ba252 [SPARK-30922][SQL] remove the max splits config in skewed join
### What changes were proposed in this pull request?
When the skewed join optimization splits a partition into many skewed readers, the plan may become very large and cannot be shown in the UI quickly. The config `spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits` was introduced to resolve that UI issue. After [PR#27493](https://github.com/apache/spark/pull/27493) combined the skewed readers into one, this config is no longer needed.

### Why are the changes needed?
remove the unnecessary config

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing test

Closes #27673 from JkSelf/removeMaxSplitNum.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:29:25 +08:00
Maxim Gekk c41ef39819 [SPARK-30925][SQL] Prevent overflow/round errors in conversions of milliseconds to/from microseconds
### What changes were proposed in this pull request?
- Use `Math.multiplyExact()` in `DateTimeUtils.fromMillis()` to prevent silent overflow in conversion milliseconds to microseconds.
- Use `DateTimeUtils.fromMillis()` in all places where milliseconds are converted to microseconds
- Use `DateTimeUtils.toMillis()` in all places where microseconds are converted to milliseconds

### Why are the changes needed?

1. To prevent silent arithmetic overflow while multiplying by 1000 in `fromMillis()`. Instead, `new ArithmeticException("long overflow")` will be thrown and handled accordingly (see the sketch after this list).
2. To correctly round microseconds in the conversion to milliseconds. For example, `1965-01-01 10:11:12.123456` is represented as `-157700927876544` with microsecond precision. With millisecond precision, it needs to be represented as `-157700927877`, i.e. `1965-01-01 10:11:12.123`.
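
A small plain-Scala sketch of both points (illustrative values, not the actual `DateTimeUtils` code):
```scala
// 1. Detect overflow instead of silently wrapping around:
val millis = Long.MaxValue / 100
println(millis * 1000L)                  // silently overflows to a meaningless value
// Math.multiplyExact(millis, 1000L)     // would throw ArithmeticException("long overflow")

// 2. Round pre-epoch microseconds down to milliseconds:
val micros = -157700927876544L           // 1965-01-01 10:11:12.123456
println(micros / 1000L)                  // -157700927876: truncates toward zero (wrong)
println(Math.floorDiv(micros, 1000L))    // -157700927877: rounds toward negative infinity
```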

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `TimestampFormatterSuite`, `CastSuite`, `DateExpressionsSuite`, `IntervalExpressionsSuite`, `ExpressionParserSuite`, `ExpressionParserSuite`, `DateTimeUtilsSuite`, `IntervalUtilsSuite`

Closes #27676 from MaxGekk/millis-2-micros-overflow.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:06:25 +08:00
Maxim Gekk 310c14ac8d [MINOR][SQL] Add a comment for removedSQLConfigs
### What changes were proposed in this pull request?
In the PR, I propose to explain in the description of `removedSQLConfigs` when removed SQL configs should NOT be placed in the map.

### Why are the changes needed?
To make it clearer when SQL configs should be added to `removedSQLConfigs`. Recently, `spark.sql.variable.substitute.depth` was removed from the map by #27646 because it contradicted the condition described in that PR.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `./dev/scalastyle`

Closes #27653 from MaxGekk/removedSQLConfigs-comment.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:48:10 +09:00
beliefer 59d6d5cbb0 [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
### What changes were proposed in this pull request?
Spark's `ConfigEntry` and `ConfigBuilder` are missing the Spark version in which each configuration was released. This is inconvenient for Spark users when they visit the Spark configuration page.
http://spark.apache.org/docs/latest/configuration.html
The new Spark SQL config docs looks like:
![SQL config screenshot](https://user-images.githubusercontent.com/8486025/74604522-cb882f00-50f9-11ea-8683-57a90f9e3347.png)

```
> SET -v
spark.sql.adaptive.enabled      false   When true, enable adaptive query execution.
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin       0.2     The relation with a non-empty partition ratio lower than this config will not be considered as the build side of a broadcast-hash join in adaptive execution regardless of its size.This configuration only has an effect when 'spark.sql.adaptive.enabled' is enabled.
spark.sql.adaptive.optimizeSkewedJoin.enabled   true    When true and adaptive execution is enabled, a skewed join is automatically handled at runtime.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor     10      A partition is considered as a skewed partition if its size is larger than this factor multiple the median partition size and also larger than  spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionMaxSplits  5       Configures the maximum number of task to handle a skewed partition in adaptive skewedjoin.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold      64MB    Configures the minimum size in bytes for a partition that is considered as a skewed partition in adaptive skewed join.
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled    true    Whether to fetch the continuous shuffle blocks in batch. Instead of fetching blocks one by one, fetching continuous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note, multiple continuous blocks exist in single fetch request only happen when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled, this feature also depends on a relocatable serializer, the concatenation support codec in use and the new version shuffle fetch protocol.
spark.sql.adaptive.shuffle.localShuffleReader.enabled   true    When true and 'spark.sql.adaptive.enabled' is enabled, this enables the optimization of converting the shuffle reader to local shuffle reader for the shuffle exchange of the broadcast hash join in probe side.
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions  <undefined>     The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled.
```

**Note**: Because there are so many exposed configuration items that require follow-up work, I will add the version numbers for those configuration items in another PR.
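
A self-contained sketch of the idea (hypothetical names, not the real internal `ConfigBuilder` API):
```scala
final case class SketchConfigEntry[T](key: String, doc: String, version: String, default: T)

final class SketchConfigBuilder(key: String) {
  private var docText = ""
  private var addedIn = ""
  def doc(d: String): SketchConfigBuilder = { docText = d; this }
  def version(v: String): SketchConfigBuilder = { addedIn = v; this }
  def createWithDefault[T](default: T): SketchConfigEntry[T] =
    SketchConfigEntry(key, docText, addedIn, default)
}

object VersionedConfigDemo {
  def main(args: Array[String]): Unit = {
    val entry = new SketchConfigBuilder("spark.sql.adaptive.enabled")
      .doc("When true, enable adaptive query execution.")
      .version("1.6.0") // illustrative version string only
      .createWithDefault(false)
    println(s"${entry.key} (since ${entry.version}): ${entry.doc}")
  }
}
```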

### Why are the changes needed?
To supplement the configurations with version information.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Existing UTs.

Closes #27592 from beliefer/add-version-to-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:46:42 +09:00
Eric Wu 1f0300fb16 [SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style
### What changes were proposed in this pull request?
The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in
https://github.com/apache/spark/pull/27368#discussion_r376694496
https://github.com/apache/spark/pull/27368#discussion_r376927143

Observations/Ideas:
1. Using comma as the separator is not clear, especially commas are used inside the expressions too.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated; this needs to be refined.
4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between
`EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`.
5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR currently addresses comments 2 & 4, and is open for more discussion on improving readability.
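
For reference, a quick way to inspect the affected output in spark-shell (illustrative query; the exact formatting is what this PR refines):
```scala
spark.sql("""
  EXPLAIN FORMATTED
  SELECT key, count(*) FROM VALUES (1, 'a'), (2, 'b') AS t(key, v) GROUP BY key
""").show(false)
```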

### Why are the changes needed?
The readability of `EXPLAIN FORMATTED` needs to be improved, which will help users better understand the query plan.

### Does this PR introduce any user-facing change?
Yes, `EXPLAIN FORMATTED` output style changed.

### How was this patch tested?
Updated the expected results of the test cases in explain.sql.

Closes #27509 from Eric5553/ExplainFormattedRefine.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 23:36:14 +08:00
Yuanjian Li a5efbb284e [SPARK-30809][SQL] Review and fix issues in SQL API docs
### What changes were proposed in this pull request?
- Add missing `since` annotation.
- Don't show classes under `org.apache.spark.sql.dynamicpruning` package in API docs.
- Fix the scope of `xxxExactNumeric` to remove it from the API docs.

### Why are the changes needed?
Avoid leaking APIs unintentionally in Spark 3.0.0.

### Does this PR introduce any user-facing change?
No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.

### How was this patch tested?
Manually generated the API docs and verified the above issues have been fixed.

Closes #27560 from xuanyuanking/SPARK-30809.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 17:03:22 +08:00
Maxim Gekk abe0821ee9 [SPARK-30894][SQL] Make Size's nullable independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `legacySizeOfNull` parameter to the `Size` expression, and pass the value of `spark.sql.legacy.sizeOfNull` if `legacySizeOfNull` is not provided on creation of `Size`.
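
For context, a hedged illustration of the behavior the flag controls (run in spark-shell; the exact output depends on the config value):
```scala
// With the legacy behavior (spark.sql.legacy.sizeOfNull = true), size(NULL) returns -1;
// otherwise it returns NULL.
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()
```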

### Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `CollectionExpressionsSuite`.

Closes #27658 from MaxGekk/Size-SQLConf-get-deps.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 15:32:11 +08:00
yi.wu 82ce4753aa [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default
### What changes were proposed in this pull request?

This PR proposes to throw an exception by default when users use an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?

According to #23498, since Spark 3.0, the untyped UDF returns the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.
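
An illustrative spark-shell sketch of the two variants (names are arbitrary):
```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped variant: now throws by default, unless the legacy flag mentioned above is enabled.
// val f = udf((x: Int) => x, IntegerType)

// Typed variant: preserves null semantics and is the recommended replacement.
val g = udf((x: Int) => x)
```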

### Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using an untyped UDF.

### How was this patch tested?

Added test and updated some tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 14:46:54 +08:00
yi.wu 4d356554a6 [MINOR][SQL] Fix error position of NOSCAN
### What changes were proposed in this pull request?

Point to the correct position when a miswritten `NOSCAN` is detected.

### Why are the changes needed?

Before:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 0)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ^^^
```

After:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 78)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ------------------------------------------------------------------------------^^^
```

### Does this PR introduce any user-facing change?

Yes, users will see a better error message.

### How was this patch tested?

Manually tested.

Closes #27662 from Ngone51/fix_noscan_reference.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-21 15:21:53 +09:00
wuyi 5eb004f4bb Revert "[SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue"
### What changes were proposed in this pull request?

This reverts commit bef5d9d6c3.

### Why are the changes needed?

Revert it according to https://github.com/apache/spark/pull/24902#issuecomment-584511167.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27540 from Ngone51/revert_spark_28093.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 12:55:32 +08:00
Maxim Gekk bb40ab09f4 [SPARK-30892][SQL] Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
### What changes were proposed in this pull request?

Exclude the SQL config `spark.sql.variable.substitute.depth` from `SQLConf.removedSQLConfigs`

### Why are the changes needed?
By #27169, the config was placed in `SQLConf.removedSQLConfigs`. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for SQL configs that could impact behavior, but not for this particular config. Raising such an exception just makes migration to Spark 3.0 more difficult.

### Does this PR introduce any user-facing change?
Yes. Before the changes, users got an exception when setting `spark.sql.variable.substitute.depth` to a value different from `40`.

### How was this patch tested?
Run `spark.conf.set("spark.sql.variable.substitute.depth", 1)` in `spark-shell`.

Closes #27646 from MaxGekk/remove-substitute-depth-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:44:09 +08:00
Maxim Gekk a551715fd2 [SPARK-29930][SPARK-30416][SQL][FOLLOWUP] Move deprecated/removed config checks from RuntimeConfig to SQLConf
### What changes were proposed in this pull request?
- Output warnings for deprecated SQL configs in `SQLConf.setConfWithCheck()` and in `SQLConf.unsetConf()`
- Throw an exception for removed SQL configs in `SQLConf.setConfWithCheck()` when they are set to non-default values
- Remove checking of deprecated and removed SQL configs from RuntimeConfig

### Why are the changes needed?
Currently, warnings/exceptions are printed only when a SQL config is set dynamically, for instance via `spark.conf.set()`. After the changes, removed/deprecated SQL configs will also be checked when they are set statically. For example:
```
$ bin/spark-shell --conf spark.sql.fromJsonForceNullableSchema=false
scala> spark.emptyDataFrame
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
...
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value.
```
```
$ bin/spark-shell --conf spark.sql.hive.verifyPartitionPath=false
scala> spark.emptyDataFrame
20/02/20 02:10:26 WARN SQLConf: The SQL config 'spark.sql.hive.verifyPartitionPath' has been deprecated in Spark v3.0 and may be removed in the future. This config is replaced by 'spark.files.ignoreMissingFiles'.
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `SQLConfSuite`

Closes #27645 from MaxGekk/remove-sql-configs-followup-2.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:00:48 +08:00
Maxim Gekk 4248b7fbb9 [SPARK-30858][SQL] Make IntegralDivide's dataType independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

### Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `ArithmeticExpressionSuite`.

Closes #27628 from MaxGekk/integral-divide-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-20 21:26:20 +08:00
Gengliang Wang 92d5d40c8e [SPARK-30881][SQL][DOCS] Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold
### What changes were proposed in this pull request?

Revise the doc of SQL configuration `spark.sql.sources.parallelPartitionDiscovery.threshold`.
### Why are the changes needed?

The doc of configuration "spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate on the part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources".

We should revise it to state that it is effective for all file-based data sources.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

None. It's just doc.

Closes #27639 from gengliangwang/reviseParallelPartitionDiscovery.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-20 00:59:22 -08:00
herman c92d437c46 [SPARK-30811][SQL] CTE should not cause stack overflow when it refers to non-existent table with same name
### Why are the changes needed?
This ports the tests introduced in 7285eea683 to master to avoid future regressions.

### Background
A query with Common Table Expressions can cause a stack overflow when it contains a CTE that refers to a non-existing table with the same name. The table name needs to have a database qualifier. This is caused by a couple of things:

- CTESubstitution runs analysis on the CTE, but this does not throw an exception because the table has a database qualifier. The reason we don't fail there is that we re-attempt to resolve the relation in a later rule;
- The CTESubstitution replace logic does not check whether the table it is replacing has a database qualifier; it shouldn't replace the relation if it does. As a result, it will happily replace `nonexist.t` with `t`.

Note that this is not an issue for master or the spark-3.0 branch.
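
A hedged sketch of the query shape described above, where the CTE body refers to a non-existent, database-qualified table with the CTE's own name:
```scala
// On affected 2.4.x builds this could recurse in CTESubstitution and overflow the stack;
// with the fix it fails with a regular "table not found" style error instead.
spark.sql("""
  WITH t AS (SELECT * FROM nonexist.t)
  SELECT * FROM t
""")
```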

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added regression test to `AnalysisErrorSuite` and `DataFrameSuite`.

Closes #27635 from hvanhovell/SPARK-30811-master.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-19 10:17:46 -08:00
jiake 10a4eafcfe [SPARK-30812][SQL] update the skew join configs by adding the prefix "skewedJoinOptimization"
### What changes were proposed in this pull request?
This is a follow-up to [PR#27563](https://github.com/apache/spark/pull/27563).
This PR adds the prefix "skewedJoinOptimization" to the skew join related configs.

### Why are the changes needed?
Address the remaining review comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Only updates configs; no new UTs are needed.

Closes #27630 from JkSelf/renameskewjoinconfig.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 15:55:29 +08:00
Wenchen Fan 1b67d546bd revert SPARK-29663 and SPARK-29688
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26325 and https://github.com/apache/spark/pull/26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type for intervals. Also, the semantics are unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days; we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-18 21:19:57 +01:00
yi.wu 68d7edf949 [SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy
### What changes were proposed in this pull request?

Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html):

SQL:
* spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083)
* spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252)
* spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544)
* spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183)
* spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389)
* spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344)
* spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074)
* spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811)
* spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586)
* spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052)
* spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189)
* spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853)

CORE:
* spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855)
* spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235)
* spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001)
* spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371)
* spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429)

### Why are the changes needed?

To comply with the config naming policy.

### Does this PR introduce any user-facing change?

No. Configurations listed above are all newly added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27563 from Ngone51/revise_boolean_conf_name.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 20:39:50 +08:00
yi.wu 643a480b11 [SPARK-30863][SQL] Distinguish Cast and AnsiCast in toString
### What changes were proposed in this pull request?

Prefix with `ansi_` in `toString` if it's an `AnsiCast` or an ANSI-enabled `Cast`.

E.g. run `spark.sql("select cast('51' as int)").queryExecution.analyzed` under ansi mode.

Before this PR:
```
Project [cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

After this PR:
```
Project [ansi_cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

### Why are the changes needed?

This is useful while comparing `LogicalPlan`s literally.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27608 from Ngone51/ansi_cast_tostring.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 16:10:43 +08:00
HyukjinKwon 9618806f44 [SPARK-30847][SQL] Take productPrefix into account in MurmurHash3.productHash
### What changes were proposed in this pull request?

This PR proposes to port Scala's bugfix https://github.com/scala/scala/pull/7693 (Scala 2.13) to address https://github.com/scala/bug/issues/10495 issue.

In short, it is possible for different product instances with the same children to have the same hash. See:

```scala
scala> spark.range(1).selectExpr("id - 1").queryExecution.analyzed.semanticHash()
res0: Int = -565572825

scala> spark.range(1).selectExpr("id + 1").queryExecution.analyzed.semanticHash()
res1: Int = -565572825
```

### Why are the changes needed?

It was found during the review of https://github.com/apache/spark/pull/27565. We should produce different hashes for different objects.
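
A small plain-Scala illustration of the idea, mixing `productPrefix` into the hash so that distinct case classes with identical children hash differently (a sketch, not the exact ported code):
```scala
import scala.util.hashing.MurmurHash3

case class Add(l: Int, r: Int)
case class Sub(l: Int, r: Int)

object ProductHashSketch {
  // Mix the product's prefix (its class name) into the hash before the children's hash.
  def prefixedProductHash(p: Product): Int =
    MurmurHash3.mix(MurmurHash3.stringHash(p.productPrefix), MurmurHash3.productHash(p))

  def main(args: Array[String]): Unit = {
    val a = Add(1, 2)
    val s = Sub(1, 2)
    println(MurmurHash3.productHash(a) == MurmurHash3.productHash(s)) // may be true (collision)
    println(prefixedProductHash(a) == prefixedProductHash(s))         // almost certainly false
  }
}
```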

### Does this PR introduce any user-facing change?

No user-facing change has been identified; there is possibly a performance-related impact.

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #27601 from HyukjinKwon/SPARK-30847.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 14:28:44 +08:00
Terry Kim 5866bc77d7 [SPARK-30814][SQL] ALTER TABLE ... ADD COLUMN position should be able to reference columns being added
### What changes were proposed in this pull request?

In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema:
```
root:
  - a: string
  - b: long
```
, the following should work:
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug described above.

### Does this PR introduce any user-facing change?

Yes, now
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
works as expected.

### How was this patch tested?

Added new tests

Closes #27584 from imback82/alter_table_pos_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 13:01:45 +08:00
beliefer d8d3ce5c76 [SPARK-30825][SQL][DOC] Update documents information for window function
### What changes were proposed in this pull request?
I checked all the window functions and found that none of them add parameter information or version information to the documentation.
This PR supplements that information.

### Why are the changes needed?
Documentation is missing and does not meet new standards.

### Does this PR introduce any user-facing change?
Yes. Users will see the parameter and version information.

### How was this patch tested?
Existing UTs.

Closes #27572 from beliefer/add_since_for_window_function.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-18 09:44:34 +09:00
Yuanjian Li e4a541b278 [SPARK-30829][SQL] Define LegacyBehaviorPolicy enumeration as the common value for result change configs
### What changes were proposed in this pull request?
Define a new enumeration `LegacyBehaviorPolicy` in SQLConf, it will be used as the common value for result change configs.

### Why are the changes needed?
During API auditing for the 3.0 release, we found several new approaches that will change the results silently. For these features, we need a common three-value config.
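
A sketch of what such a three-valued enumeration could look like (the value names here are an assumption for illustration, not necessarily the final ones):
```scala
object LegacyBehaviorPolicySketch extends Enumeration {
  // EXCEPTION: fail loudly; LEGACY: keep the old behavior; CORRECTED: use the new behavior.
  val EXCEPTION, LEGACY, CORRECTED = Value
}
```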

### Does this PR introduce any user-facing change?
Yes, the original config `spark.sql.legacy.ctePrecedence.enabled` changes to `spark.sql.legacy.ctePrecedencePolicy`.

### How was this patch tested?
Existing UT.

Closes #27579 from xuanyuanking/SPARK-30829.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 00:52:05 +08:00
Maxim Gekk 06217cfded [SPARK-30793][SQL] Fix truncations of timestamps before the epoch to minutes and seconds
### What changes were proposed in this pull request?
In the PR, I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `SECOND` and `MINUTE` levels.

### Why are the changes needed?
This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `SECOND` and `MINUTE` levels. For example, timestamps after the epoch are truncated by cutting off the remaining part of the timestamp:
```sql
spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123');
2020-02-11 00:01:02
```
but seconds in the truncated timestamp before the epoch are increased by 1:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:03
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the example above outputs correct result:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:02
```

### How was this patch tested?
Added new tests to `DateFunctionsSuite`.

Closes #27543 from MaxGekk/fix-second-minute-truc.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:51:56 +08:00
Yuanjian Li ab186e3659 [SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior
### What changes were proposed in this pull request?
This is a follow-up for #23124. It adds a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found.
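
A hedged illustration of the new default (run in spark-shell):
```scala
// With spark.sql.legacy.allowDuplicatedMapKeys=false (the new default), building a map with
// duplicated keys now fails at runtime instead of silently keeping only one of the entries.
spark.sql("SELECT map(1, 'a', 1, 'b')").show() // expected to throw a RuntimeException
```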

### Why are the changes needed?
Prevent silent behavior changes.

### Does this PR introduce any user-facing change?
Yes. A new config is added, and the default behavior for duplicated map keys is changed to throwing a RuntimeException.

### How was this patch tested?
Modify existing UT.

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:06:58 +08:00
Maxim Gekk 9107f77f15 [SPARK-30843][SQL] Fix getting of time components before 1582 year
### What changes were proposed in this pull request?

1. Rewrite DateTimeUtils methods `getHours()`, `getMinutes()`, `getSeconds()`, `getSecondsWithFraction()`, `getMilliseconds()` and `getMicroseconds()` using Java 8 time APIs. This will automatically switch the `Hour`, `Minute`, `Second` and `DatePart` expressions on Proleptic Gregorian calendar.
2. Remove unused methods and constant of DateTimeUtils - `to2001`, `YearZero `, `toYearZero` and `absoluteMicroSecond()`.
3. Remove unused value `timeZone` from `TimeZoneAwareExpression` since all expressions have been migrated to Java 8 time API, and legacy instance of `TimeZone` is not needed any more.
4. Change signatures of modified DateTimeUtils methods, and pass `ZoneId` instead of `TimeZone`. This will allow to avoid unnecessary conversions `TimeZone` -> `String` -> `ZoneId`.
5. Modify tests in `DateTimeUtilsSuite` and in `DateExpressionsSuite` to pass `ZoneId` instead of `TimeZone`. Correct the tests, to pass tested zone id instead of None.

### Why are the changes needed?
The changes fix the issue of wrong results returned by the `hour()`, `minute()`, `second()`, `date_part('millisecond', ...)` and `date_part('microsecond', ....)`, see example in [SPARK-30843](https://issues.apache.org/jira/browse/SPARK-30843).

### Does this PR introduce any user-facing change?
Yes. After the changes, the results of examples from SPARK-30843:
```sql
spark-sql> select hour(timestamp '0010-01-01 00:00:00');
0
spark-sql> select minute(timestamp '0010-01-01 00:00:00');
0
spark-sql> select second(timestamp '0010-01-01 00:00:00');
0
spark-sql> select date_part('milliseconds', timestamp '0010-01-01 00:00:00');
0.000
spark-sql> select date_part('microseconds', timestamp '0010-01-01 00:00:00');
0
```

### How was this patch tested?
- By existing test suites `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`.
- Add new tests to `DateExpressionsSuite` and `DateTimeUtilsSuite` for 10 year, like:
```scala
  input = date(10, 1, 1, 0, 0, 0, 0, zonePST)
  assert(getHours(input, zonePST) === 0)
```
- Re-run `DateTimeBenchmark` using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 |

Closes #27596 from MaxGekk/localtimestamp-greg-cal.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-1-30.us-west-2.compute.internal>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 13:59:21 +08:00
Wenchen Fan ab07c6300c [SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view
### What changes were proposed in this pull request?

No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption.

However, `ResolveSessionCatalog` needs to fallback to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we add a hack in `CatalogAndIdentifier`, which does not expand the given identifier with current namespace if the catalog is session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to table "default.t" if temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by
1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views.
2. update `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support.

Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix.

### How was this patch tested?

new tests

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 12:07:46 +08:00
DB Tsai d0f9614760 [SPARK-30289][SQL] Partitioned by Nested Column for InMemoryTable
### What changes were proposed in this pull request?
1. `InMemoryTable` was flattening the nested columns, and the flattened columns were then used to look up the indices, which is not correct.

This PR implements partitioning by nested columns for `InMemoryTable`.

### Why are the changes needed?

This PR implements partitioning by nested columns for `InMemoryTable`, so we can test this feature in DSv2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests and new tests.

Closes #26929 from dbtsai/addTests.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-02-14 21:46:01 +00:00
Maxim Gekk 7137a6d065 [SPARK-30766][SQL] Fix the timestamp truncation to the HOUR and DAY levels
### What changes were proposed in this pull request?
In the PR, I propose to use Java 8 time API in timestamp truncations to the levels of `HOUR` and `DAY`. The problem is in the usage of `timeZone.getOffset(millis)` in days/hours truncations where the combined calendar (Julian + Gregorian) is used underneath.

### Why are the changes needed?
The change fixes wrong truncations. For example, the following truncation to hours should print `0010-01-01 01:00:00`, but it outputs a wrong timestamp:
```scala
Seq("0010-01-01 01:02:03.123456").toDF()
    .select($"value".cast("timestamp").as("ts"))
    .select(date_trunc("HOUR", $"ts").cast("string"))
    .show(false)
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:30:17                 |
+------------------------------------+
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the result of the example above is:
```scala
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:00:00                 |
+------------------------------------+
```

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- By `DateExpressionsSuite` and `DateTimeUtilsSuite`

Closes #27512 from MaxGekk/fix-trunc-old-timestamp.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 22:16:57 +08:00
Yuming Wang fb0e07b08c [SPARK-29231][SQL] Constraints should be inferred from cast equality constraint
### What changes were proposed in this pull request?

This PR adds support for inferring constraints from cast equality constraints. For example:
```scala
scala> spark.sql("create table spark_29231_1(c1 bigint, c2 bigint)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_29231_2(c1 int, c2 bigint)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#5L, c2#6L]
+- *(2) BroadcastHashJoin [c1#5L], [cast(c1#7 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#5L, c2#6L]
   :  +- *(2) Filter (isnotnull(c1#5L) AND (c1#5L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#5L,c2#6L] Batched: true, DataFilters: [isnotnull(c1#5L), (c1#5L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#209]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

After this PR:
```scala
scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#0L, c2#1L]
+- *(2) BroadcastHashJoin [c1#0L], [cast(c1#2 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#0L, c2#1L]
   :  +- *(2) Filter (isnotnull(c1#0L) AND (c1#0L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#0L,c2#1L] Batched: true, DataFilters: [isnotnull(c1#0L), (c1#0L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#99]
      +- *(1) Project [c1#2]
         +- *(1) Filter ((cast(c1#2 as bigint) = 1) AND isnotnull(c1#2))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#2] Batched: true, DataFilters: [(cast(c1#2 as bigint) = 1), isnotnull(c1#2)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27252 from wangyum/SPARK-29231.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 22:48:27 +08:00
Terry Kim a6b4b914f2 [SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax
### What changes were proposed in this pull request?

This PR proposes to support Hive-style `ALTER TABLE ... REPLACE COLUMNS ...` as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

The user now can do the following:
```SQL
CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');
```
, which drops the existing columns `col1` and `col2`, and adds the new columns `col2` and `col3`.

### Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style `ALTER TABLE ... CHANGE COLUMN ...`, so this new addition can be useful.

### Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

### How was this patch tested?

More tests to be added.

Closes #27482 from imback82/replace_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 20:13:36 +08:00
maryannxue 453d5261b2 [SPARK-30528][SQL] Turn off DPP subquery duplication by default
### What changes were proposed in this pull request?
This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression.
When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation being on the build side, otherwise it will either opt out or plan the filter as an un-reusable subquery duplication based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, thus adding an un-reusable subquery duplication DPP filter can sometimes cause perf regression.
This PR turns off the subquery duplication DPP filter by:
1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default.
2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible.

### Why are the changes needed?
This is to fix a potential performance regression caused by DPP.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Updated DynamicPartitionPruningSuite to test the new configuration.

Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 19:32:38 +08:00
iRakson 926e3a1efe [SPARK-30790] The dataType of map() should be map<null,null>
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes, it will return map<null,null>.

### Why are the changes needed?
After changes introduced due to #27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 12:23:40 +08:00
Maxim Gekk aa0d13683c [SPARK-30760][SQL] Port millisToDays and daysToMillis on Java 8 time API
### What changes were proposed in this pull request?
In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using Java 8 time API.

I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private methods, and is not used anymore in Spark SQL.

### Why are the changes needed?
The new implementation is based on the Proleptic Gregorian calendar, which is already used by other date-time functions. This change makes `millisToDays` and `daysToMillis` consistent with the rest of the Spark SQL API related to date & time operations.

### Does this PR introduce any user-facing change?
Yes, this might affect behavior for old dates before the year 1582.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`.

Closes #27494 from MaxGekk/millis-2-days-java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 02:31:48 +08:00
Maxim Gekk 61b1e608f0 [SPARK-30759][SQL][TESTS][FOLLOWUP] Check cache initialization in StringRegexExpression
### What changes were proposed in this pull request?
Added new test to `RegexpExpressionsSuite` which checks that `cache` of compiled pattern is set when the `right` expression (pattern in `LIKE`) is a foldable expression.

### Why are the changes needed?
To be sure that `cache` in `StringRegexExpression` is initialized for foldable patterns.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the added test in `RegexpExpressionsSuite`.

Closes #27547 from MaxGekk/regexp-cache-test.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 23:50:34 +08:00
Maxim Gekk c1986204e5 [SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters
### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see https://github.com/apache/spark/pull/26507 & https://github.com/apache/spark/pull/26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 20:12:38 +08:00
beliefer f5026b1ba7 [SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
### What changes were proposed in this pull request?
The current implementation of `regexp_extract` throws an unhandled exception, shown below:

`SELECT regexp_extract('1a 2b 14m', 'd+')`
```
java.lang.IndexOutOfBoundsException: No group 1
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
```
I think we should handle this exception properly.

### Why are the changes needed?
Fix the bug `java.lang.IndexOutOfBoundsException: No group 1`.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
New UT

Closes #27508 from beliefer/fix-regexp_extract-bug.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 14:49:22 +08:00
Kris Mok b4769998ef [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
### What changes were proposed in this pull request?

This PR proposes to make the `code` string interpolator treat escapes the same way as Scala's builtin `StringContext.s()` string interpolator. This will remove the need for an ugly workaround in `Like` expression's codegen.

### Why are the changes needed?

The `code()` string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin `StringContext.s()` interpolator, i.e. it should treat escapes in the code parts, and should not treat escapes in the input arguments.

For example,
```scala
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
```
We should expect the `code()` interpolator to produce the same result as the `StringContext.s()` interpolator, where only escapes in the string parts should be treated, while the args should be kept verbatim.

But in the current implementation, due to the eager folding of code parts and literal input args, the escape treatment is incorrectly done on both code parts and literal args.
That causes a problem when an arg contains escape sequences and wants to preserve that in the final produced code string. For example, in `Like` expression's codegen, there's an ugly workaround for this bug:
```scala
      // We need double escape to avoid org.codehaus.commons.compiler.CompileException.
      // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
      // '\"' will cause exception 'Line break in literal not allowed'.
      val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
        s"""\\\\\\$escapeChar"""
      } else {
        escapeChar
      }
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new unit test case in `CodeBlockSuite`.

Closes #27544 from rednaxelafx/fix-code-string-interpolator.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 15:19:16 +09:00
Maxim Gekk 45db48e2d2 Revert "[SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?

In the PR, I propose to revert the commit 8aebc80e0e.

### Why are the changes needed?
See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 10:15:34 -08:00
HyukjinKwon 99bd59fe29 [SPARK-29462][SQL][DOCS] Add some more context and details in 'spark.sql.defaultUrlStreamHandlerFactory.enabled' documentation
### What changes were proposed in this pull request?

This PR adds some more information and context to `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Why are the changes needed?

It is a bit difficult to understand the documentation of `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Does this PR introduce any user-facing change?

Nope, internal doc only fix.

### How was this patch tested?

Nope. I only tested linter.

Closes #27541 from HyukjinKwon/SPARK-29462-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 09:55:02 -08:00
Maxim Gekk dc66d57e98 [SPARK-30754][SQL] Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
### What changes were proposed in this pull request?
In the case of back-to-back calculations of `floorDiv` and `floorMod` with the same arguments, the result of `floorDiv` can be reused in the calculation of `floorMod`. The `floorMod` method is defined as follows in the Java standard library:
```java
    public static int floorMod(int x, int y) {
        int r = x - floorDiv(x, y) * y;
        return r;
    }
```
If `floorDiv(x, y)` has already been calculated, it can be reused in `x - floorDiv(x, y) * y`.

I propose to modify 2 places in `DateTimeUtils`:
1. `microsToInstant` which is widely used in many date-time functions. `Math.floorMod(us, MICROS_PER_SECOND)` is just replaced by its definition from Java Math library.
2. `truncDate`: `Math.floorMod(oldYear, divider) == 0` is replaced by `Math.floorDiv(oldYear, divider) * divider == oldYear` where `floorDiv(...) * divider` is pre-calculated.
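
A tiny sketch of the reuse described above (plain Scala, illustrative values):
```scala
val us = -1234567L
val q = Math.floorDiv(us, 1000000L)   // computed once
val r = us - q * 1000000L             // reuses q instead of a separate Math.floorMod call
assert(r == Math.floorMod(us, 1000000L))
```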

### Why are the changes needed?
This reduces the number of arithmetic operations, and can slightly improve performance of date-time functions.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27491 from MaxGekk/opt-microsToInstant.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-11 09:07:40 -06:00
HyukjinKwon 0045be766b [SPARK-29462][SQL] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?

This brings https://github.com/apache/spark/pull/26324 back. It was originally reverted mainly because of Hive compatibility concerns and the lack of investigation into other DBMSes and the ANSI standard.

- PostgreSQL seems to coerce a NULL literal to the TEXT type.
- Presto seems to coerce `array() + array(1)` to an array of int.
- Hive seems to coerce `array() + array(1)` to an array of strings.

Given that, the design choices were made differently for their own reasons. If we have to pick one of the two, coercing to an array of int seems to make much more sense.

Another investigation was made offline internally. ANSI SQL 2011, section 6.5 "<contextually typed value specification>", seems to state:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading the other related context, the element type here resolves to `NullType`. Given the investigation made, choosing `null` seems correct, and we now have Presto as a reference. Therefore, this PR proposes to bring the change back.

### Why are the changes needed?
When an empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.
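
For illustration, a hedged spark-shell snippet of the new behavior (using `concat` to combine arrays; `concat` here stands in for the `+` shorthand used above):

```scala
// With this change, an empty array literal is typed array<null> instead of
// array<string>, so combining it with a typed array keeps the other element type.
spark.sql("SELECT array()").schema                    // element type: null (was string)
spark.sql("SELECT concat(array(1), array())").schema  // element type stays int
```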

### How was this patch tested?
Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-11 17:22:08 +09:00
Shixiong Zhu e2ebca733c [SPARK-30779][SS] Fix some API issues found when reviewing Structured Streaming API docs
### What changes were proposed in this pull request?

- Fix the scope of `Logging.initializeForcefully` so that it doesn't appear in subclasses' public methods. Right now, `sc.initializeForcefully(false, false)` is allowed to be called.
- Don't show classes under `org.apache.spark.internal` package in API docs.
- Add missing `since` annotation.
- Fix the scope of `ArrowUtils` to remove it from the API docs.

### Why are the changes needed?

Avoid leaking APIs unintentionally in Spark 3.0.0.

### Does this PR introduce any user-facing change?

No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.

### How was this patch tested?

Manually generated the API docs and verified the above issues have been fixed.

Closes #27528 from zsxwing/audit-ss-apis.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-10 14:26:14 -08:00
Maxim Gekk 3c1c9b48fc [SPARK-30759][SQL] Initialize cache for foldable patterns in StringRegexExpression
### What changes were proposed in this pull request?
In the PR, I propose to fix `cache` initialization in `StringRegexExpression` by changing `case Literal(value: String, StringType)` to `case p: Expression if p.foldable`
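
For illustration, a minimal, self-contained model of the caching idea (`Expr`, `Lit` and `LikeModel` are hypothetical names, not the actual `StringRegexExpression` internals):

```scala
import java.util.regex.Pattern

// Simplified stand-ins for Catalyst expressions.
trait Expr { def foldable: Boolean; def eval(): Any }
case class Lit(value: Any) extends Expr { val foldable = true; def eval(): Any = value }

class LikeModel(pattern: Expr) {
  // Old match, `case Literal(value: String, StringType)`, never fired because a
  // literal's value is a UTF8String; matching on foldability fixes that and also
  // covers other constant-foldable pattern expressions.
  lazy val cache: Option[Pattern] =
    if (pattern.foldable) Option(pattern.eval()).map(v => Pattern.compile(v.toString))
    else None
}

// Usage: the compiled pattern is cached once for a constant pattern.
val cached = new LikeModel(Lit(".*Spark.*")).cache
```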

### Why are the changes needed?
Actually, the case doesn't work at all because:
1. A literal's value has type `UTF8String`, not `String`.
2. It doesn't cover foldable expressions, as in the example:
```sql
SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
```
<img width="649" alt="Screen Shot 2020-02-08 at 22 45 50" src="https://user-images.githubusercontent.com/1580697/74091681-0d4a2180-4acb-11ea-8a0d-7e8c65f4214e.png">

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By the `check outputs of expression examples` test from `SQLQuerySuite`.

Closes #27502 from MaxGekk/str-regexp-foldable-pattern.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-10 12:51:37 -08:00
HyukjinKwon 4439b29bd2 Revert "[SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static"
### What changes were proposed in this pull request?

This reverts commit 8ce7962931. There are variable name conflicts with 8aebc80e0e (diff-39298b470865a4cbc67398a4ea11e767).

This can be cleanly ported back to branch-3.0.

### Why are the changes needed?
The performance investigation was not thorough enough, and it's not clear whether the change is really beneficial or not.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Jenkins tests.

Closes #27514 from HyukjinKwon/revert-cache-PR.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-10 10:56:43 -08:00
Eric Wu b2011a295b [SPARK-30326][SQL] Raise exception if analyzer exceed max iterations
### What changes were proposed in this pull request?
Enhance the RuleExecutor strategy to take different actions when exceeding max iterations, and raise an exception if the analyzer exceeds max iterations.

### Why are the changes needed?
Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds max iterations. They should have different behaviors: the analyzer should raise an exception to indicate that the plan is not fixed after max iterations, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced.
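
For illustration, a minimal, self-contained sketch of the two strategies (a simplified model, not the actual RuleExecutor code):

```scala
// Run a rule to a fixed point; when the iteration budget is exhausted, either
// throw (analyzer-style) or warn and keep the current plan (optimizer-style).
def runToFixedPoint[A](plan: A, maxIterations: Int, errorOnExceed: Boolean)(rule: A => A): A = {
  var current = plan
  var iteration = 0
  while (iteration < maxIterations) {
    val next = rule(current)
    if (next == current) return current            // converged within the budget
    current = next
    iteration += 1
  }
  val msg = s"Max iterations ($maxIterations) reached without reaching a fixed point"
  if (errorOnExceed) throw new RuntimeException(msg)    // analyzer: fail loudly
  else { System.err.println(s"WARN: $msg"); current }   // optimizer: warn, keep current plan
}
```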

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add test in AnalysisSuite

Closes #26977 from Eric5553/EnhanceMaxIterations.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-10 23:41:39 +08:00
Terry Kim 70e545a94d [SPARK-30757][SQL][DOC] Update the doc on TableCatalog.alterTable's behavior
### What changes were proposed in this pull request?

This PR updates the documentation of `TableCatalog.alterTable`'s behavior regarding the order in which the requested changes are applied. It now explicitly mentions that the changes are applied in the order given.

### Why are the changes needed?

The current documentation on `TableCatalog.alterTable` doesn't mention the order in which the requested changes are applied. It is useful to explicitly document this behavior so that users know what to expect. For example, `REPLACE COLUMNS` needs to delete columns before adding new columns, and if the order is guaranteed by `alterTable`, it's much easier to work with the catalog API.
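
For illustration, a hedged sketch of a caller relying on this ordering (assuming the `TableChange` factory methods shown; the catalog, identifier and column names are placeholders):

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}
import org.apache.spark.sql.types.IntegerType

def replaceColumn(catalog: TableCatalog, ident: Identifier): Unit = {
  catalog.alterTable(
    ident,
    TableChange.deleteColumn(Array("old_col")),           // applied first
    TableChange.addColumn(Array("new_col"), IntegerType)  // then applied second
  )
}
```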

### Does this PR introduce any user-facing change?

Yes, document change.

### How was this patch tested?

Not added (doc changes).

Closes #27496 from imback82/catalog_table_alter_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-10 19:04:49 +08:00
Liang-Chi Hsieh 9f8172e96a Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project"
This reverts commit a0e63b61e7.

### What changes were proposed in this pull request?

This reverts the patch at #26978 based on gatorsmile's suggestion.

### Why are the changes needed?

Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-09 19:45:16 -08:00
Nicholas Chammas 339c0f9a62 [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options
### What changes were proposed in this pull request?

This PR adds a doc builder for Spark SQL's configuration options.

Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)):

![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png)

Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql):

![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png)

### Why are the changes needed?

There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql).

### Does this PR introduce any user-facing change?

No, apart from new documentation.

### How was this patch tested?

I tested this manually by building the docs and reviewing them in my browser.

Closes #27459 from nchammas/SPARK-30510-spark-sql-options.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-09 19:20:47 +09:00
Yuanjian Li 3db3e39f11 [SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause
### What changes were proposed in this pull request?
This is a follow-up for #25029: in this PR, we throw an AnalysisException when a name conflict is detected in a nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` must be set explicitly for the expected behavior.
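
For illustration, a hedged spark-shell example of the kind of name conflict that now fails:

```scala
// The inner CTE `t` shadows the outer one; with this change the query raises an
// AnalysisException unless spark.sql.legacy.ctePrecedence.enabled is set explicitly.
spark.sql("""
  WITH t AS (SELECT 1 AS c),
       t2 AS (
         WITH t AS (SELECT 2 AS c)
         SELECT * FROM t
       )
  SELECT * FROM t2
""").show()
```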

### Why are the changes needed?
The original change might be risky to end users, since it changes behavior silently.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.legacy.ctePrecedence.enabled` is changed to be optional and must be set explicitly for the old behavior.

### How was this patch tested?
New UT.

Closes #27454 from xuanyuanking/SPARK-28228-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-08 14:10:28 -08:00
Terry Kim a7451f44d2 [SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time
### What changes were proposed in this pull request?

The current ALTER COLUMN syntax allows to change multiple properties at a time:
```
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?
```
The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan)

This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time.

Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior.

### Why are the changes needed?

To follow SQL standard (and other recent SQL systems) behavior.

### Does this PR introduce any user-facing change?

Yes, now the user can update the column properties only one at a time.

For example,
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment'
```
should be broken into
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
```

### How was this patch tested?

Updated existing tests.

Closes #27444 from imback82/alter_column_one_at_a_time.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:47:44 +08:00
Maxim Gekk a3e77773cf [SPARK-30752][SQL] Fix to_utc_timestamp on daylight saving day
### What changes were proposed in this pull request?
- Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API
- Change the types of the `convertTz` parameters from `TimeZone` to `ZoneId`. This avoids unnecessary `TimeZone` -> `ZoneId` conversions and the performance regressions they would cause.

### Why are the changes needed?
- Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example:
```scala
scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show
+-------------------+
|          local UTC|
+-------------------+
|2019-11-03 03:00:00|
+-------------------+
```
but the result must be 2019-11-03 04:00:00:
<img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png">

- Simplifies the code and makes it more maintainable
- Switches `convertTz` to the Proleptic Gregorian calendar used by Java 8 time classes by default, which makes the function consistent with other date-time functions.

### Does this PR introduce any user-facing change?
Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`.

### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
- Added `convert time zones on a daylight saving day` to DateFunctionsSuite

Closes #27474 from MaxGekk/port-convertTz-on-Java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:32:07 +08:00
Wenchen Fan 5a4c70b4e2 [SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26656.

We don't support window aggregate functions with a filter predicate yet, so we should fail explicitly.
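
For illustration, a hedged spark-shell example of a query that should now fail explicitly instead of silently ignoring the FILTER clause (table and column names are placeholders):

```scala
spark.sql(
  "SELECT count(a) FILTER (WHERE b > 0) OVER (PARTITION BY c) FROM t"
).show()
// expected: AnalysisException, because window aggregates with FILTER are unsupported
```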

Observable metrics have the same issue. This PR fixes it as well.

### Why are the changes needed?

If we simply ignore filter predicate when we don't support it, the result is wrong.

### Does this PR introduce any user-facing change?

Yes, it fixes the query result.

### How was this patch tested?

new tests

Closes #27476 from cloud-fan/filter.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-06 13:33:39 -08:00
Wenchen Fan 8ce58627eb [SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply
### What changes were proposed in this pull request?

Update `InsertAdaptiveSparkPlan` to not log a warning if AQE is skipped intentionally.

This PR also adds a config to not skip AQE.

### Why are the changes needed?

It's not a warning at all if we intentionally skip AQE.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Ran `AdaptiveQueryExecSuite` locally and verified that there are no warning logs.

Closes #27452 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-06 09:16:14 -08:00
Terry Kim c27a616450 [SPARK-30612][SQL] Resolve qualified column name with v2 tables
### What changes were proposed in this pull request?

This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.

This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```

### Why are the changes needed?

This is a bug because you cannot qualify column names in queries.

### Does this PR introduce any user-facing change?

Yes, now users can qualify column names for v2 tables.

### How was this patch tested?

Added new tests.

Closes #27391 from imback82/qualified_col.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 13:54:17 +08:00
Yuanjian Li 4938905a1c [SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message
### What changes were proposed in this pull request?
Follow-up work for SPARK-29864: reference the config `spark.sql.legacy.fromDayTimeString.enabled` in the error message.

### Why are the changes needed?
For better usability.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27464 from xuanyuanking/SPARK-29864-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 11:19:42 -08:00
turbofei 6d507b4a31 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer
### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647, and it does not throw an `ArithmeticException`.

The root cause is that the code below does not work for some corner cases.
94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)

For example:

![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)

In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.
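
For illustration, a minimal sketch of the corner case with illustrative values (not the actual Cast implementation):

```scala
// 2147483648 does not fit in an Int, but rounding it through Float makes naive
// bounds checks pass; a floor/ceil-based check catches it.
val x: Float = 2147483648L.toFloat                       // 2.14748365E9
val fitsInInt = Math.floor(x) <= Int.MaxValue && Math.ceil(x) >= Int.MinValue
// fitsInInt is false, so with spark.sql.ansi.enabled=true the cast should throw
// an ArithmeticException instead of silently returning 2147483647.
```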

### Why are the changes needed?
The result is incorrect.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Added Unit test.

Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 21:24:02 +08:00
Maxim Gekk 459e757ed4 [SPARK-30668][SQL] Support SimpleDateFormat patterns in parsing timestamps/dates strings
### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0181, and provide a legacy parser based on `FastDateFormat` which is compatible with `SimpleDateFormat`.

To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.
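
For illustration, a hedged spark-shell snippet of enabling the legacy parser (the pattern and input are placeholders):

```scala
// Switch to the SimpleDateFormat-compatible legacy parser so that pre-3.0
// pattern semantics are used when parsing timestamp/date strings.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
spark.sql("SELECT to_timestamp('2020-02-05 12:00:00', 'yyyy-MM-dd HH:mm:ss')").show()
```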

### Why are the changes needed?
To allow users to restore the old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring it is that `DateTimeFormatter` patterns are not fully compatible with `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`.

Closes #27441 from MaxGekk/support-simpledateformat.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 18:48:45 +08:00
HyukjinKwon 692e3ddb4e [SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size
### What changes were proposed in this pull request?

This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given:
-  `spark.sql.execution.pandas.arrowSafeTypeConversion`
- `spark.sql.execution.pandas.respectSessionTimeZone`
- `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName`
- other configurations like `spark.sql.execution.arrow.*`.

### Why are the changes needed?

To make configuration names consistent.

### Does this PR introduce any user-facing change?

No, because this configuration has not been released yet.

### How was this patch tested?

Existing tests should cover.

Closes #27450 from HyukjinKwon/SPARK-27870-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-05 11:38:33 +09:00
Dongjoon Hyun 898716980d Revert "[SPARK-28310][SQL] Support (FIRST_VALUE|LAST_VALUE)(expr[ (IGNORE|RESPECT) NULLS]?) syntax"
### What changes were proposed in this pull request?

This reverts commit b89c3de1a4.

### Why are the changes needed?

`FIRST_VALUE` is used only for window expression. Please see the discussion on https://github.com/apache/spark/pull/25082 .

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Pass the Jenkins.

Closes #27458 from dongjoon-hyun/SPARK-28310.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:26:46 -08:00
Liang-Chi Hsieh 7631275f97 [SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser
### What changes were proposed in this pull request?

This is a follow-up for #22787, where we disallowed empty strings in the JSON parser except for string and binary types. This follow-up adds a legacy config for restoring the previous behavior of allowing empty strings.

### Why are the changes needed?

Adding a legacy config to make migration easy for Spark users.

### Does this PR introduce any user-facing change?

Yes. If this legacy config is set to true, users can restore the previous behavior prior to Spark 3.0.0.

### How was this patch tested?

Unit test.

Closes #27456 from viirya/SPARK-25040-followup.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:22:23 -08:00
Maxim Gekk f2dd082544 [SPARK-30725][SQL] Make legacy SQL configs as internal configs
### What changes were proposed in this pull request?
All legacy SQL configs are marked as internal configs. In particular, the following configs are updated to be internal:
- spark.sql.legacy.sizeOfNull
- spark.sql.legacy.replaceDatabricksSparkAvro.enabled
- spark.sql.legacy.typeCoercion.datetimeToString.enabled
- spark.sql.legacy.looseUpcast
- spark.sql.legacy.arrayExistsFollowsThreeValuedLogic

### Why are the changes needed?
In the general case, users shouldn't change legacy configs, so they can be marked as internal.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Should be tested by jenkins build and run tests.

Closes #27448 from MaxGekk/legacy-internal-sql-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 21:17:05 +08:00
Yuanjian Li a4912cee61 [SPARK-29543][SS][FOLLOWUP] Move spark.sql.streaming.ui.* configs to StaticSQLConf
### What changes were proposed in this pull request?
Put the configs below needed by Structured Streaming UI into StaticSQLConf:

- spark.sql.streaming.ui.enabled
- spark.sql.streaming.ui.retainedProgressUpdates
- spark.sql.streaming.ui.retainedQueries

### Why are the changes needed?
Make all SS UI configs consistent with other similar configs in usage and naming.

### Does this PR introduce any user-facing change?
Yes, it adds a new static config `spark.sql.streaming.ui.retainedProgressUpdates`.

### How was this patch tested?
Existing UT.

Closes #27425 from xuanyuanking/SPARK-29543-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-02-02 23:37:13 -08:00
Burak Yavuz 2eccfd8a73 [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
### What changes were proposed in this pull request?

Adds handling of NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method of SessionCatalog.

### Why are the changes needed?

This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added test to DataSourceV2SessionCatalogSuite

Closes #27423 from brkyvz/isViewF.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-03 14:08:59 +08:00
Liang-Chi Hsieh 8eecc20b11 [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
## What changes were proposed in this pull request?

This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table.

For the original `SHOW CREATE TABLE`, it now always shows Spark DDL. If given a Hive table, it tries to generate the equivalent Spark DDL.
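
For illustration, a hedged spark-shell usage of the command (the table name is a placeholder):

```scala
spark.sql("SHOW CREATE TABLE hive_tbl AS SERDE").show(truncate = false) // Hive DDL
spark.sql("SHOW CREATE TABLE hive_tbl").show(truncate = false)          // Spark DDL
```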

For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception for the unsupported serde configuration.

It is arguable that some Hive file format + row serde combinations might be mapped to a Spark data source, e.g., CSV. That is not included in this PR; to be conservative, it may not be supported.

For Hive serde properties, this doesn't save them to Spark DDL for now, because it may not be useful to keep Hive serde properties in a Spark table.

## How was this patch tested?

Added test.

Closes #24938 from viirya/SPARK-27946.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 19:55:25 -08:00
yi.wu 82b4f753a0 [SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource
### What changes were proposed in this pull request?

This PR adds a `SparkSession.executeCommand` API for external data sources to execute an arbitrary command, e.g.

```
val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions")
```
Note that the command doesn't execute in Spark, but inside an external execution engine, depending on the data source. It is eagerly executed once `executeCommand` is called, and the returned `DataFrame` will contain the output of the command (if any).

### Why are the changes needed?

This can be useful when a user wants to execute commands outside of Spark, for example executing a custom DDL/DML command for JDBC, creating an index for ElasticSearch, creating cores for Solr, and so on (as HyukjinKwon suggested).

Previously, user needs to use an option to achieve the goal, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it can be more convenient for user to achieve the same goal.

### Does this PR introduce any user-facing change?

Yes, new API from `SparkSession` and a new interface `ExternalCommandRunnableProvider`.

### How was this patch tested?

Added a new test suite.

Closes #27199 from Ngone51/dev-executeCommand.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 15:05:26 -08:00
Maxim Gekk 2d4b5eaee4 [SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
### What changes were proposed in this pull request?
- Replace `new Integer(0)` by a serializable instance in RDD.scala
- Use `.valueOf()` instead of the constructors of `java.lang.Integer` and `java.lang.Double`, because the constructors have been deprecated (see the sketch below and https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html)
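
For illustration, a small snippet of the deprecated pattern versus the preferred factory methods (not the exact code changed in this PR):

```scala
val oldInt: java.lang.Integer = new java.lang.Integer(0)        // deprecation warning
val newInt: java.lang.Integer = java.lang.Integer.valueOf(0)    // may reuse cached instances
val oldDouble: java.lang.Double = new java.lang.Double(0.0)     // deprecation warning
val newDouble: java.lang.Double = java.lang.Double.valueOf(0.0)
```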

### Why are the changes needed?
This fixes the following warnings:
1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.
5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite

Closes #27399 from MaxGekk/eliminate-warning-part4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 15:03:16 -06:00
yi.wu 5ccbb38a71 [SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand
All credit to Ngone51, Closes #27293.
### What changes were proposed in this pull request?
This PR improves `AlterTableAddPartitionCommand` by:
1. adds an internal config for the partition batch size to avoid hard-coding it
2. reuses `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing and improve code reuse

### Why are the changes needed?
Improve code quality.

### Does this PR introduce any user-facing change?
Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change.

### How was this patch tested?
Pass Jenkins.

Closes #27413 from xuanyuanking/SPARK-29938.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-01 01:03:00 +08:00
Burak Yavuz 290a528bff [SPARK-30615][SQL] Introduce Analyzer rule for V2 AlterTable column change resolution
### What changes were proposed in this pull request?

Adds an Analyzer rule to normalize the column names used in V2 AlterTable table changes. We need to handle all ColumnChange operations, and we add an extra match statement to future-proof against new changes that may be added. This saves downstream consumers (e.g. catalogs) from having to deal with case sensitivity, check that columns exist, etc.

We also fix the behavior of ALTER TABLE CHANGE COLUMN (Hive-style syntax) for adding comments to complex data types. Currently, the data type needs to be provided as part of the Hive-style syntax. This assumes that the data type has changed when it may not have and the user only wants to add a comment, which fails in CheckAnalysis.
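
For illustration, a hedged sketch of the Hive-style syntax in question (names and types are placeholders):

```scala
// The column's data type has to be restated even when only a comment is being added.
spark.sql(
  "ALTER TABLE t CHANGE COLUMN point point STRUCT<x: INT, y: INT> COMMENT 'coordinates'"
)
```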

### Why are the changes needed?

Currently we do not handle case sensitivity correctly for ALTER TABLE ALTER COLUMN operations.

### Does this PR introduce any user-facing change?

No, fixes a bug.

### How was this patch tested?

Introduced v2CommandsCaseSensitivitySuite and added a test around HiveStyle Change columns to PlanResolutionSuite

Closes #27350 from brkyvz/normalizeAlter.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 16:41:10 +08:00
Dongjoon Hyun 05be81d69e [SPARK-30192][SQL][FOLLOWUP] Rename SINGLETON to INSTANCE
### What changes were proposed in this pull request?

This PR renames a variable `SINGLETON` to `INSTANCE`.

### Why are the changes needed?

This is a minor change for consistency with the other parts.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the existing tests.

Closes #27409 from dongjoon-hyun/SPARK-30192.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 22:51:51 -08:00
Burak Yavuz 1cd19ad92d [SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming
### What changes were proposed in this pull request?

We propose to add two new interfaces, `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch, or impose its own limit if the Trigger is Trigger.Once(), for example.

We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream.

### Why are the changes needed?

Sources currently have no information around execution semantics such as whether the stream is being executed in Trigger.Once() mode. This interface will pass this information into the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch without this interface.

### Does this PR introduce any user-facing change?

DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once().

### How was this patch tested?

Existing tests, as this API is mostly internal

Closes #27380 from brkyvz/rateLimit.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-30 22:02:48 -08:00
Wenchen Fan 9f42be25eb [SPARK-29665][SQL] refine the TableProvider interface
### What changes were proposed in this pull request?

Instead of having several overloads of the `getTable` method in `TableProvider`, it's better to have two explicit methods, `inferSchema` and `inferPartitioning`, along with a single `getTable` method that takes everything: schema, partitioning, and properties.

This PR also adds a `supportsExternalMetadata` method in `TableProvider`, to indicate whether the source supports external table metadata. If this flag is false:
1. spark.read.schema... is disallowed and fails
2. when we support creating v2 tables in the session catalog, Spark only keeps table properties in the catalog.

### Why are the changes needed?

API improvement.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26868 from cloud-fan/provider2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 13:37:43 +08:00
Wenchen Fan e5f572af06 [SPARK-30680][SQL] ResolvedNamespace does not require a namespace catalog
### What changes were proposed in this pull request?

Update `ResolvedNamespace` to accept the catalog as a `CatalogPlugin`, not a `SupportsNamespaces`.

This is extracted from https://github.com/apache/spark/pull/27345

### Why are the changes needed?

Not all commands that need to resolve namespaces require a namespace catalog. For example, `SHOW TABLE` is implemented by `TableCatalog.listTables` and has nothing to do with `SupportsNamespaces`.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27403 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:34:59 -08:00
Wenchen Fan 7503e76af0 [SPARK-30622][SQL] commands should return dummy statistics
### What changes were proposed in this pull request?

Override `Command.stats` to return dummy statistics (Long.Max).

### Why are the changes needed?

Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame is created. That said, the statistics of a command are useless. We should avoid unnecessary statistics calculation for a command's children.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #27344 from cloud-fan/command.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:27:35 -08:00
Kazuaki Ishizaki b0db6231fd [SPARK-29020][FOLLOWUP][SQL] Update description of array_sort function
### What changes were proposed in this pull request?

This PR is a follow-up of #25728, which introduced additional arguments to determine the sort order; thus, this function no longer sorts only in ascending order. However, the description was not updated.
This PR updates the description to match the latest behavior.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests since this PR just updates description text.

Closes #27404 from kiszk/SPARK-29020-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 09:41:32 -08:00
uncleGen 7173786153 [SPARK-29543][SS][UI] Structured Streaming Web UI
### What changes were proposed in this pull request?

This PR adds two pages to Web UI for Structured Streaming:
   - "/streamingquery": Streaming Query Page, providing some aggregate information for running/completed streaming queries.
  - "/streamingquery/statistics": Streaming Query Statistics Page, providing detailed information for streaming query, including `Input Rate`, `Process Rate`, `Input Rows`, `Batch Duration` and `Operation Duration`

![Screen Shot 2020-01-29 at 1 38 00 PM](https://user-images.githubusercontent.com/1000778/73399837-cd01cc80-429c-11ea-9d4b-1d200a41b8d5.png)
![Screen Shot 2020-01-29 at 1 39 16 PM](https://user-images.githubusercontent.com/1000778/73399838-cd01cc80-429c-11ea-8185-4e56db6866bd.png)

### Why are the changes needed?

It helps users to better monitor Structured Streaming query.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- new added and existing UTs
- manual test

Closes #26201 from uncleGen/SPARK-29543.

Lead-authored-by: uncleGen <hustyugm@gmail.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Genmao Yu <hustyugm@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-01-29 13:43:51 -08:00
Takeshi Yamamuro ec1fb6b4e1 [SPARK-30234][SQL][FOLLOWUP] Add .enabled in the suffix of the ADD FILE legacy option
### What changes were proposed in this pull request?

This PR intends to rename `spark.sql.legacy.addDirectory.recursive` to `spark.sql.legacy.addDirectory.recursive.enabled`.

### Why are the changes needed?

For consistent option names.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27372 from maropu/SPARK-30234-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-29 12:23:59 +09:00
Dongjoon Hyun 580c2b7e34 [SPARK-27166][SQL][FOLLOWUP] Refactor to build string once
### What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/24098 to refactor to build string once according to the [review comment](https://github.com/apache/spark/pull/24098#discussion_r369845234)

### Why are the changes needed?

Previously, we chose the minimal-change approach.
In this PR, we choose a more robust approach than the previous post-step string processing.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The test case is extended with more cases.

Closes #27353 from dongjoon-hyun/SPARK-27166-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-28 12:48:16 -08:00
Maxim Gekk 8aebc80e0e [SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?
In the PR, I propose to transform the `Like` expression to `TernaryExpression`, and add third parameter `escape`. So, the `like` function will have feature parity with `LIKE ... ESCAPE` syntax supported by 187f3c1773.

### Why are the changes needed?
The `like` functions can be called with 2 or 3 parameters, and functionally equivalent to `LIKE` and `LIKE ... ESCAPE` SQL expressions.

### Does this PR introduce any user-facing change?
Yes, before `like` fails with the exception:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7
```
After:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
true
```

### How was this patch tested?
- Add new example for the `like` function which is checked by `SQLQuerySuite`
- Run `RegexpExpressionsSuite` and `ExpressionParserSuite`.

Closes #27355 from MaxGekk/like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 11:19:32 -08:00
Patrick Cording c5c580ba0d [SPARK-30633][SQL] Append L to seed when type is LongType
### What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

### Why are the changes needed?

Codegen fails when passing a seed to xxHash that is > 2^31.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests pass. Should more be added?

Closes #27354 from patrickcording/fix_xxhash_seed_bug.

Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:32:15 -08:00
Jungtaek Lim (HeartSaVioR) 0436b3d3f8 [SPARK-30653][INFRA][SQL] EOL character enforcement for java/scala/xml/py/R files
### What changes were proposed in this pull request?

This patch converts CR/LF into LF in 3 source files, as most files only use LF. This patch also adds rules to enforce LF as the EOL character for all java, scala, xml, py, and R files.

### Why are the changes needed?

The majority of source code files use LF, and only three files use CR/LF. While using an IDE lets us not worry about the difference, there is still a chance of unnecessary diffs if a file is modified with an editor that doesn't handle it automatically.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

```
grep -IUrl --color "^M" . | grep "\.java\|\.scala\|\.xml\|\.py\|\.R" | grep -v "/target/" | grep -v "/build/" | grep -v "/dist/" | grep -v "dependency-reduced-pom.xml" | grep -v ".pyc"
```

(Please note you'll need to type CTRL+V -> CTRL+M in bash shell to get `^M` because it's representing CR/LF, not a combination of `^` and `M`.)

Before the patch, the result is:

```
./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java
./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala
```

and after the patch, the result is None.

And git shows a WARNING message if the EOL of any source file of the given types is modified to CR/LF, like below:

```
warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala.
The file will have its original line endings in your working directory.
```

Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:20:51 -08:00
Yuchen Huo d0800fc8e2 [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
### What changes were proposed in this pull request?

Add identifier and catalog information to DataSourceV2Relation so that it is possible to do richer checks in the checkAnalysis step.

### Why are the changes needed?

In data source v2, table implementations are all customized, so we may not be able to get the resolved identifier from the tables themselves. Therefore we encode the table and catalog information in DSV2Relation so that no external changes are needed to make sure this information is available.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests in the following suites:
CatalogManagerSuite.scala
CatalogV2UtilSuite.scala
SupportsCatalogOptionsSuite.scala
PlanResolutionSuite.scala

Closes #26957 from yuchenhuo/SPARK-30314.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-26 12:59:24 -08:00
Xiao Li d69ed9afdf Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit 1d20d13149.

Closes #27351 from gatorsmile/revertSPARK25496.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:34:12 -08:00
Liang-Chi Hsieh a0e63b61e7 [SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project
### What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

### Why are the changes needed?

In the Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate if there is no Project on top of it. We should prune them too.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #26978 from viirya/SPARK-29721.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-24 22:17:28 -08:00