ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Maxim Gekk	c1986204e5	[SPARK-30788][SQL] Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters ### What changes were proposed in this pull request? In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`: - `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see https://github.com/apache/spark/pull/26507 & https://github.com/apache/spark/pull/26582 - `LegacySimpleTimestampFormatter` uses `SimpleDateFormat`, and support the `lenient` mode. When the `lenient` parameter is set to `false`, the parser become much stronger in checking its input. ### Why are the changes needed? Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings: - `DateTimeFormat` in CSV/JSON datasource - `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing. - `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions. The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`. ### Does this PR introduce any user-facing change? This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4. ### How was this patch tested? - Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`. - Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`. Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback. 
Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-12 20:12:38 +08:00
beliefer	f5026b1ba7	[SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract ### What changes were proposed in this pull request? The current implementation of `regexp_extract` throws an unhandled exception, shown below: `SELECT regexp_extract('1a 2b 14m', 'd+')` ``` java.lang.IndexOutOfBoundsException: No group 1 [info] at java.util.regex.Matcher.group(Matcher.java:538) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) ``` I think we should handle this exception properly. ### Why are the changes needed? To fix the bug `java.lang.IndexOutOfBoundsException No group 1`. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? New UT Closes #27508 from beliefer/fix-regexp_extract-bug. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-12 14:49:22 +08:00
Kris Mok	b4769998ef	[SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s() ### What changes were proposed in this pull request? This PR proposes to make the `code` string interpolator treat escapes the same way as Scala's builtin `StringContext.s()` string interpolator. This will remove the need for an ugly workaround in `Like` expression's codegen. ### Why are the changes needed? The `code()` string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin `StringContext.s()` interpolator, i.e. it should treat escapes in the code parts, and should not treat escapes in the input arguments. For example, ```scala val arg = "This is an argument." val str = s"This is string part 1. $arg This is string part 2." val code = code"This is string part 1. $arg This is string part 2." assert(code.toString == str) ``` We should expect the `code()` interpolator to produce the same result as the `StringContext.s()` interpolator, where only escapes in the string parts should be treated, while the args should be kept verbatim. But in the current implementation, due to the eager folding of code parts and literal input args, the escape treatment is incorrectly done on both code parts and literal args. That causes a problem when an arg contains escape sequences and wants to preserve that in the final produced code string. For example, in `Like` expression's codegen, there's an ugly workaround for this bug: ```scala // We need double escape to avoid org.codehaus.commons.compiler.CompileException. // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'. // '\"' will cause exception 'Line break in literal not allowed'. val newEscapeChar = if (escapeChar == '\"' \|\| escapeChar == '\\') { s"""\\\\\\$escapeChar""" } else { escapeChar } ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a new unit test case in `CodeBlockSuite`. 
Closes #27544 from rednaxelafx/fix-code-string-interpolator. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 15:19:16 +09:00
herman	b25359cca3	[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions ### What changes were proposed in this pull request? This is a small follow-up for https://github.com/apache/spark/pull/27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions. ### Why are the changes needed? It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to `SparkPlanSuite`. Closes #27530 from hvanhovell/SPARK-30780. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 10:48:29 +09:00
Maxim Gekk	45db48e2d2	Revert "[SPARK-30625][SQL] Support `escape` as third parameter of the `like` function ### What changes were proposed in this pull request? In the PR, I propose to revert the commit `8aebc80e0e`. ### Why are the changes needed? See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites. Closes #27531 from MaxGekk/revert-like-3-args. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-11 10:15:34 -08:00
HyukjinKwon	99bd59fe29	[SPARK-29462][SQL][DOCS] Add some more context and details in 'spark.sql.defaultUrlStreamHandlerFactory.enabled' documentation ### What changes were proposed in this pull request? This PR adds some more information and context to `spark.sql.defaultUrlStreamHandlerFactory.enabled`. ### Why are the changes needed? It is a bit difficult to understand the documentation of `spark.sql.defaultUrlStreamHandlerFactory.enabled`. ### Does this PR introduce any user-facing change? Nope, internal doc only fix. ### How was this patch tested? Nope. I only tested linter. Closes #27541 from HyukjinKwon/SPARK-29462-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-11 09:55:02 -08:00
Maxim Gekk	dc66d57e98	[SPARK-30754][SQL] Reuse results of floorDiv in calculations of floorMod in DateTimeUtils ### What changes were proposed in this pull request? In the case of back-to-back calculation of `floorDiv` and `floorMod` with the same arguments, the result of `floorDiv` can be reused in the calculation of `floorMod`. The `floorMod` method is defined as follows in the Java standard library: ```java public static int floorMod(int x, int y) { int r = x - floorDiv(x, y) * y; return r; } ``` If `floorDiv(x, y)` has already been calculated, it can be reused in `x - floorDiv(x, y) * y`. I propose to modify 2 places in `DateTimeUtils`: 1. `microsToInstant`, which is widely used in many date-time functions. `Math.floorMod(us, MICROS_PER_SECOND)` is just replaced by its definition from the Java Math library. 2. `truncDate`: `Math.floorMod(oldYear, divider) == 0` is replaced by `Math.floorDiv(oldYear, divider) * divider == oldYear` where `floorDiv(...) * divider` is pre-calculated. ### Why are the changes needed? This reduces the number of arithmetic operations, and can slightly improve performance of date-time functions. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #27491 from MaxGekk/opt-microsToInstant. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-02-11 09:07:40 -06:00
fuwhu	f1d0dce484	[MINOR][DOC] Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions ### What changes were proposed in this pull request? Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions. ### Why are the changes needed? To describe these two classes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? no Closes #27535 from fuwhu/SPARK-15616-FOLLOW-UP. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-11 22:16:44 +08:00
HyukjinKwon	0045be766b	[SPARK-29462][SQL] The data type of "array()" should be array<null> ### What changes were proposed in this pull request? This brings https://github.com/apache/spark/pull/26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-11 17:22:08 +09:00
Shixiong Zhu	e2ebca733c	[SPARK-30779][SS] Fix some API issues found when reviewing Structured Streaming API docs ### What changes were proposed in this pull request? - Fix the scope of `Logging.initializeForcefully` so that it doesn't appear in subclasses' public methods. Right now, `sc.initializeForcefully(false, false)` is allowed to called. - Don't show classes under `org.apache.spark.internal` package in API docs. - Add missing `since` annotation. - Fix the scope of `ArrowUtils` to remove it from the API docs. ### Why are the changes needed? Avoid leaking APIs unintentionally in Spark 3.0.0. ### Does this PR introduce any user-facing change? No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0. ### How was this patch tested? Manually generated the API docs and verified the above issues have been fixed. Closes #27528 from zsxwing/audit-ss-apis. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-10 14:26:14 -08:00
Yuanjian Li	a6b91d2bf7	[SPARK-30556][SQL][FOLLOWUP] Reset the status changed in SQLExecution.withThreadLocalCaptured ### What changes were proposed in this pull request? Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured. ### Why are the changes needed? For code safety. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #27516 from xuanyuanking/SPARK-30556-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: herman <herman@databricks.com>	2020-02-10 22:16:25 +01:00
Maxim Gekk	3c1c9b48fc	[SPARK-30759][SQL] Initialize cache for foldable patterns in StringRegexExpression ### What changes were proposed in this pull request? In the PR, I propose to fix `cache` initialization in `StringRegexExpression` by changing `case Literal(value: String, StringType)` to `case p: Expression if p.foldable` ### Why are the changes needed? Actually, the case doesn't work at all because of: 1. Literals value has type `UTF8String` 2. It doesn't work for foldable expressions like in the example: ```sql SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'; ``` <img width="649" alt="Screen Shot 2020-02-08 at 22 45 50" src="https://user-images.githubusercontent.com/1580697/74091681-0d4a2180-4acb-11ea-8a0d-7e8c65f4214e.png"> ### Does this PR introduce any user-facing change? No ### How was this patch tested? By the `check outputs of expression examples` test from `SQLQuerySuite`. Closes #27502 from MaxGekk/str-regexp-foldable-pattern. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-10 12:51:37 -08:00
HyukjinKwon	4439b29bd2	Revert "[SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static" ### What changes were proposed in this pull request? This reverts commit `8ce7962931`. There are variable name conflicts with `8aebc80e0e (diff-39298b470865a4cbc67398a4ea11e767)`. This can be cleanly ported back to branch-3.0. ### Why are the changes needed? The performance investigation was insufficient, and it is not clear whether the change is really beneficial. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins tests. Closes #27514 from HyukjinKwon/revert-cache-PR. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-10 10:56:43 -08:00
Liang-Chi Hsieh	acfdb46a60	[SPARK-27946][SQL][FOLLOW-UP] Change doc and error message for SHOW CREATE TABLE ### What changes were proposed in this pull request? This is a follow-up for #24938 to tweak error message and migration doc. ### Why are the changes needed? Making user know workaround if SHOW CREATE TABLE doesn't work for some Hive tables. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #27505 from viirya/SPARK-27946-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2020-02-10 10:45:00 -08:00
Eric Wu	b2011a295b	[SPARK-30326][SQL] Raise exception if analyzer exceed max iterations ### What changes were proposed in this pull request? Enhance the RuleExecutor strategy to take different actions when exceeding max iterations, and raise an exception if the analyzer exceeds max iterations. ### Why are the changes needed? Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds max iterations. They should behave differently: the analyzer should raise an exception to indicate that the plan is not fixed after max iterations, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in AnalysisSuite. Closes #26977 from Eric5553/EnhanceMaxIterations. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-10 23:41:39 +08:00
jiake	5a240603fd	[SPARK-30719][SQL] Add unit test to verify the log warning print when intentionally skip AQE ### What changes were proposed in this pull request? This is a follow-up to [#27452](https://github.com/apache/spark/pull/27452). Add a unit test to verify whether the warning is logged when AQE is intentionally skipped. ### Why are the changes needed? To add a unit test. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By adding a unit test. Closes #27515 from JkSelf/aqeLoggingWarningTest. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-10 21:48:00 +08:00
Terry Kim	70e545a94d	[SPARK-30757][SQL][DOC] Update the doc on TableCatalog.alterTable's behavior ### What changes were proposed in this pull request? This PR updates the documentation on `TableCatalog.alterTable`s behavior on the order by which the requested changes are applied. It now explicitly mentions that the changes are applied in the order given. ### Why are the changes needed? The current documentation on `TableCatalog.alterTable` doesn't mention which order the requested changes are applied. It will be useful to explicitly document this behavior so that the user can expect the behavior. For example, `REPLACE COLUMNS` needs to delete columns before adding new columns, and if the order is guaranteed by `alterTable`, it's much easier to work with the catalog API. ### Does this PR introduce any user-facing change? Yes, document change. ### How was this patch tested? Not added (doc changes). Closes #27496 from imback82/catalog_table_alter_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-10 19:04:49 +08:00
Kent Yao	58b9ca1e6f	[SPARK-30592][SQL][FOLLOWUP] Add some round-trip test cases ### What changes were proposed in this pull request? Add round-trip tests for CSV and JSON functions as https://github.com/apache/spark/pull/27317#discussion_r376745135 asked. ### Why are the changes needed? improve test coverage ### Does this PR introduce any user-facing change? no ### How was this patch tested? add uts Closes #27510 from yaooqinn/SPARK-30592-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-10 16:23:44 +09:00
Liang-Chi Hsieh	9f8172e96a	Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project This reverts commit `a0e63b61e7`. ### What changes were proposed in this pull request? This reverts the patch at #26978 based on gatorsmile's suggestion. ### Why are the changes needed? Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27504 from viirya/revert-SPARK-29721. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-09 19:45:16 -08:00
Gengliang Wang	b877aac146	[SPARK-30684 ][WEBUI][FollowUp] A new approach for SPARK-30684 ### What changes were proposed in this pull request? Simplify the changes for adding metrics description for WholeStageCodegen in https://github.com/apache/spark/pull/27405 ### Why are the changes needed? In https://github.com/apache/spark/pull/27405, the UI changes can be made without using the function `adjustPositionOfOperationName` to adjust the position of operation name and mark as an operation-name class. I suggest we make simpler changes so that it would be easier for future development. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test with the queries provided in https://github.com/apache/spark/pull/27405 ``` sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output") sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output") ``` ![image](https://user-images.githubusercontent.com/1097932/74073629-e3f09f00-49bf-11ea-90dc-1edb5ca29e5e.png) Closes #27490 from gengliangwang/wholeCodegenUI. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-02-09 14:18:51 -08:00
Nicholas Chammas	339c0f9a62	[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options ### What changes were proposed in this pull request? This PR adds a doc builder for Spark SQL's configuration options. Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)): ![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png) Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql): ![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png) ### Why are the changes needed? There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql). ### Does this PR introduce any user-facing change? No, apart from new documentation. ### How was this patch tested? I tested this manually by building the docs and reviewing them in my browser. Closes #27459 from nchammas/SPARK-30510-spark-sql-options. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-09 19:20:47 +09:00
Yuanjian Li	3db3e39f11	[SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause ### What changes were proposed in this pull request? This is a follow-up for #25029. In this PR, we throw an AnalysisException when a name conflict is detected in a nested WITH clause. This way, the config `spark.sql.legacy.ctePrecedence.enabled` must be set explicitly to get the expected behavior. ### Why are the changes needed? The original change might be risky for end-users, because it changes behavior silently. ### Does this PR introduce any user-facing change? Yes, the config `spark.sql.legacy.ctePrecedence.enabled` is changed to be optional. ### How was this patch tested? New UT. Closes #27454 from xuanyuanking/SPARK-28228-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-08 14:10:28 -08:00
Terry Kim	a7451f44d2	[SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time ### What changes were proposed in this pull request? The current ALTER COLUMN syntax allows to change multiple properties at a time: ``` ALTER TABLE table=multipartIdentifier (ALTER \| CHANGE) COLUMN? column=multipartIdentifier (TYPE dataType)? (COMMENT comment=STRING)? colPosition? ``` The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan) This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time. Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior. ### Why are the changes needed? To follow SQL standard (and other recent SQL systems) behavior. ### Does this PR introduce any user-facing change? Yes, now the user can update the column properties only one at a time. For example, ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment' ``` should be broken into ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment' ``` ### How was this patch tested? Updated existing tests. Closes #27444 from imback82/alter_column_one_at_a_time. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:47:44 +08:00
Maxim Gekk	a3e77773cf	[SPARK-30752][SQL] Fix `to_utc_timestamp` on daylight saving day ### What changes were proposed in this pull request? - Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API - Change types of `convertTz` parameters from `TimeZone` to `ZoneId`. This allows to avoid unnecessary conversions `TimeZone` -> `ZoneId` and performance regressions as a consequence. ### Why are the changes needed? - Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example: ```scala scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show +-------------------+ \| local UTC\| +-------------------+ \|2019-11-03 03:00:00\| +-------------------+ ``` but the result must be 2019-11-03 04:00:00: <img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png"> - Simplifies the code, and make it more maintainable - Switches `convertTz` on Proleptic Gregorian calendar used by Java 8 time classes by default. That makes the function consistent to other date-time functions. ### Does this PR introduce any user-facing change? Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`. ### How was this patch tested? - By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. - Added `convert time zones on a daylight saving day` to DateFunctionsSuite Closes #27474 from MaxGekk/port-convertTz-on-Java8-api. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:32:07 +08:00
Wenchen Fan	5a4c70b4e2	[SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26656. We don't support window aggregate function with filter predicate yet and we should fail explicitly. Observable metrics has the same issue. This PR fixes it as well. ### Why are the changes needed? If we simply ignore filter predicate when we don't support it, the result is wrong. ### Does this PR introduce any user-facing change? yea, fix the query result. ### How was this patch tested? new tests Closes #27476 from cloud-fan/filter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-06 13:33:39 -08:00
Wenchen Fan	8ce58627eb	[SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply ### What changes were proposed in this pull request? Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally. This PR also add a config to not skip AQE. ### Why are the changes needed? It's not a warning at all if we intentionally skip AQE. ### Does this PR introduce any user-facing change? no ### How was this patch tested? run `AdaptiveQueryExecSuite` locally and verify that there is no warning logs. Closes #27452 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-06 09:16:14 -08:00
yi.wu	368ee62a5d	[SPARK-27297][DOC][FOLLOW-UP] Improve documentation for various Scala functions ### What changes were proposed in this pull request? Add examples and parameter description for these Scala functions: * transform * exists * forall * aggregate * zip_with * transform_keys * transform_values * map_filter * map_zip_with ### Why are the changes needed? Better documentation for UX. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27449 from Ngone51/doc-funcs. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:34:29 +08:00
yi.wu	3f5b23340e	[SPARK-30744][SQL] Optimize AnalyzePartitionCommand by calculating location sizes in parallel ### What changes were proposed in this pull request? Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel. ### Why are the changes needed? For better performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27471 from Ngone51/dev_calculate_in_parallel. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:20:44 +08:00
beliefer	c8ef1dee90	[SPARK-29108][SQL][TESTS][FOLLOWUP] Comment out no use test case and add 'insert into' statement of window.sql (Part 2) ### What changes were proposed in this pull request? When running the `window_part2.sql` tests, I found that the INSERT statements are missing, so the output is empty. I checked PostgreSQL for reference: https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql Although `window_part1.sql` and `window_part3.sql` contain the INSERT statements, I think they should also be added to `window_part2.sql`, because only one case references the table `empsalary` and it throws `AnalysisException`. ``` -- !query select last(salary) over(order by salary range between 1000 preceding and 1000 following), lag(salary) over(order by salary range between 1000 preceding and 1000 following), salary from empsalary -- !query schema struct<> -- !query output org.apache.spark.sql.AnalysisException Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1); ``` So we should do the following: 1. Comment out that one case and create a new ticket. 2. Add `INSERT INTO empsalary`. Note: `window_part4.sql` does not use the table `empsalary`. ### Why are the changes needed? Supplementary test data. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New test case Closes #27439 from beliefer/add-insert-to-window. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-06 15:24:26 +09:00
Terry Kim	c27a616450	[SPARK-30612][SQL] Resolve qualified column name with v2 tables ### What changes were proposed in this pull request? This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables. This PR would allow qualified column names in query as following: ```SQL SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT tbl.foo FROM testcat.ns1.ns2.tbl ``` ### Why are the changes needed? This is a bug because you cannot qualify column names in queries. ### Does this PR introduce any user-facing change? Yes, now users can qualify column names for v2 tables. ### How was this patch tested? Added new tests. Closes #27391 from imback82/qualified_col. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 13:54:17 +08:00
Wenchen Fan	3b26f807a0	[SPARK-30721][SQL][TESTS] Fix DataFrameAggregateSuite when enabling AQE ### What changes were proposed in this pull request? update `DataFrameAggregateSuite` to make it pass with AQE ### Why are the changes needed? We don't need to turn off AQE in `DataFrameAggregateSuite` ### Does this PR introduce any user-facing change? no ### How was this patch tested? run `DataFrameAggregateSuite` locally with AQE on. Closes #27451 from cloud-fan/aqe-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 12:36:51 -08:00
Yuanjian Li	4938905a1c	[SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message ### What changes were proposed in this pull request? Follow up work for SPARK-29864, reference the config `spark.sql.legacy.fromDayTimeString.enabled` in error message. ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27464 from xuanyuanking/SPARK-29864-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 11:19:42 -08:00
turbofei	6d507b4a31	[SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer ### What changes were proposed in this pull request? When `spark.sql.ansi.enabled` is true, the statement ``` select cast(cast(2147483648 as Float) as Integer) //result is 2147483647 ``` returns 2147483647 instead of throwing `ArithmeticException`. The root cause is that the code below does not work for some corner cases. `94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)` For example: ![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png) In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly. ### Why are the changes needed? The result is incorrect. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added Unit test. Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 21:24:02 +08:00
Maxim Gekk	459e757ed4	[SPARK-30668][SQL] Support `SimpleDateFormat` patterns in parsing timestamps/dates strings ### What changes were proposed in this pull request? In the PR, I propose to partially revert the commit `51a6ba0181`, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`. ### Why are the changes needed? To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added new test to `DateFunctionsSuite` - Restored additional test cases in `JsonInferSchemaSuite`. Closes #27441 from MaxGekk/support-simpledateformat. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 18:48:45 +08:00
HyukjinKwon	692e3ddb4e	[SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size ### What changes were proposed in this pull request? This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given: - `spark.sql.execution.pandas.arrowSafeTypeConversion` - `spark.sql.execution.pandas.respectSessionTimeZone` - `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName` - other configurations like `spark.sql.execution.arrow.*`. ### Why are the changes needed? To make configuration names consistent. ### Does this PR introduce any user-facing change? No because this configuration was not released yet. ### How was this patch tested? Existing tests should cover. Closes #27450 from HyukjinKwon/SPARK-27870-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-05 11:38:33 +09:00
Dongjoon Hyun	898716980d	Revert "[SPARK-28310][SQL] Support (FIRST_VALUE\|LAST_VALUE)(expr[ (IGNORE\|RESPECT) NULLS]?) syntax" ### What changes were proposed in this pull request? This reverts commit `b89c3de1a4`. ### Why are the changes needed? `FIRST_VALUE` is used only for window expression. Please see the discussion on https://github.com/apache/spark/pull/25082 . ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Pass the Jenkins. Closes #27458 from dongjoon-hyun/SPARK-28310. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-04 17:26:46 -08:00
Liang-Chi Hsieh	7631275f97	[SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser ### What changes were proposed in this pull request? This is a follow-up for #22787. In #22787 we disallowed empty strings in the JSON parser except for string and binary types. This follow-up adds a legacy config for restoring the previous behavior of allowing empty strings. ### Why are the changes needed? Adding a legacy config to make migration easy for Spark users. ### Does this PR introduce any user-facing change? Yes. If this legacy config is set to true, users can restore the behavior prior to Spark 3.0.0. ### How was this patch tested? Unit test. Closes #27456 from viirya/SPARK-25040-followup. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-04 17:22:23 -08:00
Maxim Gekk	f2dd082544	[SPARK-30725][SQL] Make legacy SQL configs as internal configs ### What changes were proposed in this pull request? All legacy SQL configs are marked as internal configs. In particular, the following configs are updated as internals: - spark.sql.legacy.sizeOfNull - spark.sql.legacy.replaceDatabricksSparkAvro.enabled - spark.sql.legacy.typeCoercion.datetimeToString.enabled - spark.sql.legacy.looseUpcast - spark.sql.legacy.arrayExistsFollowsThreeValuedLogic ### Why are the changes needed? In general case, users shouldn't change legacy configs, so, they can be marked as internals. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Should be tested by jenkins build and run tests. Closes #27448 from MaxGekk/legacy-internal-sql-conf. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-04 21:17:05 +08:00
maryannxue	6097b343ba	[SPARK-30717][SQL] AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression` ### What changes were proposed in this pull request? This PR is to fix a potential bug in AQE where an `ExecSubqueryExpression` could be mistakenly replaced with another `ExecSubqueryExpression` with the same `ListQuery` but a different `child` expression. This is because a ListQuery's id can only identify the ListQuery itself, not the parent expression `InSubquery`, but right now the `subqueryMap` in `InsertAdaptiveSparkPlan` uses the `ListQuery`'s id as key and the corresponding `InSubqueryExec` for the `ListQuery`'s parent expression as value. So the fix uses the corresponding `SubqueryExec` for the `ListQuery` itself as the map's value. ### Why are the changes needed? This logical bug could potentially cause a wrong query plan, which could throw an exception related to unresolved columns. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing UTs. Closes #27446 from maryannxue/spark-30717. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-04 12:31:44 +08:00
fuwhu	47659a0675	[SPARK-30525][SQL] HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning ### What changes were proposed in this pull request? HiveTableScanExec no longer prunes partitions again after SessionCatalog.listPartitionsByFilter is called. ### Why are the changes needed? If spark.sql.hive.metastorePartitionPruning is true, HiveTableScanExec pushes partition pruning down to the Hive metastore and then prunes the returned partitions again using the partition filters, because some predicates, e.g. "b like 'xyz'", are not supported in the Hive metastore. This problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, which now returns exactly the partitions we want, so it is no longer necessary to prune a second time in HiveTableScanExec. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing unit tests. Closes #27232 from fuwhu/SPARK-30525. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-04 01:24:53 +08:00
Yuanjian Li	a4912cee61	[SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf ### What changes were proposed in this pull request? Put the configs below needed by Structured Streaming UI into StaticSQLConf: - spark.sql.streaming.ui.enabled - spark.sql.streaming.ui.retainedProgressUpdates - spark.sql.streaming.ui.retainedQueries ### Why are the changes needed? Make all SS UI configs consistent with other similar configs in usage and naming. ### Does this PR introduce any user-facing change? Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`. ### How was this patch tested? Existing UT. Closes #27425 from xuanyuanking/SPARK-29543-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-02-02 23:37:13 -08:00
Burak Yavuz	2eccfd8a73	[SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView ### What changes were proposed in this pull request? Adds NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method for SessionCatalog. ### Why are the changes needed? This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to DataSourceV2SessionCatalogSuite Closes #27423 from brkyvz/isViewF. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-03 14:08:59 +08:00
Liang-Chi Hsieh	8eecc20b11	[SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" ## What changes were proposed in this pull request? This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table. The original `SHOW CREATE TABLE` now always shows Spark DDL. If given a Hive table, it tries to generate Spark DDL. For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception on unsupported serde configuration. Arguably, some Hive fileformat + row serde combinations might be mapped to a Spark data source, e.g., CSV. That is not included in this PR. To be conservative, it may not be supported. For Hive serde properties, for now this doesn't save them to Spark DDL because it may not be useful to keep Hive serde properties in a Spark table. ## How was this patch tested? Added test. Closes #24938 from viirya/SPARK-27946. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 19:55:25 -08:00
yi.wu	82b4f753a0	[SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource ### What changes were proposed in this pull request? This PR adds a `SparkSession.executeCommand` API that lets an external datasource execute an arbitrary command, like ``` val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions") ``` Note that the command doesn't execute in Spark, but inside an external execution engine that depends on the data source. It is executed eagerly when `executeCommand` is called, and the returned `DataFrame` contains the output of the command (if any). ### Why are the changes needed? This can be useful when a user wants to execute some commands outside of Spark. For example, executing a custom DDL/DML command for JDBC, creating an index for ElasticSearch, creating cores for Solr and so on (as HyukjinKwon suggested). Previously, the user needed to use an option to achieve the goal, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it is more convenient for the user to achieve the same goal. ### Does this PR introduce any user-facing change? Yes, a new API on `SparkSession` and a new interface `ExternalCommandRunnableProvider`. ### How was this patch tested? Added a new test suite. Closes #27199 from Ngone51/dev-executeCommand. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 15:05:26 -08:00
Maxim Gekk	2d4b5eaee4	[SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double ### What changes were proposed in this pull request? - Replace `new Integer(0)` by a serializable instance in RDD.scala - Use `.valueOf()` instead of the constructors of `java.lang.Integer` and `java.lang.Double` because the constructors have been deprecated, see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html ### Why are the changes needed? This fixes the following warnings: 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite Closes #27399 from MaxGekk/eliminate-warning-part4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-31 15:03:16 -06:00
Kousuke Saruta	18bc4e55ef	[SPARK-30684][WEBUI] Show the descripton of metrics for WholeStageCodegen in DAG viz ### What changes were proposed in this pull request? Added descriptions for the metrics shown in the WholeStageCodegen node in DAG viz. This is before the change is applied. ![before-changed](https://user-images.githubusercontent.com/4736016/73469870-5cf16480-43ca-11ea-9a13-714083508a3b.png) And the following is after the change. ![after-fixing-layout](https://user-images.githubusercontent.com/4736016/73469364-983f6380-43c9-11ea-8b7e-ddab030d0270.png) For this change, I also modified the layout of DAG viz, because I noticed it's not enough to just add the description. The following is without changing the layout. ![layout-is-broken](https://user-images.githubusercontent.com/4736016/73470178-cffadb00-43ca-11ea-86d7-aed109b105e6.png) ### Why are the changes needed? Users can't understand what those metrics mean. ### Does this PR introduce any user-facing change? Yes. The layout is changed slightly. ### How was this patch tested? I confirmed the result of DAG viz with the following 3 operations. `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show` `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output")` `sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output")` Closes #27405 from sarutak/sql-dag-metrics. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 11:58:52 -08:00
Wenchen Fan	33546d637d	Revert "[SPARK-30036][SQL] Fix: REPARTITION hint does not work with order by" This reverts commit `a2de20c0e6`.	2020-02-01 03:02:52 +08:00
Jungtaek Lim (HeartSaVioR)	5e0faf9a3d	[SPARK-29779][SPARK-30479][CORE][SQL][FOLLOWUP] Reflect review comments on post-hoc review ### What changes were proposed in this pull request? This PR reflects review comments on post-hoc review among PRs for SPARK-29779 (#27085), SPARK-30479 (#27164). The list of review comments this PR addresses are below: * https://github.com/apache/spark/pull/27085#discussion_r373304218 * https://github.com/apache/spark/pull/27164#discussion_r373300793 * https://github.com/apache/spark/pull/27164#discussion_r373301193 * https://github.com/apache/spark/pull/27164#discussion_r373301351 I also applied review comments to the CORE module (BasicEventFilterBuilder.scala) as well, as the review comments for SQL/core module (SQLEventFilterBuilder.scala) can be applied there as well. ### Why are the changes needed? There're post-hoc reviews on PRs for such issues, like links in above section. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27414 from HeartSaVioR/SPARK-28869-SPARK-29779-SPARK-30479-FOLLOWUP-posthoc-reviews. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 10:17:07 -08:00
Tathagata Das	481e5211d2	[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits This PR solves two bugs related to streaming limits Bug 1 (SPARK-30658): Limit before a streaming aggregate (i.e. `df.limit(5).groupBy().count()`) in complete mode was not being planned as a stateful streaming limit. The planner rule planned a logical limit with a stateful streaming limit plan only if the query is in append mode. As a result, instead of allowing max 5 rows across batches, the planned streaming query was allowing 5 rows in every batch thus producing incorrect results. Solution: Change the planner rule to plan the logical limit with a streaming limit plan even when the query is in complete mode if the logical limit has no stateful operator before it. Bug 2 (SPARK-30657): `LocalLimitExec` does not consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. `df.dropDuplicates().limit(5)`), the state changes of streaming duplicate may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed). Solution: Change the planner rule to always use a new `StreamingLocalLimitExec` which always fully consumes the iterator. This is the safest thing to do. However, this will introduce a performance regression as consuming the iterator is extra work. To minimize this performance impact, add an additional post-planner optimization rule to replace `StreamingLocalLimitExec` with `LocalLimitExec` when there is no stateful operator before the limit that could be affected by it. No Updated incorrect unit tests and added new ones Closes #27373 from tdas/SPARK-30657. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-01-31 09:27:34 -08:00
yi.wu	5ccbb38a71	[SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand All credit to Ngone51, Closes #27293. ### What changes were proposed in this pull request? This PR improves `AlterTableAddPartitionCommand` by: 1. adding an internal config for the partition batch size to avoid hard-coding it 2. reusing `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing to improve code reuse ### Why are the changes needed? Improve code quality. ### Does this PR introduce any user-facing change? Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change. ### How was this patch tested? Pass Jenkins. Closes #27413 from xuanyuanking/SPARK-29938. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-01 01:03:00 +08:00

9041 commits