ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yuanjian Li	d9b3069412	[SPARK-30125][SQL] Remove PostgreSQL dialect ### What changes were proposed in this pull request? Reprocess all PostgreSQL dialect related PRs, listing in order: - #25158: PostgreSQL integral division support [revert] - #25170: UT changes for the integral division support [revert] - #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert] - #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert] - #26112: Date substraction support [keep the ANSI-compliant part] - #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert] - #26463: Cast to boolean support for PostgreSQL dialect [revert] - #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part] ### Why are the changes needed? As the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove PostgreSQL dialect form code base for several reasons: 1. The current approach makes the codebase complicated and hard to maintain. 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now. ### Does this PR introduce any user-facing change? Yes, the config `spark.sql.dialect` will be removed. ### How was this patch tested? Existing UT. Closes #26763 from xuanyuanking/SPARK-30125. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-11 01:22:34 +08:00
Anton Okolnychyi	a9f1809a2a	[SPARK-30206][SQL] Rename normalizeFilters in DataSourceStrategy to be generic ### What changes were proposed in this pull request? This PR renames `normalizeFilters` in `DataSourceStrategy` to be more generic as the logic is not specific to filters. ### Why are the changes needed? These changes are needed to support PR #26751. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26830 from aokolnychyi/rename-normalize-exprs. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-10 07:49:22 -08:00
yi.wu	aa9da9365f	[SPARK-30151][SQL] Issue better error message when user-specified schema mismatched ### What changes were proposed in this pull request? Issue better error message when user-specified schema and not match relation schema ### Why are the changes needed? Inspired by https://github.com/apache/spark/pull/25248#issuecomment-559594305, user could get a weird error message when type mapping behavior change between Spark schema and datasource schema(e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell user what is really happening here to make user be more clearly about the error. ### Does this PR introduce any user-facing change? Yes, user will see error message changes. ### How was this patch tested? Updated existed tests. Closes #26781 from Ngone51/dev-mismatch-schema. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-10 20:56:21 +08:00
Luan	3d98c9f985	[SPARK-30179][SQL][TESTS] Improve test in SingleSessionSuite ### What changes were proposed in this pull request? improve the temporary functions test in SingleSessionSuite by verifying the result in a query ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #26812 from leoluan2009/SPARK-30179. Authored-by: Luan <xuluan@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-10 10:57:32 +09:00
Sean Owen	36fa1980c2	[SPARK-30158][SQL][CORE] Seq -> Array for sc.parallelize for 2.13 compatibility; remove WrappedArray ### What changes were proposed in this pull request? Use Seq instead of Array in sc.parallelize, with reference types. Remove usage of WrappedArray. ### Why are the changes needed? These both enable building on Scala 2.13. ### Does this PR introduce any user-facing change? None ### How was this patch tested? Existing tests Closes #26787 from srowen/SPARK-30158. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-09 14:41:48 -06:00
Jungtaek Lim (HeartSaVioR)	538b8d101c	[SPARK-30159][SQL][FOLLOWUP] Fix lint-java via removing unnecessary imports ### What changes were proposed in this pull request? This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). Looks like Jenkins build may have different policy on checking Java style check or less accurate. ### Why are the changes needed? Java linter starts complaining. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? lint-java passed locally This closes #26819 Closes #26818 from HeartSaVioR/SPARK-30159-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-09 08:57:20 -08:00
Gengliang Wang	a717d219a6	[SPARK-30159][SQL][TESTS] Fix the method calls of `QueryTest.checkAnswer` ### What changes were proposed in this pull request? Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen. The actual exceptions are thrown in the trait `QueryTest`. However, there are some test suites(`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers. After this PR: 1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results. 2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites use this one to customize the test failure message. 3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`. ### Why are the changes needed? We should fix these method calls to perform actual validations in test suites. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #26788 from gengliangwang/fixCheckAnswer. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-09 22:19:08 +09:00
fuwhu	c2f29d5ea5	[SPARK-30138][SQL] Separate configuration key of max iterations for analyzer and optimizer ### What changes were proposed in this pull request? separate the configuration keys "spark.sql.optimizer.maxIterations" and "spark.sql.analyzer.maxIterations". ### Why are the changes needed? Currently, both Analyzer and Optimizer use conf "spark.sql.optimizer.maxIterations" to set the max iterations to run, which is a little confusing. It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for analyzer max iterations. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing unit tests. Closes #26766 from fuwhu/SPARK-30138. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-09 19:43:32 +09:00
Aman Omer	dcea7a4c9a	[SPARK-29883][SQL] Implement a helper method for aliasing bool_and() and bool_or() ### What changes were proposed in this pull request? This PR introduces a method `expressionWithAlias` in class `FunctionRegistry` which is used to register function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`. ### Why are the changes needed? Error message is wrong when alias name is used for `BoolAnd` & `BoolOr`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested manually. For query, `select every('true');` Output before this PR, > Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7; After this PR, > Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7; Closes #26712 from amanomer/29883. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-09 13:23:16 +08:00
Pablo Langa	bca9de6684	[SPARK-29922][SQL] SHOW FUNCTIONS should do multi-catalog resolution ### What changes were proposed in this pull request? Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands. We don’t have this methods in the catalog to implement an V2 command * catalog.listFunctions ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing `SHOW FUNCTIONS LIKE namespace.function` ### Does this PR introduce any user-facing change? Yes. When running SHOW FUNCTIONS LIKE namespace.function Spark fails the command if the current catalog is set to a v2 catalog. ### How was this patch tested? Unit tests. Closes #26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-12-08 20:15:09 -08:00
Kent Yao	e88d74052b	[SPARK-30147][SQL] Trim the string when cast string type to booleans ### What changes were proposed in this pull request? Now, we trim the string when casting string value to those `canCast` types values, e.g. int, double, decimal, interval, date, timestamps, except for boolean. This behavior makes type cast and coercion inconsistency in Spark. Not fitting ANSI SQL standard either. ``` If TD is boolean, then Case: a) If SD is character string, then SV is replaced by TRIM ( BOTH ' ' FROM VE ) Case: i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid value of the data type TD, then let TV be that value. ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast. b) If SD is boolean, then TV is SV ``` In this pull request, we trim all the whitespaces from both ends of the string before converting it to a bool value. This behavior is as same as others, but a bit different from sql standard, which trim only spaces. ### Why are the changes needed? Type cast/coercion consistency ### Does this PR introduce any user-facing change? yes, string with whitespaces in both ends will be trimmed before converted to booleans. e.g. `select cast('\t true' as boolean)` results `true` now, before this pr it's `null` ### How was this patch tested? add unit tests Closes #26776 from yaooqinn/SPARK-30147. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-12-07 15:03:51 +09:00
Aman Omer	51aa7a920e	[SPARK-30148][SQL] Optimize writing plans if there is an analysis exception ### What changes were proposed in this pull request? Optimized QueryExecution.scala#writePlans(). ### Why are the changes needed? If any query fails in Analysis phase and gets AnalysisException, there is no need to execute further phases since those will return a same result i.e, AnalysisException. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually Closes #26778 from amanomer/optExplain. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-07 10:58:02 +09:00
Sean Owen	a30ec19a73	[SPARK-30155][SQL] Rename parse() to parseString() to avoid conflict in Scala 2.13 ### What changes were proposed in this pull request? Rename internal method LegacyTypeStringParser.parse() to parseString(). ### Why are the changes needed? In Scala 2.13, the parse() definition clashes with supertype declarations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #26784 from srowen/SPARK-30155. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-06 16:16:28 -08:00
wuyi	58be82ad4b	[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax ### What changes were proposed in this pull request? In this PR, we propose to use the value of `spark.sql.source.default` as the provider for `CREATE TABLE` syntax instead of `hive` in Spark 3.0. And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`. ### Why are the changes needed? 1. Currently, `CREATE TABLE` syntax use hive provider to create table while `DataFrameWriter.saveAsTable` API using the value of `spark.sql.source.default` as a provider to create table. It would be better to make them consistent. 2. User may gets confused in some cases. For example: ``` CREATE TABLE t1 (c1 INT) USING PARQUET; CREATE TABLE t2 (c1 INT); ``` In these two DDLs, use may think that `t2` should also use parquet as default provider since Spark always advertise parquet as the default format. However, it's hive in this case. On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCATS=true`: ``` CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE; CREATE TABLE t4 AS SELECT 1 AS VALUE; ``` And these two cases together can be really confusing. 3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior. ### Does this PR introduce any user-facing change? Yes, before this PR, using `CREATE TABLE` syntax will use hive provider. But now, it use the value of `spark.sql.source.default` as its provider. ### How was this patch tested? Added tests in `DDLParserSuite` and `HiveDDlSuite`. Closes #26736 from Ngone51/dev-create-table-using-parquet-by-default. Lead-authored-by: wuyi <yi.wu@databricks.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-07 02:15:25 +08:00
Liang-Chi Hsieh	c1a5f94973	[SPARK-30112][SQL] Allow insert overwrite same table if using dynamic partition overwrite ### What changes were proposed in this pull request? This patch proposes to allow insert overwrite same table if using dynamic partition overwrite. ### Why are the changes needed? Currently, Insert overwrite cannot overwrite to same table even it is dynamic partition overwrite. But for dynamic partition overwrite, we do not delete partition directories ahead. We write to staging directories and move data to final partition directories. We should be able to insert overwrite to same table under dynamic partition overwrite. This enables users to read data from a table and insert overwrite to same table by using dynamic partition overwrite. Because this is not allowed for now, users need to write to other temporary location and move it back to the table. ### Does this PR introduce any user-facing change? Yes. Users can insert overwrite same table if using dynamic partition overwrite. ### How was this patch tested? Unit test. Closes #26752 from viirya/dynamic-overwrite-same-table. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-06 09:22:16 -08:00
Sean Owen	c8ed71b3cd	[SPARK-30011][SQL] Inline 2.12 "AsIfIntegral" classes, not present in 2.13 ### What changes were proposed in this pull request? Classes like DoubleAsIfIntegral are not found in Scala 2.13, but used in the current build. This change 'inlines' the 2.12 implementation and makes it work with both 2.12 and 2.13. ### Why are the changes needed? To cross-compile with 2.13. ### Does this PR introduce any user-facing change? It should not as it copies in 2.12's existing behavior. ### How was this patch tested? Existing tests. Closes #26769 from srowen/SPARK-30011. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-06 08:15:38 -08:00
gengjiaan	187f3c1773	[SPARK-28083][SQL] Support LIKE ... ESCAPE syntax ## What changes were proposed in this pull request? The syntax 'LIKE predicate: ESCAPE clause' is a ANSI SQL. For example: ``` select 'abcSpark_13sd' LIKE '%Spark\\_%'; //true select 'abcSpark_13sd' LIKE '%Spark/_%'; //false select 'abcSpark_13sd' LIKE '%Spark"_%'; //false select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; //true select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; //true select 'abcSpark%13sd' LIKE '%Spark\\%%'; //true select 'abcSpark%13sd' LIKE '%Spark/%%'; //false select 'abcSpark%13sd' LIKE '%Spark"%%'; //false select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; //true select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; //true select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; //true select 'abcSpark/13sd' LIKE '%Spark//_%'; //false select 'abcSpark"13sd' LIKE '%Spark""_%'; //false select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; //true select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; //true ``` But Spark SQL only supports 'LIKE predicate'. Note: If the input string or pattern string is null, then the result is null too. There are some mainstream database support the syntax. PostgreSQL: https://www.postgresql.org/docs/11/functions-matching.html Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape MySQL: https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708 https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A ## How was this patch tested? Exists UT and new UT. This PR merged to my production environment and runs above sql: ``` spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%'; true Time taken: 0.119 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%'; false Time taken: 0.103 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%'; false Time taken: 0.096 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; true Time taken: 0.096 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; true Time taken: 0.092 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%'; true Time taken: 0.109 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%'; false Time taken: 0.1 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%'; false Time taken: 0.081 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; true Time taken: 0.095 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; true Time taken: 0.113 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; true Time taken: 0.078 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%'; false Time taken: 0.067 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%'; false Time taken: 0.084 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; true Time taken: 0.091 seconds, Fetched 1 row(s) spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; true Time taken: 0.091 seconds, Fetched 1 row(s) ``` I create a table and its schema is: ``` spark-sql> desc formatted gja_test; key string NULL value string NULL other string NULL # Detailed Table Information Database test Table gja_test Owner test Created Time Wed Apr 10 11:06:15 CST 2019 Last Access Thu Jan 01 08:00:00 CST 1970 Created By Spark 2.4.1-SNAPSHOT Type MANAGED Provider hive Table Properties [transient_lastDdlTime=1563443838] Statistics 26 bytes Location hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties [field.delim= , serialization.format= ] Partition Provider Catalog Time taken: 0.642 seconds, Fetched 21 row(s) ``` Table `gja_test` exists three rows of data. ``` spark-sql> select * from gja_test; a A ao b B bo "__ """__ " Time taken: 0.665 seconds, Fetched 3 row(s) ``` At finally, I test this function: ``` spark-sql> select * from gja_test where key like value escape '"'; "__ """__ " Time taken: 0.687 seconds, Fetched 1 row(s) ``` Closes #25001 from beliefer/ansi-sql-like. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-12-06 00:07:38 -08:00
Terry Kim	b86d4bb931	[SPARK-30001][SQL] ResolveRelations should handle both V1 and V2 tables ### What changes were proposed in this pull request? This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and create an appropriate relation. ### Why are the changes needed? Currently there are two issues: 1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving relation can load a table twice, which is a perf regression (e.g., Hive metastore can be accessed twice). 2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails: ``` scala> sql("create table t using csv as select 1 as i") res2: org.apache.spark.sql.DataFrame = [] scala> sql("select * from t").show +---+ \| i\| +---+ \| 1\| +---+ scala> sql("select * from spark_catalog.t").show org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14; 'Project [] +- 'UnresolvedRelation [spark_catalog, t] ``` ### Does this PR introduce any user-facing change? Yes. Now the catalog name is resolved correctly: ``` scala> sql("create table t using csv as select 1 as i") res0: org.apache.spark.sql.DataFrame = [] scala> sql("select from t").show +---+ \| i\| +---+ \| 1\| +---+ scala> sql("select * from spark_catalog.t").show +---+ \| i\| +---+ \| 1\| +---+ ``` ### How was this patch tested? Added new tests. Closes #26684 from imback82/resolve_relation. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-06 15:45:13 +08:00
madianjun	a5ccbced8c	[SPARK-30067][CORE] Fix fragment offset comparison in getBlockHosts ### What changes were proposed in this pull request? A bug fixed about the code in getBlockHosts() function. In the case "The fragment ends at a position within this block", the end of fragment should be before the end of block，where the "end of block" means `b.getOffset + b.getLength`，not `b.getLength`. ### Why are the changes needed? When comparing the fragment end and the block end，we should use fragment's `offset + length`，and then compare to the block's `b.getOffset + b.getLength`, not the block's length. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? No test. Closes #26650 from mdianjun/fix-getBlockHosts. Authored-by: madianjun <madianjun@jd.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-05 23:39:49 -08:00
Jungtaek Lim (HeartSaVioR)	25431d79f7	[SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink ### What changes were proposed in this pull request? This patch prevents the cleanup operation in FileStreamSource if the source files belong to the FileStreamSink. This is needed because the output of FileStreamSink can be read with multiple Spark queries and queries will read the files based on the metadata log, which won't reflect the cleanup. To simplify the logic, the patch only takes care of the case of when the source path without glob pattern refers to the output directory of FileStreamSink, via checking FileStreamSource to see whether it leverages metadata directory or not to list the source files. ### Why are the changes needed? Without this patch, if end users turn on cleanup option with the path which is the output of FileStreamSink, there may be out of sync between metadata and available files which may break other queries reading the path. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT. Closes #26590 from HeartSaVioR/SPARK-29953. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-12-05 21:46:28 -08:00
Sean Owen	7782b61a31	[SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748 I told you it'd be iterative! Closes #26765 from srowen/SPARK-29392.3. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-05 13:48:29 -08:00
Kent Yao	b9cae37750	[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres # What changes were proposed in this pull request? Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy: ```scala /** * For [[Add]]: * 1. if both side are interval, stays the same; * 2. else if one side is interval, turns it to [[TimeAdd]]; * 3. else if one side is date, turns it to [[DateAdd]] ; * 4. else stays the same. * * For [[Subtract]]: * 1. if both side are interval, stays the same; * 2. else if the right side is an interval, turns it to [[TimeSub]]; * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]]; * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]]; * 5. else if the left side is date, turns it to [[DateSub]]; * 6. else turns it to stays the same. * * For [[Multiply]]: * 1. If one side is interval, turns it to [[MultiplyInterval]]; * 2. otherwise, stays the same. * * For [[Divide]]: * 1. If the left side is interval, turns it to [[DivideInterval]]; * 2. otherwise, stays the same. */ ``` Besides, we change datetime functions from implicit cast types to strict ones, all available type coercions happen in `DateTimeOperations` coercion rule. ### Why are the changes needed? Feature Parity between PostgreSQL and Spark, and make the null semantic consistent with Spark. ### Does this PR introduce any user-facing change? 1. date_add/date_sub functions only accept int/tinynit/smallint as the second arg, double/string etc, are forbidden like hive, which produce weird results. ### How was this patch tested? add ut Closes #26412 from yaooqinn/SPARK-29774. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 22:03:44 +08:00
Kent Yao	332e252a14	[SPARK-29425][SQL] The ownership of a database should be respected ### What changes were proposed in this pull request? Keep the owner of a database when executing alter database commands ### Why are the changes needed? Spark will inadvertently delete the owner of a database for executing databases ddls ### Does this PR introduce any user-facing change? NO ### How was this patch tested? add and modify uts Closes #26080 from yaooqinn/SPARK-29425. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 16:14:27 +08:00
turbofei	0ab922c1eb	[SPARK-29860][SQL] Fix dataType mismatch issue for InSubquery ### What changes were proposed in this pull request? There is an issue for InSubquery expression. For example, there are two tables `ta` and `tb` created by the below statements. ``` sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") ``` This statement below would thrown dataType mismatch exception. ``` sql("select * from ta where id in (select id from tb)").show() ``` However, this similar statement could execute successfully. ``` sql("select * from ta where id in ((select id from tb))").show() ``` The root cause is that, for `InSubquery` expression, it does not find a common type for two decimalType like `In` expression. Besides that, for `InSubquery` expression, it also does not find a common type for DecimalType and double/float/bigInt. In this PR, I fix this issue by finding widerType for `InSubquery` expression when DecimalType is involved. ### Why are the changes needed? Some InSubquery would throw dataType mismatch exception. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #26485 from turboFei/SPARK-29860-in-subquery. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 16:00:16 +08:00
Aman Omer	0bd8b995d6	[SPARK-30093][SQL] Improve error message for creating view ### What changes were proposed in this pull request? Improved error message while creating views. ### Why are the changes needed? Error message should suggest user to use TEMPORARY keyword while creating permanent view referred by temporary view. https://github.com/apache/spark/pull/26317#discussion_r352377363 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated test case. Closes #26731 from amanomer/imp_err_msg. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 15:28:07 +08:00
Sean Owen	ebd83a544e	[SPARK-30009][CORE][SQL][FOLLOWUP] Remove OrderingUtil and Utils.nanSafeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly ### What changes were proposed in this pull request? Follow up on https://github.com/apache/spark/pull/26654#discussion_r353826162 Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`. ### Why are the changes needed? Simplification of the previous change, which existed to support Scala 2.13 migration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #26761 from srowen/SPARK-30009.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-05 11:27:25 +08:00
Sean Owen	2ceed6f32c	[SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings ### What changes were proposed in this pull request? Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent. ### Why are the changes needed? Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13. The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings. While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`. ### Does this PR introduce any user-facing change? Should not change behavior. ### How was this patch tested? Existing tests. Closes #26748 from srowen/SPARK-29392.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-04 15:03:26 -08:00
07ARB	a2102c81ee	[SPARK-29453][WEBUI] Improve tooltips information for SQL tab ### What changes were proposed in this pull request? Adding tooltip to SQL tab for better usability. ### Why are the changes needed? There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain. ### Does this PR introduce any user-facing change? yes. ![Screenshot 2019-11-23 at 9 47 41 AM](https://user-images.githubusercontent.com/8948111/69472963-aaec5980-0dd6-11ea-881a-fe6266171054.png) ### How was this patch tested? Manual test. Closes #26641 from 07ARB/SPARK-29453. Authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-12-04 12:33:43 -06:00
Aman Omer	55132ae9c9	[SPARK-30099][SQL] Improve Analyzed Logical Plan ### What changes were proposed in this pull request? Avoid duplicate error message in Analyzed Logical plan. ### Why are the changes needed? Currently, when any query throws `AnalysisException`, same error message will be repeated because of following code segment. `04a5b8f5f8/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (L157-L166)` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Result of `explain extended select * from wrong;` BEFORE > == Parsed Logical Plan == > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > AFTER > == Parsed Logical Plan == > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > Closes #26734 from amanomer/cor_APlan. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-04 13:51:40 +08:00
xiaodeshan	196ea936c3	[SPARK-30106][SQL][TEST] Fix the test of DynamicPartitionPruningSuite ### What changes were proposed in this pull request? Changed the test DPP triggers only for certain types of query in DynamicPartitionPruningSuite. ### Why are the changes needed? The sql has no partition key. The description "no predicate on the dimension table" is not right. So fix it. ``` Given("no predicate on the dimension table") withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") { val df = sql( """ \|SELECT * FROM fact_sk f \|JOIN dim_store s \|ON f.date_id = s.store_id """.stripMargin) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated UT Closes #26744 from deshanxiao/30106. Authored-by: xiaodeshan <xiaodeshan@xiaomi.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-03 14:27:48 -08:00
Sean Owen	4193d2f4cc	[SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13 ### What changes were proposed in this pull request? Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications. Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that one class can't support both. ### Why are the changes needed? To support building for Scala 2.13 in the future. ### Does this PR introduce any user-facing change? There should be no behavior change. ### How was this patch tested? Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates. Closes #26728 from srowen/SPARK-30012. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-03 08:59:43 -08:00
John Ayad	8c2849a695	[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs ### What changes were proposed in this pull request? Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced. ### Why are the changes needed? This Scala code snippet: ``` import scala.math; println(Double.NaN.toLong) ``` returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| NaN\| 0\| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 2\| \| 0.0\| 3\| \| 2.0\| 2\| +-----+-----+ ``` ### Does this PR introduce any user-facing change? Yes, after the PR, running the same above code snippet returns the correct expected results: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| NaN\| 0\| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ \|index\|value\| +-----+-----+ \| 1.0\| 0\| \| 0.0\| 3\| \| 2.0\| 0\| +-----+-----+ ``` ### How was this patch tested? Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double` Closes #26738 from johnhany97/SPARK-30082. Lead-authored-by: John Ayad <johnhany97@gmail.com> Co-authored-by: John Ayad <jayad@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-04 00:04:55 +08:00
Kent Yao	65552a81d1	[SPARK-30083][SQL] visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking ### What changes were proposed in this pull request? `UnaryPositive` only accepts numeric and interval as we defined, but what we do for this in `AstBuider.visitArithmeticUnary` is just bypassing it. This should not be omitted for the type checking requirement. ### Why are the changes needed? bug fix, you can find a pre-discussion here https://github.com/apache/spark/pull/26578#discussion_r347350398 ### Does this PR introduce any user-facing change? yes, +non-numeric-or-interval is now invalid. ``` -- !query 14 select +date '1900-01-01' -- !query 14 schema struct<DATE '1900-01-01':date> -- !query 14 output 1900-01-01 -- !query 15 select +timestamp '1900-01-01' -- !query 15 schema struct<TIMESTAMP '1900-01-01 00:00:00':timestamp> -- !query 15 output 1900-01-01 00:00:00 -- !query 16 select +map(1, 2) -- !query 16 schema struct<map(1, 2):map<int,int>> -- !query 16 output {1:2} -- !query 17 select +array(1,2) -- !query 17 schema struct<array(1, 2):array<int>> -- !query 17 output [1,2] -- !query 18 select -'1' -- !query 18 schema struct<(- CAST(1 AS DOUBLE)):double> -- !query 18 output -1.0 -- !query 19 select -X'1' -- !query 19 schema struct<> -- !query 19 output org.apache.spark.sql.AnalysisException cannot resolve '(- X'01')' due to data type mismatch: argument 1 requires (numeric or interval) type, however, 'X'01'' is of binary type.; line 1 pos 7 -- !query 20 select +X'1' -- !query 20 schema struct<X'01':binary> -- !query 20 output ``` ### How was this patch tested? add ut check Closes #26716 from yaooqinn/SPARK-30083. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 23:42:21 +08:00
Kent Yao	39291cff95	[SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset ### What changes were proposed in this pull request? Now the min/max/sum/avg are support for intervals, we should also enable it in RelationalGroupedDataset ### Why are the changes needed? API consistency improvement ### Does this PR introduce any user-facing change? yes, Dataset support min/max/sum/avg(mean) on intervals ### How was this patch tested? add ut Closes #26681 from yaooqinn/SPARK-30048. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 18:40:14 +08:00
herman	d7b268ab32	[SPARK-29348][SQL] Add observable Metrics for Streaming queries ### What changes were proposed in this pull request? Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes batch query or reaches streaming epoch) a named event is emitted that contains the metrics for the data processed since the last completion point. A user can observe these metrics by attaching a listener to spark session, it depends on the execution mode which listener to attach: - Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map. - (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming. ### Why are the changes needed? This enabled observable metrics. ### Does this PR introduce any user-facing change? Yes. It adds the `observe` method to `Dataset`. ### How was this patch tested? - Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`. - Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`. - Added integration tests for streaming to the `StreamingQueryListenerSuite`. - Added integration tests for batch to the `DataFrameCallbackSuite`. Closes #26127 from hvanhovell/SPARK-29348. Authored-by: herman <herman@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-12-03 11:25:49 +01:00
wuyi	075ae1eeaf	[SPARK-29537][SQL] throw exception when user defined a wrong base path ### What changes were proposed in this pull request? When user defined a base path which is not an ancestor directory for all the input paths, throw exception immediately. ### Why are the changes needed? Assuming that we have a DataFrame[c1, c2] be written out in parquet and partitioned by c1. When using `spark.read.parquet("/path/to/data/c1=1")` to read the data, we'll have a DataFrame with column c2 only. But if we use `spark.read.option("basePath", "/path/from").parquet("/path/to/data/c1=1")` to read the data, we'll have a DataFrame with column c1 and c2. This's happens because a wrong base path does not actually work in `parsePartition()`, so paring would continue until it reaches a directory without "=". And I think the result of the second read way doesn't make sense. ### Does this PR introduce any user-facing change? Yes, with this change, user would hit `IllegalArgumentException ` when given a wrong base path while previous behavior doesn't. ### How was this patch tested? Added UT. Closes #26195 from Ngone51/dev-wrong-basePath. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-03 17:02:50 +08:00
sychen	332e593093	[SPARK-29943][SQL] Improve error messages for unsupported data type ### What changes were proposed in this pull request? Improve error messages for unsupported data type. ### Why are the changes needed? When the spark reads the hive table and encounters an unsupported field type, the exception message has only one unsupported type, and the user cannot know which field of which table. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ```create view t AS SELECT STRUCT('a' AS `$a`, 1 AS b) as q;``` current: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int> change: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q ```select * from t,t_normal_1,t_normal_2``` current: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int> change: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q, db: default, table: t Closes #26577 from cxzl25/unsupport_data_type_msg. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-03 10:07:09 +09:00
Ali Afroozeh	68034a8056	[SPARK-30072][SQL] Create dedicated planner for subqueries ### What changes were proposed in this pull request? This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subqueries plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent. As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes: ``` promote_precision(1.7) * promote_precision(avg(a)) ``` After the optimization, more specifically the constant folding rule, this expression becomes: ``` 1.7 * promote_precision(avg(a)) ``` Now if we run the analyzer on this optimized query again, we will get: ``` promote_precision(1.7) * promote_precision(promote_precision(avg(a))) ``` Which will later optimized as: ``` 1.7 * promote_precision(promote_precision(avg(a))) ``` As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_preceision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan. We opted to introduce dedicated planners for subuqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced. ### How was this patch tested? Unit tests Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-12-02 20:56:40 +01:00
Jungtaek Lim (HeartSaVioR)	54edaee586	[MINOR][SS] Add implementation note on overriding serialize/deserialize in HDFSMetadataLog methods' scaladoc ### What changes were proposed in this pull request? The patch adds scaladoc on `HDFSMetadataLog.serialize` and `HDFSMetadataLog.deserialize` for adding implementation note when overriding - HDFSMetadataLog calls `serialize` and `deserialize` inside try-finally and caller will do the resource (input stream, output stream) cleanup, so resource cleanup should not be performed in these methods, but there's no note on this (only code comment, not scaladoc) which is easy to be missed. ### Why are the changes needed? Contributors who are unfamiliar with the intention seem to think it as a bug if the resource is not cleaned up in serialize/deserialize of subclass of HDFSMetadataLog, and they couldn't know about the intention without reading the code of HDFSMetadataLog. Adding the note as scaladoc would expand the visibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Just a doc change. Closes #26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: dz <953396112@qq.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-12-02 09:01:45 -06:00
Wenchen Fan	e271664a01	[MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled ### What changes were proposed in this pull request? add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`. ### Why are the changes needed? to follow the existing naming style ### Does this PR introduce any user-facing change? no ### How was this patch tested? not needed Closes #26694 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 21:05:06 +08:00
Kent Yao	4e073f3c50	[SPARK-30047][SQL] Support interval types in UnsafeRow ### What changes were proposed in this pull request? Optimize aggregates on interval values from sort-based to hash-based, and we can use the `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance. ### Why are the changes needed? improve aggerates ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut and existing ones Closes #26680 from yaooqinn/SPARK-30047. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 20:47:23 +08:00
LantaoJin	04a5b8f5f8	[SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE ### What changes were proposed in this pull request? In SPARK-29421 (#26097) , we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`. Hive support `STORED AS` new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`. ### Why are the changes needed? See https://github.com/apache/spark/pull/26097#issue-327424759 ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat]; ### How was this patch tested? Add UTs. Closes #26466 from LantaoJin/SPARK-29839. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 16:11:58 +08:00
Yuanjian Li	169415ffac	[SPARK-30025][CORE] Continuous shuffle block fetching should be disabled by default when the old fetch protocol is used ### What changes were proposed in this pull request? Disable continuous shuffle block fetching when the old fetch protocol in use. ### Why are the changes needed? The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`. ### Does this PR introduce any user-facing change? Users will not get the exception related to continuous shuffle block fetching when old version of the external shuffle service is used. ### How was this patch tested? Existing UT. Closes #26663 from xuanyuanking/SPARK-30025. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 15:59:12 +08:00
Liang-Chi Hsieh	85cb388ae3	[SPARK-30050][SQL] analyze table and rename table should not erase hive table bucketing info ### What changes were proposed in this pull request? This patch adds Hive provider into table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase if it can not find hive provider in given table metadata. Rename table also has this issue. ### Why are the changes needed? Because running `ANALYZE TABLE` on a Hive table, if the table has bucketing info, will erase existing bucket info. ### Does this PR introduce any user-facing change? Yes. After this PR, running `ANALYZE TABLE` on Hive table, won't erase existing bucketing info. ### How was this patch tested? Unit test. Closes #26685 from viirya/fix-hive-bucket. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 13:40:11 +08:00
HyukjinKwon	51e69feb49	[SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map ### What changes were proposed in this pull request? This PR proposes to use foreach instead of misusing map as a small followup of #26476. This could cause some weird errors potentially and it's not a good practice anyway. See also SPARK-16694 ### Why are the changes needed? To avoid potential issues like SPARK-16694 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover. Closes #26729 from HyukjinKwon/SPARK-29851. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-02 13:40:00 +09:00
Yuanjian Li	d1465a1b0d	[SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled ### What changes were proposed in this pull request? 1. Make maxNumPostShufflePartitions config obey reducePostShfflePartitions config. 2. Update the description for all the SQLConf affected by `spark.sql.adaptive.enabled`. ### Why are the changes needed? Make the relation between these confs clearer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26664 from xuanyuanking/SPARK-9853-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 12:37:06 +08:00
Terry Kim	5a1896adcb	[SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root \|-- col1: string (nullable = true) \|-- col2: string (nullable = true) \|-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ \|col1\|col2\|col2\| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #26700 from imback82/na_drop. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 12:25:28 +08:00
wuyi	87ebfaf003	[SPARK-29956][SQL] A literal number with an exponent should be parsed to Double ### What changes were proposed in this pull request? For a literal number with an exponent(e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And user could still use `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to previous behavior. ### Why are the changes needed? According to ANSI standard of SQL, we see that the (part of) definition of `literal` : ``` <approximate numeric literal> ::= <mantissa> E <exponent> ``` which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal). And when we test Presto, we found that Presto also conforms to this standard: ``` presto:default> select typeof(1E2); _col0 -------- double (1 row) ``` ``` presto:default> select typeof(1.2); _col0 -------------- decimal(2,1) (1 row) ``` We also find that, actually, literals like `1E2` are parsed as Double before Spark2.1, but changed to Decimal after #14828 due to The difference between the two confuses most users as it said. But we also see support(from DB2 test) of original behavior at #14828 (comment). Although, we also see that PostgreSQL has its own implementation: ``` postgres=# select pg_typeof(1E2); pg_typeof ----------- numeric (1 row) postgres=# select pg_typeof(1.2); pg_typeof ----------- numeric (1 row) ``` We still think that Spark should also conform to this standard while considering SQL standard and Spark own history and majority DBMS and also user experience. ### Does this PR introduce any user-facing change? Yes. For `1E2`, before this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)] ``` After this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [100.0: double] ``` And for `1E-45`, before this PR: ``` org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 == SQL == select 1E-45 at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided ``` after this PR: ``` scala> spark.sql("select 1E-45"); res1: org.apache.spark.sql.DataFrame = [1.0E-45: double] ``` And before this PR, user may feel super weird to see that `select 1e40` works but `select 1e-40 fails`. And now, both of them work well. ### How was this patch tested? updated `literals.sql.out` and `ansi/literals.sql.out` Closes #26595 from Ngone51/SPARK-29956. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 11:34:56 +08:00
Yuming Wang	708ab57f37	[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. Spark SQL: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ \| CAST(1 AS DECIMAL(38,18)) \| +----------------------------+--+ \| 1.000000000000000000 \| +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \|1.000000000000000000 \| +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ \|CAST(1 AS DECIMAL(38,18))\| +-------------------------+ \| 1.000000000000000000\| +-------------------------+ ``` PostgreSQL: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` Presto: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #26697 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-02 09:02:39 +09:00
shahid	b182ed83f6	[SPARK-29724][SPARK-29726][WEBUI][SQL] Support JDBC/ODBC tab for HistoryServer WebUI ### What changes were proposed in this pull request? Support JDBC/ODBC tab for HistoryServer WebUI. Currently from Historyserver we can't access the JDBC/ODBC tab for thrift server applications. In this PR, I am doing 2 main changes 1. Refactor existing thrift server listener to support kvstore 2. Add history server plugin for thrift server listener and tab. ### Why are the changes needed? Users can access Thriftserver tab from History server for both running and finished applications, ### Does this PR introduce any user-facing change? Support for JDBC/ODBC tab for the WEBUI from History server ### How was this patch tested? Add UT and Manual tests 1. Start Thriftserver and Historyserver ``` sbin/stop-thriftserver.sh sbin/stop-historyserver.sh sbin/start-thriftserver.sh sbin/start-historyserver.sh ``` 2. Launch beeline `bin/beeline -u jdbc:hive2://localhost:10000` 3. Run queries Go to the JDBC/ODBC page of the WebUI from History server ![image](https://user-images.githubusercontent.com/23054875/68365501-cf013700-0156-11ea-84b4-fda8008c92c4.png) Closes #26378 from shahidki31/ThriftKVStore. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-11-29 19:44:31 -08:00

1 2 3 4 5 ...

8767 commits