ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kent Yao	54996be4d2	[SPARK-31527][SQL][TESTS][FOLLOWUP] Add a benchmark test for datetime add/subtract interval operations ### What changes were proposed in this pull request? With https://github.com/apache/spark/pull/28310, the date +/- interval(m, d, 0) operation became much faster. According to the benchmark results, the time cost drops by about 75% because the date is no longer cast to timestamp and back. This PR adds a benchmark for these operations, along with the related timestamp +/- interval operations. ### Why are the changes needed? Performance test coverage, since these operations are missing in the DateTimeBenchmark. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Regenerated the benchmark results. Closes #28369 from yaooqinn/SPARK-31527-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 15:39:28 +00:00
Max Gekk	b7cabc80e6	[SPARK-31553][SQL] Revert "[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection" ### What changes were proposed in this pull request? This reverts commit `5631a96367`. Closes #28328 ### Why are the changes needed? The PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default): ```scala val set = (0 to 20).map(_.toString).toSet val data = Seq("1").toDF("x") data.select($"x".isInCollection(set).as("isInCollection")).show() ``` The function must return 'true' because "1" is in the set of "0" ... "20", but it returns "false": ``` +--------------+ \|isInCollection\| +--------------+ \| false\| +--------------+ ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? ``` $ ./build/sbt "test:testOnly *ColumnExpressionSuite" ``` Closes #28388 from MaxGekk/fix-isInCollection-revert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 14:10:50 +00:00
Kent Yao	beec8d535f	[SPARK-31586][SQL] Replace expression TimeSub(l, r) with TimeAdd(l, -r) ### What changes were proposed in this pull request? The implementation of TimeSub, which subtracts an interval from a timestamp, largely duplicates TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent. Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239 Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer needed, so this PR removes them as well, since it already touches that code. ### Why are the changes needed? Remove redundant and unused code for easier maintenance. ### Does this PR introduce any user-facing change? Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)` ### How was this patch tested? Modified existing unit tests. Closes #28381 from yaooqinn/SPARK-31586. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 14:01:07 +00:00
Yuanjian Li	6ed2dfbba1	[SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result ### What changes were proposed in this pull request? Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)). ### Why are the changes needed? The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way as it relies on rule execution order: 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved. 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved. 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns. In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. See the demo below: ``` SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` The query's result is ``` +---+----------+ \| b\| fake\| +---+----------+ \| 2\|2020-01-01\| +---+----------+ ``` But if we add CAST, it will return an empty result. ``` SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` ### Does this PR introduce any user-facing change? Yes, bug fix for cast in having aggregate expressions. ### How was this patch tested? New UT added. Closes #28294 from xuanyuanking/SPARK-31519. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 08:11:41 +00:00
jiake	079b3623c8	[SPARK-31524][SQL] Add metric to the split task number for skew optimization ### What changes were proposed in this pull request? This is a followup of [#28022](https://github.com/apache/spark/pull/28022), to add the metric info of split task number for skewed optimization. With this PR, we can see the number of splits for the skewed partitions as follows: ![image](https://user-images.githubusercontent.com/11972570/80294583-ff886c00-879c-11ea-813c-2db302f99f04.png) ### Why are the changes needed? Make the UI more friendly. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #28109 from JkSelf/addSplitNumer. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-28 07:21:00 +00:00
Dongjoon Hyun	79eaaaf6da	[SPARK-31580][BUILD] Upgrade Apache ORC to 1.5.10 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.5.10. ### Why are the changes needed? Apache ORC 1.5.10 is a maintenance release with the following patches. - [ORC-621](https://issues.apache.org/jira/browse/ORC-621) Need reader fix for ORC-569 - [ORC-616](https://issues.apache.org/jira/browse/ORC-616) In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte - [ORC-613](https://issues.apache.org/jira/browse/ORC-613) OrcMapredRecordReader mis-reuse struct object when actual children schema differs - [ORC-610](https://issues.apache.org/jira/browse/ORC-610) Updated Copyright year in the NOTICE file The following is release note. - https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12346912 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing ORC tests and a newly added test case. - The first commit is already tested in `hive-2.3` profile with both native ORC implementation and Hive 2.3 ORC implementation. (https://github.com/apache/spark/pull/28373#issuecomment-620265114) - The latest run is about to make the test case disable in `hive-1.2` profile which doesn't use Apache ORC. - `hive-1.2`: https://github.com/apache/spark/pull/28373#issuecomment-620325906 Closes #28373 from dongjoon-hyun/SPARK-ORC-1.5.10. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-27 18:56:30 -07:00
Wenchen Fan	2f4f38b6f1	[SPARK-31577][SQL] Fix case-sensitivity and forward name conflict problems when check name conflicts of CTE relations ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable, by adding some comments to explain the trick and simplifying the code to use a boolean flag instead of 2 string sets. This PR also fixes various problems: 1. the name check should consider case sensitivity 2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` are not real conflicts and we shouldn't fail. ### Why are the changes needed? Correct the behavior. ### Does this PR introduce any user-facing change? Yes, it fixes the aforementioned behaviors. ### How was this patch tested? New tests. Closes #28371 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-27 16:47:39 -07:00
Yuanjian Li	ba7adc4949	[SPARK-27340][SS] Alias on TimeWindow expression cause watermark metadata lost Credit to LiangchangZ, this PR reuses the UT as well as integrate test in #24457. Thanks Liangchang for your solid work. ### What changes were proposed in this pull request? Make metadata propagatable between Aliases. ### Why are the changes needed? In Structured Streaming, we added an Alias for TimeWindow by default. `590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L3272-L3273)` For some cases like stream join with watermark and window, users need to add an alias for convenience(we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata `590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/Column.scala (L1049-L1054)` and finally cause the AnalysisException: ``` Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition ``` ### Does this PR introduce any user-facing change? Bugfix for an alias on time window with watermark. ### How was this patch tested? New UTs added. One for the functionality and one for explaining the common scenario. Closes #28326 from xuanyuanking/SPARK-27340. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-27 15:07:52 -07:00
Kousuke Saruta	7d4d05c684	[SPARK-31565][WEBUI][FOLLOWUP] Add font color setting of DAG-viz for query plan ### What changes were proposed in this pull request? This PR adds a font color setting of DAG-viz for query plan. ### Why are the changes needed? #28352 aimed to unify the font color of all DAG-viz in WebUI but one part was left over. Before this change, a query plan looks as follows. <img width="456" alt="plan-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80325600-ca4d4e00-8870-11ea-945c-64971dbb752c.png"> The color of `WholeStageCodegen (1)` and its following text (`duration: total...`) is slightly darker than `SerializeFromObject`. After this change, those colors are unified as `#333333`. <img width="450" alt="plan-graph-fixed2" src="https://user-images.githubusercontent.com/4736016/80325651-fb2d8300-8870-11ea-8ed8-178c124d224c.png"> ### Does this PR introduce any user-facing change? Slightly yes. ### How was this patch tested? I confirmed that the styles `fill` and `color` are `#333333` using the debug console of Chrome. <img width="321" alt="fill" src="https://user-images.githubusercontent.com/4736016/80325760-6c6d3600-8871-11ea-82e7-e789bf741f2a.png"> <img width="316" alt="color" src="https://user-images.githubusercontent.com/4736016/80325765-70995380-8871-11ea-8976-7020205d585c.png"> Closes #28355 from sarutak/followup-SPARK-31565. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-27 13:34:43 -07:00
Kent Yao	5ba467ca1d	[SPARK-31550][SQL][DOCS] Set nondeterministic configurations with general meanings in sql configuration doc ### What changes were proposed in this pull request? ```scala spark.sql.session.timeZone spark.sql.warehouse.dir ``` these 2 configs are nondeterministic and vary with environments Besides, reflect code in `gen-sql-config-docs.py` via https://github.com/apache/spark/pull/28274#discussion_r412893096 and `configuration.md` via https://github.com/apache/spark/pull/28274#discussion_r412894905 ### Why are the changes needed? doc fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? verify locally ![image](https://user-images.githubusercontent.com/8326978/80179099-5e7da200-8632-11ea-803f-d47a93151869.png) Closes #28322 from yaooqinn/SPARK-31550. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-27 17:08:52 +09:00
yi.wu	7df658414b	[SPARK-31529][SQL] Remove extra whitespaces in formatted explain ### What changes were proposed in this pull request? Remove all the extra whitespaces in the formatted explain. ### Why are the changes needed? The number of extra whitespaces in the formatted explain differs between master and branch-3.0. As a result, whenever we backport formatted-explain-related tests from master to branch-3.0, they fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove as many of them as we can. ### Does this PR introduce any user-facing change? No, formatted explain is newly added in Spark 3.0. ### How was this patch tested? Updated sql query tests. Closes #28315 from Ngone51/fix_extra_spaces. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-27 07:39:24 +00:00
Kent Yao	ebc8fa50d0	[SPARK-31527][SQL] date add/subtract interval only allow those day precision in ansi mode ### What changes were proposed in this pull request? To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0. ### Why are the changes needed? Better ANSI compliance ### Does this PR introduce any user-facing change? No, this PR should target 3.0.0 in which this feature is newly added. ### How was this patch tested? Added more unit tests. Closes #28310 from yaooqinn/SPARK-31527. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-27 05:28:46 +00:00
Bruce Robbins	a911287244	[SPARK-31557][SQL] Legacy time parser should return Gregorian days rather than Julian days ### What changes were proposed in this pull request? This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days. ### Why are the changes needed? The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual testing and modified existing unit tests. Closes #28345 from bersprockets/SPARK-31557. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-27 05:00:36 +00:00
Juliusz Sompolski	560bd5401f	[SPARK-31388][SQL][TESTS] org.apache.spark.sql.hive.thriftserver.CliSuite doesn't match results correctly ### What changes were proposed in this pull request? `CliSuite.runCliWithin` was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr. On the example of `Single command with --database` test: In ``` runCliWithin(2.minute)( "CREATE DATABASE hive_db_test;" -> "", "USE hive_test;" -> "", "CREATE TABLE hive_test(key INT, val STRING);" -> "", "SHOW TABLES;" -> "hive_test" ) ``` It was looking for lines containing "", "", "" and then "hive_test". However, the string "hive_test" was contained in "hive_test_db", and hence: ``` 2020-04-08 17:53:12,752 INFO CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172 2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 0: "" 2020-04-08 17:53:12,765 INFO CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test; 2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 1: "" 2020-04-08 17:53:17,688 INFO CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted 2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 2: "" 2020-04-08 17:53:18,069 INFO CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds 2020-04-08 17:53:18,087 INFO CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test; 2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 3: "hive_test" 2020-04-08 17:53:21,742 INFO CliSuite - Found all expected output. ``` Because of that, it could kill the CLI process without really even creating the table. This was not expected. 
The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table. I make the output checking more robust to not match on unexpected output, by making it check the echo of query output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly. ### Why are the changes needed? org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before. ``` [info] CliSuite: [info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds) [info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds) [info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds) [info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds) [info] - Simple commands (16 seconds, 238 milliseconds) [info] - Single command with -e (9 seconds, 967 milliseconds) [info] - Single command with --database (21 seconds, 205 milliseconds) [info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds) [info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds) [info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds) [info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds) [info] - list jars (9 seconds, 520 milliseconds) [info] - list jar <jarfile> (9 seconds, 277 milliseconds) [info] - list files (9 seconds, 828 milliseconds) [info] - list file <filepath> (9 seconds, 646 milliseconds) [info] - apply hiveconf from cli command (9 seconds, 469 milliseconds) [info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds) [info] - 
SPARK-28840 test --jars command (10 seconds, 921 milliseconds) [info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds) [info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds) [info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds) [info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds) [info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds) [info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds) ``` Closes #28156 from juliuszsompolski/SPARK-31388. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-27 04:27:55 +00:00
Kousuke Saruta	91ec2eacfa	[SPARK-31565][WEBUI] Unify the font color of label among all DAG-viz ### What changes were proposed in this pull request? This PR unifies the font color of label as `#333333` among all DAG-viz. ### Why are the changes needed? For the consistent appearance among all DAG-viz. There are three types of DAG-viz in the WebUI. One is for stages, another one is for RDDs and the last one is for query plans. But the font color of labels are slightly different among them. For stages, the color is `#333333` (simply 333) which is specified by `spark-dag-viz.css`. <img width="355" alt="job-graph" src="https://user-images.githubusercontent.com/4736016/80321397-b517f580-8857-11ea-8c8e-cf68f648ab05.png"> <img width="310" alt="job-graph-color" src="https://user-images.githubusercontent.com/4736016/80321399-ba754000-8857-11ea-8708-83bdef4bc1d1.png"> For RDDs, the color is `#212529` which is specified by `bootstrap.min.js`. <img width="386" alt="stage-graph" src="https://user-images.githubusercontent.com/4736016/80321438-f0b2bf80-8857-11ea-9c2a-13fa0fd1431c.png"> <img width="313" alt="stage-graph-color" src="https://user-images.githubusercontent.com/4736016/80321444-fa3c2780-8857-11ea-81b2-4f1203d47896.png"> For query plans, the color is `black` which is specified by `spark-sql-viz.css`. <img width="449" alt="plan-graph" src="https://user-images.githubusercontent.com/4736016/80321490-61f27280-8858-11ea-9c3a-2c98d3d4d03b.png"> <img width="316" alt="plan-graph-color" src="https://user-images.githubusercontent.com/4736016/80321496-6ae34400-8858-11ea-8fe8-0d6e4a821608.png"> After the change, the appearance is like as follows (no change for stages). For RDDs. <img width="389" alt="stage-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321613-6b300f00-8859-11ea-912f-d92474aa9f47.png"> For query plans. 
<img width="456" alt="plan-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321638-9a468080-8859-11ea-974c-33c56a8ffe1a.png"> ### Does this PR introduce any user-facing change? Yes. The unified color is slightly lighter than before. ### How was this patch tested? Confirmed that the color code among all DAG-viz is `#333333` using the browser's debug console. Closes #28352 from sarutak/unify-label-color. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-26 16:57:23 -07:00
Max Gekk	bd139bda4a	[SPARK-31489][SQL] Fix pushing down filters with `java.time.LocalDate` values in ORC ### What changes were proposed in this pull request? Convert `java.time.LocalDate` to `java.sql.Date` in pushed down filters to ORC datasource when Java 8 time API enabled. Closes #28272 ### Why are the changes needed? The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`: ``` Wrong value class java.time.LocalDate for DATE.EQUALS leaf java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352) at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229) ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Added tests to `OrcFilterSuite`. Closes #28261 from MaxGekk/orc-date-filter-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-26 15:49:00 -07:00
Peter Toth	4f53bfbbd5	[SPARK-31535][SQL] Fix nested CTE substitution ### What changes were proposed in this pull request? This PR fixes a CTE substitution issue so that the following SQL returns the correct empty result: ``` WITH t(c) AS (SELECT 1) SELECT * FROM t WHERE c IN ( WITH t(c) AS (SELECT 2) SELECT * FROM t ) ``` Before this PR the result was `1`. ### Why are the changes needed? To fix a correctness issue. ### Does this PR introduce any user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new test case. Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-26 15:31:32 -07:00
Takeshi Yamamuro	e01125db0d	[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp ### What changes were proposed in this pull request? This PR intends to add entries for substring, current_date, and current_timestamp in the SQL built-in function documents. Specifically, the entries are as follows; - SELECT current_date; - SELECT current_timestamp; - SELECT substring('abcd' FROM 1); - SELECT substring('abcd' FROM 1 FOR 2); ### Why are the changes needed? To make the SQL (built-in functions) references complete. ### Does this PR introduce any user-facing change? <img width="1040" alt="Screen Shot 2020-04-25 at 16 51 07" src="https://user-images.githubusercontent.com/692303/80274851-6ca5ee00-8718-11ea-9a35-9ae82008cb4b.png"> <img width="974" alt="Screen Shot 2020-04-25 at 17 24 24" src="https://user-images.githubusercontent.com/692303/80275032-a88d8300-8719-11ea-92ec-95b80169ae28.png"> <img width="862" alt="Screen Shot 2020-04-25 at 17 27 48" src="https://user-images.githubusercontent.com/692303/80275114-36696e00-871a-11ea-8e39-02e93eabb92f.png"> ### How was this patch tested? Added test examples. Closes #28342 from maropu/SPARK-31562. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-26 11:46:52 -07:00
Gengliang Wang	f59ebdef5b	[SPARK-31558][UI] Code clean up in spark-sql-viz.js ### What changes were proposed in this pull request? 1. Remove console.log(), which seems unnecessary in releases. 2. Replace double equals with triple equals. 3. Reuse the jQuery selector. ### Why are the changes needed? For better code quality. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests + manual test. Closes #28333 from gengliangwang/removeLog. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-04-25 13:43:52 -07:00
Kent Yao	7959808e96	[SPARK-31564][TESTS] Fix flaky AllExecutionsPageSuite for checking 1970 ### What changes were proposed in this pull request? Fix flakiness by checking `1970/01/01` instead of `1970`. The test was added by SPARK-27125 for 3.0.0. ### Why are the changes needed? The `org.apache.spark.sql.execution.ui.AllExecutionsPageSuite.SPARK-27019:correctly display SQL page when event reordering happens` test is flaky because it only checks that the `html` content does not contain `1970`. I will add a ticket to check and fix that. In the specific failure https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121799/testReport, it failed because the `html` ``` ... <td sorttable_customkey="1587806019707"> ... ``` contained `1970`. ### Does this PR introduce any user-facing change? no ### How was this patch tested? passing jenkins Closes #28344 from yaooqinn/SPARK-31564. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-25 10:27:05 -07:00
Max Gekk	7d8216a664	[SPARK-31563][SQL] Fix failure of InSet.sql for collections of Catalyst's internal types ### What changes were proposed in this pull request? In the PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types, and wrapped by `Literal` to properly form the SQL view of the input collection. ### Why are the changes needed? The changes fix the bug in `InSet.sql`, which makes a wrong assumption about the types of collection elements. See more details in SPARK-31563. ### Does this PR introduce any user-facing change? Most likely not. ### How was this patch tested? Added a test to `ColumnExpressionSuite` Closes #28343 from MaxGekk/fix-InSet-sql. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-25 09:29:51 -07:00
Gengliang Wang	16b961526d	[SPARK-31560][SQL][TESTS] Add V1/V2 tests for TextSuite and WholeTextFileSuite ### What changes were proposed in this pull request? Add V1/V2 tests for TextSuite and WholeTextFileSuite ### Why are the changes needed? These tests have been missing since #24207. We should have them for test coverage. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests. Closes #28335 from gengliangwang/testV2Suite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-24 18:59:15 -07:00
Kent Yao	f92652d0b5	[SPARK-31528][SQL] Remove millennium, century, decade from trunc/date_trunc functions ### What changes were proposed in this pull request? Similar to https://jira.apache.org/jira/browse/SPARK-31507, millennium, century, and decade are not commonly used in most modern platforms. For example, Negative: https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts https://prestodb.io/docs/current/functions/datetime.html#date_trunc https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/ Positive: https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC This PR removes support for these `fmt` values from the trunc and date_trunc functions. ### Why are the changes needed? Clean up uncommon datetime units for easier maintenance; we can add them back if they are found very useful later. ### Does this PR introduce any user-facing change? No, this targets 3.0.0, and these units are newly added in 3.0.0. ### How was this patch tested? Removed and modified existing unit tests. Closes #28313 from yaooqinn/SPARK-31528. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-24 18:28:41 -07:00
Kent Yao	caf3ab8411	[SPARK-31552][SQL] Fix ClassCastException in ScalaReflection arrayClassFor ### What changes were proposed in this pull request? the 2 method `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, the cases in `dataTypeFor` are not fully handled in `arrayClassFor` For example: ```scala scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder() newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T] scala> val decOne = Decimal(1, 38, 18) decOne: org.apache.spark.sql.types.Decimal = 1E-18 scala> val decTwo = Decimal(2, 38, 18) decTwo: org.apache.spark.sql.types.Decimal = 2E-18 scala> val decSpark = Array(decOne, decTwo) decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18) scala> Seq(decSpark).toDF() java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120) at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57) at newArrayEncoder(<console>:57) ... 53 elided scala> ``` In this PR, we add the missing cases to `arrayClassFor` ### Why are the changes needed? bugfix as described above ### Does this PR introduce any user-facing change? no ### How was this patch tested? add a test for array encoders Closes #28324 from yaooqinn/SPARK-31552. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-24 18:04:26 -07:00
Kent Yao	8424f55229	[SPARK-31532][SQL] Builder should not propagate static sql configs to the existing active or default SparkSession ### What changes were proposed in this pull request? SparkSessionBuilder shoud not propagate static sql configurations to the existing active/default SparkSession This seems a long-standing bug. ```scala scala> spark.sql("set spark.sql.warehouse.dir").show +--------------------+--------------------+ \| key\| value\| +--------------------+--------------------+ \|spark.sql.warehou...\|file:/Users/kenty...\| +--------------------+--------------------+ scala> spark.sql("set spark.sql.warehouse.dir=2"); org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir; at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) ... 47 elided scala> import org.apache.spark.sql.SparkSession import org.apache.spark.sql.SparkSession scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get getClass getOrCreate scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect. res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6403d574 scala> spark.sql("set spark.sql.warehouse.dir").show +--------------------+-----+ \| key\|value\| +--------------------+-----+ \|spark.sql.warehou...\| xyz\| +--------------------+-----+ scala> OptionsAttachments ``` ### Why are the changes needed? bugfix as shown in the previous section ### Does this PR introduce any user-facing change? Yes, static SQL configurations with SparkSession.builder.config do not propagate to any existing or new SparkSession instances. ### How was this patch tested? new ut. Closes #28316 from yaooqinn/SPARK-31532. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-04-25 08:53:00 +09:00
Jian Tang	6a576161ae	[SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown ### What changes were proposed in this pull request? This PR aims to add a benchmark suite for nested predicate pushdown with parquet files. Performance comparison: nested predicate pushdown disabled vs enabled, with the following query scenarios: 1. When the predicate is pushed down, the parquet reader is able to filter out all the row groups without loading them. 2. When the predicate is pushed down, the parquet reader loads only one of the row groups. 3. When the predicate is pushed down, the parquet reader can't filter out any row group; this scenario checks whether enabling nested predicate pushdown introduces too much overhead. ### Why are the changes needed? No benchmark exists today for nested fields predicate pushdown performance evaluation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Benchmark runs and reporting result. Closes #28319 from JiJiTang/SPARK-31364. Authored-by: Jian Tang <jian_tang@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-04-24 22:10:58 +00:00
Yuming Wang	b10263b8e5	[SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators ### What changes were proposed in this pull request? `LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when matching a text field against a number of patterns. For example: Teradata / Hive 3.0 / Snowflake: ```sql --like any select 'foo' LIKE ANY ('%foo%','%bar%'); --like all select 'foo' LIKE ALL ('%foo%','%bar%'); ``` PostgreSQL: ```sql -- like any select 'foo' LIKE ANY (array['%foo%','%bar%']); -- like all select 'foo' LIKE ALL (array['%foo%','%bar%']); ``` This PR adds support for these two operators. More details: https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ https://issues.apache.org/jira/browse/HIVE-15229 https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html ### Why are the changes needed? To smoothly migrate SQLs to Spark SQL. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27477 from wangyum/SPARK-30724. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-04-24 22:20:32 +09:00
yi.wu	463c54419b	[SPARK-31010][SQL][DOC][FOLLOW-UP] Improve deprecated warning message for untyped scala udf ### What changes were proposed in this pull request? Give more friendly warning message/migration guide of deprecated scala udf to users. ### Why are the changes needed? User can not distinguish function signature between typed and untyped scala udf. Instead, we shall tell user what to do directly. ### Does this PR introduce any user-facing change? No, it's newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #28311 from Ngone51/update_udf_doc. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-24 19:10:18 +09:00
Jungtaek Lim (HeartSaVioR)	39bc50dbf8	[SPARK-30804][SS] Measure and log elapsed time for "compact" operation in CompactibleFileStreamLog ### What changes were proposed in this pull request? This patch adds some log messages to log elapsed time for "compact" operation in FileStreamSourceLog and FileStreamSinkLog (added in CompactibleFileStreamLog) to help investigating the mysterious latency spike during the batch run. ### Why are the changes needed? Tracking latency is a critical aspect of streaming query. While "compact" operation may bring nontrivial latency (it's even synchronous, adding all the latency to the batch run), it's not measured and end users have to guess. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A for UT. Manual test with streaming query using file source & file sink. > grep "for compact batch" <driver log> ``` ... 20/02/20 19:27:36 WARN FileStreamSinkLog: Compacting took 24473 ms (load: 14185 ms, write: 10288 ms) for compact batch 21359 20/02/20 19:27:39 WARN FileStreamSinkLog: Loaded 1068000 entries (397985432 bytes in memory), and wrote 1068000 entries for compact batch 21359 20/02/20 19:29:52 WARN FileStreamSourceLog: Compacting took 3777 ms (load: 1524 ms, write: 2253 ms) for compact batch 21369 20/02/20 19:29:52 WARN FileStreamSourceLog: Loaded 229477 entries (68970112 bytes in memory), and wrote 229477 entries for compact batch 21369 20/02/20 19:30:17 WARN FileStreamSinkLog: Compacting took 24183 ms (load: 12992 ms, write: 11191 ms) for compact batch 21369 20/02/20 19:30:20 WARN FileStreamSinkLog: Loaded 1068500 entries (398171880 bytes in memory), and wrote 1068500 entries for compact batch 21369 ... ``` ![Screen Shot 2020-02-21 at 12 34 22 PM](https://user-images.githubusercontent.com/1317309/75002142-c6830100-54a6-11ea-8da6-17afb056653b.png) This messages are explaining why the operation duration peaks per every 10 batches which is compact interval. 
Latency from addBatch increases heavily at each peak, which DOES NOT mean it takes more time to write outputs, but we would have no idea about that without such messages. NOTE: The output may be a bit different from the code, as it may be changed a bit during review phase. Closes #27557 from HeartSaVioR/SPARK-30804. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-24 12:34:44 +09:00
Max Gekk	26165427c7	[SPARK-31488][SQL] Support `java.time.LocalDate` in Parquet filter pushdown ### What changes were proposed in this pull request? 1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes. 2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days. 3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`). ### Why are the changes needed? To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test to `ParquetFilterSuite` Closes #28259 from MaxGekk/parquet-filter-java8-date-time. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-24 02:21:53 +00:00
Takeshi Yamamuro	42f496f6ac	[SPARK-31526][SQL][TESTS] Add a new test suite for ExpressionInfo ### What changes were proposed in this pull request? This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows; - Added a new test suite named `ExpressionInfoSuite` - To improve test coverage, added a test for error handling in `ExpressionInfoSuite` - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite` - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite` - Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224) ### Why are the changes needed? To improve test suites/coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #28308 from maropu/SPARK-31526. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-24 11:19:20 +09:00
Kent Yao	8dc2c0247b	[SPARK-31522][SQL] Hive metastore client initialization related configurations should be static ### What changes were proposed in this pull request? HiveClient instance is cross-session, the following configurations which are defined in HiveUtils and used to create it should be considered static: 1. spark.sql.hive.metastore.version - used to determine the hive version in Spark 2. spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client 3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - package names of classes that are shared or separated between SparkContextLoader and hive client class loader Those are used only once when creating the hive metastore client. They should be static in SQLConf for retrieving them correctly. We should avoid them being changed by users with SET/RESET command. Speaking of spark.sql.hive.version - the fake of the spark.sql.hive.metastore.version, it is used by jdbc/thrift client for backward compatibility. ### Why are the changes needed? bugfix, these configurations should not be changed. ### Does this PR introduce any user-facing change? Yes, the following set of configs are not allowed to change. ``` Seq("spark.sql.hive.metastore.version ", "spark.sql.hive.metastore.jars", "spark.sql.hive.metastore.sharedPrefixes", "spark.sql.hive.metastore.barrierPrefixes") ``` ### How was this patch tested? add unit test Closes #28302 from yaooqinn/SPARK-31522. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-23 15:07:44 +00:00
Yuanjian Li	ca90e1932d	[SPARK-31515][SQL] Canonicalize Cast should consider the value of needTimeZone ### What changes were proposed in this pull request? Override the canonicalized fields with respect to the result of `needsTimeZone`. ### Why are the changes needed? The current approach breaks the semantic equality of two cast expressions that are unrelated to datetime types. If we don't need the `timeZone` information when casting the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT added. Closes #28288 from xuanyuanking/SPARK-31515. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-04-23 14:32:10 +09:00
Takeshi Yamamuro	820733aee2	[SPARK-31476][SQL][FOLLOWUP] Add tests for extract('field', source) ### What changes were proposed in this pull request? SPARK-31476 has supported `extract('field', source)` as side-effect, so this PR intends to add some tests for the function in `SQLQueryTestSuite`. ### Why are the changes needed? For better test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #28276 from maropu/SPARK-31476-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-23 04:59:59 +00:00
Kent Yao	3b5792114a	[SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with function name in the note field of expression info ### What changes were proposed in this pull request? \_FUNC\_ has been used in note() of `ExpressionDescription` since https://github.com/apache/spark/pull/28248, and more cases may follow later; we should replace it with the function name for documentation ### Why are the changes needed? doc fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? pass Jenkins, and verify locally with Jekyll serve Closes #28305 from yaooqinn/SPARK-31474-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-23 13:33:04 +09:00
Max Gekk	e7856a7902	[MINOR][SQL] Add comments for filters values and return values of Row.get()/apply() ### What changes were proposed in this pull request? - Document row field values of `DATE` and `TIMESTAMP` type returned by `Row.get()` and `Row.apply`. - Refer to `Row.get()` from the description of filter values ### Why are the changes needed? Reflect current behaviour of Row's method `apply()` and `get()` in comments to inform users about different return types that are depended on the SQL config settings `spark.sql.datetime.java8API.enabled`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run `$ ./dev/scalastyle` Closes #28300 from MaxGekk/doc-filter-date-time. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-23 04:23:33 +00:00
Gabor Somogyi	c619990c1d	[SPARK-31272][SQL] Support DB2 Kerberos login in JDBC connector ### What changes were proposed in this pull request? When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it. This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues. In this PR I've added DB2 support (other supported databases will come in later PRs). What this PR contains: * Added `DB2ConnectionProvider` * Added `DB2ConnectionProviderSuite` * Added `DB2KrbIntegrationSuite` docker integration test * Changed DB2 JDBC driver to use the latest (test scope only) * Changed test table data type to a type which is supported by all the databases * Removed double connection creation on test side * Increased connection timeout in docker tests because DB2 docker takes quite a time to start ### Why are the changes needed? Missing JDBC kerberos support. ### Does this PR introduce any user-facing change? Yes, now user is able to connect to DB2 using kerberos. ### How was this patch tested? * Additional + existing unit tests * Additional + existing integration tests * Test on cluster manually Closes #28215 from gaborgsomogyi/SPARK-31272. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@apache.org>	2020-04-22 17:10:30 -07:00
yi.wu	8fbfdb38c0	[SPARK-31495][SQL] Support formatted explain for AQE ### What changes were proposed in this pull request? To support formatted explain for AQE. ### Why are the changes needed? AQE does not support formatted explain yet. It's good to support it for better user experience, debugging, etc. Before: ``` == Physical Plan == AdaptiveSparkPlan (1) +- * HashAggregate (unknown) +- CustomShuffleReader (unknown) +- ShuffleQueryStage (unknown) +- Exchange (unknown) +- * HashAggregate (unknown) +- * Project (unknown) +- * BroadcastHashJoin Inner BuildRight (unknown) :- * LocalTableScan (unknown) +- BroadcastQueryStage (unknown) +- BroadcastExchange (unknown) +- LocalTableScan (unknown) (1) AdaptiveSparkPlan Output [4]: [k#7, count(v1)#32L, sum(v1)#33L, avg(v2)#34] Arguments: HashAggregate(keys=[k#7], functions=[count(1), sum(cast(v1#8 as bigint)), avg(cast(v2#19 as bigint))]), AdaptiveExecutionContext(org.apache.spark.sql.SparkSession104ab57b), [PlanAdaptiveSubqueries(Map())], false ``` After: ``` == Physical Plan == AdaptiveSparkPlan (14) +- * HashAggregate (13) +- CustomShuffleReader (12) +- ShuffleQueryStage (11) +- Exchange (10) +- * HashAggregate (9) +- * Project (8) +- * BroadcastHashJoin Inner BuildRight (7) :- * Project (2) : +- * LocalTableScan (1) +- BroadcastQueryStage (6) +- BroadcastExchange (5) +- * Project (4) +- * LocalTableScan (3) (1) LocalTableScan [codegen id : 2] Output [2]: [_1#x, _2#x] Arguments: [_1#x, _2#x] (2) Project [codegen id : 2] Output [2]: [_1#x AS k#x, _2#x AS v1#x] Input [2]: [_1#x, _2#x] (3) LocalTableScan [codegen id : 1] Output [2]: [_1#x, _2#x] Arguments: [_1#x, _2#x] (4) Project [codegen id : 1] Output [2]: [_1#x AS k#x, _2#x AS v2#x] Input [2]: [_1#x, _2#x] (5) BroadcastExchange Input [2]: [k#x, v2#x] Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#x] (6) BroadcastQueryStage Output [2]: [k#x, v2#x] Arguments: 0 (7) BroadcastHashJoin [codegen id : 2] Left keys [1]: [k#x] Right keys 
[1]: [k#x] Join condition: None (8) Project [codegen id : 2] Output [3]: [k#x, v1#x, v2#x] Input [4]: [k#x, v1#x, k#x, v2#x] (9) HashAggregate [codegen id : 2] Input [3]: [k#x, v1#x, v2#x] Keys [1]: [k#x] Functions [3]: [partial_count(1), partial_sum(cast(v1#x as bigint)), partial_avg(cast(v2#x as bigint))] Aggregate Attributes [4]: [count#xL, sum#xL, sum#x, count#xL] Results [5]: [k#x, count#xL, sum#xL, sum#x, count#xL] (10) Exchange Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL] Arguments: hashpartitioning(k#x, 5), true, [id=#x] (11) ShuffleQueryStage Output [5]: [sum#xL, k#x, sum#x, count#xL, count#xL] Arguments: 1 (12) CustomShuffleReader Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL] Arguments: coalesced (13) HashAggregate [codegen id : 3] Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL] Keys [1]: [k#x] Functions [3]: [count(1), sum(cast(v1#x as bigint)), avg(cast(v2#x as bigint))] Aggregate Attributes [3]: [count(1)#xL, sum(cast(v1#x as bigint))#xL, avg(cast(v2#x as bigint))#x] Results [4]: [k#x, count(1)#xL AS count(v1)#xL, sum(cast(v1#x as bigint))#xL AS sum(v1)#xL, avg(cast(v2#x as bigint))#x AS avg(v2)#x] (14) AdaptiveSparkPlan Output [4]: [k#x, count(v1)#xL, sum(v1)#xL, avg(v2)#x] Arguments: isFinalPlan=true ``` ### Does this PR introduce any user-facing change? No, this should be new feature along with AQE in Spark 3.0. ### How was this patch tested? Added a query file: `explain-aqe.sql` and a unit test. Closes #28271 from Ngone51/support_formatted_explain_for_aqe. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-22 12:44:06 +00:00
Kent Yao	37d2e037ed	[SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function ### What changes were proposed in this pull request? Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listing below does not support these except PostgreSQL and redshift. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm https://prestodb.io/docs/current/functions/datetime.html https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT This PR removes these extract fields support from extract function for date and timestamp values `isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms `isodow` is PostgreSQL specific but `iso` as a suffix is more commonly used across platforms so, `dow_iso` and `dayofweek_iso` is used to replace it. For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems only have one week-numbering system support and use either full names or abbreviations. Things in spark become a little bit complicated. 1. because of the existence of `isodow`, so we need to add iso-prefix to `dayofweek` to make a pair for it too. [`dayofweek`, `isodayofweek`, `dow` and `isodow`] 2. because there are rare `iso`-prefixed systems and more systems choose `iso`-suffixed way, so we may result in [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`] 3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. 
snowflake, but `dowiso` looks weird and no use cases were found. 4. after a discussion with the community, we agreed that an underscore before `iso` may look much better, because `isodow` is new and there is no standard for `iso` kind of things, so this may be good for us to make it simple and clear for end-users if they are well documented too. Thus, we finally end up with [`dayofweek`, `dow`] for the Non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system ### Why are the changes needed? Remove some nonstandard and uncommon features as we can add them back if necessary ### Does this PR introduce any user-facing change? NO, we should target this to 3.0.0 and these are added during 3.0.0 ### How was this patch tested? Remove unused tests Closes #28284 from yaooqinn/SPARK-31507. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-22 10:24:49 +00:00
Kent Yao	2c2062ea7c	[SPARK-31498][SQL][DOCS] Dump public static sql configurations through doc generation ### What changes were proposed in this pull request? Currently, only the non-static public SQL configurations are dumped to the public doc; we'd better also add the static public ones, as the command `set -v` does. This PR forcibly calls StaticSQLConf to buildStaticConf. ### Why are the changes needed? Fix missing SQL configurations in doc ### Does this PR introduce any user-facing change? NO ### How was this patch tested? add unit test and verify locally to see if public static SQL conf is in `docs/sql-config.html` Closes #28274 from yaooqinn/SPARK-31498. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-22 10:16:39 +00:00
herman	cf6038499d	[SPARK-31511][SQL] Make BytesToBytesMap iterators thread-safe ### What changes were proposed in this pull request? This PR increases the thread safety of the `BytesToBytesMap`: - It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, and this was causing issues when the map was being iterated over in two threads by two different iterators. - Removes the `safeIterator()` function. This is not needed anymore. - Improves the documentation of a couple of methods w.r.t. thread-safety. ### Why are the changes needed? It is unexpected that an iterator shares the object it is returning with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #28286 from hvanhovell/SPARK-31511. Authored-by: herman <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-21 18:17:19 -07:00
Wenchen Fan	b209b5f406	[SPARK-31503][SQL] fix the SQL string of the TRIM functions ### What changes were proposed in this pull request? override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight`, to use the standard SQL syntax. ### Why are the changes needed? The current implementation is wrong. It gives you a SQL string that returns a different result. ### Does this PR introduce any user-facing change? No ### How was this patch tested? new tests Closes #28281 from cloud-fan/sql. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-21 11:22:18 -07:00
Wenchen Fan	a5ebbacf53	[SPARK-31361][SQL] Rebase datetime in parquet/avro according to file metadata ### What changes were proposed in this pull request? This PR adds a new parquet/avro file metadata: `org.apache.spark.legacyDatetime`. It indicates that the file was written with the "rebaseInWrite" config enabled, and Spark needs to rebase when reading it. This lets Spark rebase more smartly: 1. If we don't know which Spark version wrote the file, rebase if the "rebaseInRead" config is true. 2. If the file was written by Spark 2.4 or earlier, then rebase. 3. If the file was written by Spark 3.0 or later, rebase if `org.apache.spark.legacyDatetime` exists in the file metadata. ### Why are the changes needed? It's very easy to have mixed-calendar parquet/avro files: e.g. A user upgrades to Spark 3.0 and writes some parquet files to an existing directory. Then he realizes that the directory contains legacy datetime values before 1582. However, it's too late and he has to find out all the legacy files manually and read them separately. To support mixed-calendar parquet/avro files, we need to decide whether to rebase based on the file metadata. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated test Closes #28137 from cloud-fan/datetime. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-22 00:26:23 +09:00
yi.wu	55b026a783	[SPARK-31504][SQL] Formatted Explain should have determined order of Output fields ### What changes were proposed in this pull request? In `verboseStringWithOperatorId`, use `output` (it's `Seq[Attribute]`) instead of `producedAttributes` (it's `AttributeSet`) to generate `"Output"` for the leaf node, in order to make `"Output"` deterministic. ### Why are the changes needed? Currently, Formatted Explain uses `producedAttributes`, the `AttributeSet`, to generate `"Output"`. As a result, the field order within `"Output"` can differ from time to time. That means the same plan could have different explain outputs. ### Does this PR introduce any user-facing change? Yes, users now see a deterministic field order within formatted explain. ### How was this patch tested? Added a regression test. Closes #28282 from Ngone51/fix_output. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-21 12:33:58 +00:00
Kent Yao	1985437110	[SPARK-31474][SQL] Consistency between dayofweek/dow in extract expression and dayofweek function ### What changes were proposed in this pull request? ```sql spark-sql> SELECT extract(dayofweek from '2009-07-26'); 1 spark-sql> SELECT extract(dow from '2009-07-26'); 0 spark-sql> SELECT extract(isodow from '2009-07-26'); 7 spark-sql> SELECT dayofweek('2009-07-26'); 1 spark-sql> SELECT weekday('2009-07-26'); 6 ``` Currently, there are 4 types of day-of-week range: 1. the function `dayofweek`(2.3.0) and extracting `dayofweek`(2.4.0) range from Sunday(1) to Saturday(7) 2. extracting `dow`(3.0.0) ranges from Sunday(0) to Saturday(6) 3. extracting `isodow`(3.0.0) ranges from Monday(1) to Sunday(7) 4. the function `weekday`(2.4.0) ranges from Monday(0) to Sunday(6) Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings. https://issues.apache.org/jira/browse/SPARK-23903 https://issues.apache.org/jira/browse/SPARK-28623 In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything. We also add more documentation to the extract function to make the extract fields easier to understand. ### Why are the changes needed? To ensure consistency ### Does this PR introduce any user-facing change? yes, doc updated and extract `dow` is the same as `dayofweek` ### How was this patch tested? 1. modified ut 2. local SQL doc verification #### before ![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png) #### after ![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png) Closes #28248 from yaooqinn/SPARK-31474. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-21 11:55:33 +00:00
Maryann Xue	ae29cf24fc	[SPARK-31501][SQL] AQE update UI should not cause deadlock ### What changes were proposed in this pull request? This PR makes sure that AQE does not call update UI if the current execution ID does not match the current query. This PR also includes a minor refactoring that moves `getOrCloneSessionWithAqeOff` from `QueryExecution` to `AdaptiveSparkPlanHelper` since that function is not used by `QueryExecution` any more. ### Why are the changes needed? Without this fix, there could be a potential deadlock. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #28275 from maryannxue/aqe-ui-deadlock. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-21 03:56:42 +00:00
Takeshi Yamamuro	e42dbe7cd4	[SPARK-31429][SQL][DOC] Automatically generates a SQL document for built-in functions ### What changes were proposed in this pull request? This PR adds a Python script that generates a SQL document for built-in functions, and adds that document to the SQL references. ### Why are the changes needed? To make SQL references complete. ### Does this PR introduce any user-facing change? Yes; ![a](https://user-images.githubusercontent.com/692303/79406712-c39e1b80-7fd2-11ea-8b85-9f9cbb6efed3.png) ![b](https://user-images.githubusercontent.com/692303/79320526-eb46a280-7f44-11ea-8639-90b1fb2b8848.png) ![c](https://user-images.githubusercontent.com/692303/79320707-3365c500-7f45-11ea-9984-69ffe800fb87.png) ### How was this patch tested? Manually checked and added tests. Closes #28224 from maropu/SPARK-31429. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-21 10:55:13 +09:00
rishi	4f8b03d336	[SPARK-31389][SQL][TESTS] Add codegen-on test coverage for some tests in SQLMetricsSuite ### What changes were proposed in this pull request? Adding missing unit tests in SQLMetricsSuite to cover the code generated path. Additional tests were added in the following unit tests. Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, SortMergeJoin(left-anti) metrics ### Why are the changes needed? The existing tests in SQLMetricsSuite only cover the interpreted path. It is necessary for the tests to cover the code generated path as well, since the code generated path is often used in production. The PR doesn't change test("Aggregate metrics") and test("ObjectHashAggregate metrics"). The test("Aggregate metrics") tests metrics when a HashAggregate is used. Enabling codegen forces the test to use ObjectHashAggregate rather than the regular HashAggregate. ObjectHashAggregate has a test of its own. Therefore, I feel that enabling codegen for these two tests is not necessary. ### Does this PR introduce any user-facing change? No ### How was this patch tested? I added debug statements in the code to make sure both the code generated and interpreted paths are being exercised. I further used the IntelliJ debugger to ensure that the newly added unit tests are in fact exercising both code generated and interpreted paths. Closes #28173 from sririshindra/SPARK-31389. Authored-by: rishi <spothireddi@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-20 14:41:45 -07:00
Wenchen Fan	69f9ee18b6	[SPARK-31452][SQL] Do not create partition spec for 0-size partitions in AQE ### What changes were proposed in this pull request? This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing. ### Why are the changes needed? launching tasks that do nothing is a waste. ### Does this PR introduce any user-facing change? no ### How was this patch tested? updated tests Closes #28226 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-20 13:50:07 -07:00
Yuming Wang	b11e42663b	[SPARK-31381][SPARK-29245][SQL] Upgrade built-in Hive 2.3.6 to 2.3.7 ### What changes were proposed in this pull request? Hive 2.3.7 fixed these issues: - HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer - HIVE-21980:Parsing time can be high in case of deeply nested subqueries - HIVE-22249: Support Parquet through HCatalog ### Why are the changes needed? Fix CCE during creating HiveMetaStoreClient in JDK11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840) - [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353) - [x] Manual test with remote hive metastore. Hive side: ``` export JAVA_HOME=/usr/lib/jdk1.8.0_221 export PATH=$JAVA_HOME/bin:$PATH cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6 bin/schematool -dbType derby -initSchema --verbose bin/hive --service metastore ``` Spark side: ``` export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 ``` Closes #28148 from wangyum/SPARK-31381. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-20 13:38:24 -07:00

9401 commits