### What changes were proposed in this pull request?
Add pagination support for the Structured Streaming page. Both the `Active Queries` and `Completed Queries` tables now have pagination.
To implement pagination, the pagination framework from #7399 is used.
* Tables are only shown if they contain at least one entry.
### Why are the changes needed?
* This will help users analyse their structured streaming queries in a much better way.
* Other Web UI pages support pagination in their tables, so this makes the web UI more consistent across pages.
* This can prevent potential OOM errors.
### Does this PR introduce _any_ user-facing change?
Yes. Both tables will support pagination.
### How was this patch tested?
Manually. I will add snapshots soon.
Closes#28485 from iRakson/SPARK-31642.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR intends to add trivial tests to check that the issue reported in https://github.com/apache/spark/pull/27024 has already been fixed in master.
Closes#27024
### Why are the changes needed?
For test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28604 from maropu/SPARK-29854.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch adds a new method, `getLatestBatchId()`, to CompactibleFileStreamLog as a complement to `getLatest()`. It returns the latest batch ID without reading the content of the latest batch metadata log file, and is applied to both FileStreamSource and FileStreamSink to avoid unnecessary latency when reading the log file.
### Why are the changes needed?
Once the compacted metadata log file becomes huge, writing outputs for the compact + 1 batch is also affected because the compacted metadata log file is read unnecessarily. This unnecessary latency can simply be avoided.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UT. Also manually tested under query which has huge metadata log on file stream sink:
> before applying the patch
![Screen Shot 2020-02-21 at 4 20 19 PM](https://user-images.githubusercontent.com/1317309/75016223-d3ffb180-54cd-11ea-9063-49405943049d.png)
> after applying the patch
![Screen Shot 2020-02-21 at 4 06 18 PM](https://user-images.githubusercontent.com/1317309/75016220-d235ee00-54cd-11ea-81a7-7c03a43c4db4.png)
Peaks are compact batches - please compare the next batch after compact batches, especially the area of "light brown".
Closes#27664 from HeartSaVioR/SPARK-30915.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
### What changes were proposed in this pull request?
Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS`
A test is added.
Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds)
### Why are the changes needed?
This gives people a convenient way to get timestamps from seconds, milliseconds and microseconds.
### Does this PR introduce _any_ user-facing change?
Yes, people will have the following ways to get timestamp:
```scala
sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false)
```
```
+-------------------------+
|timestamp |
+-------------------------+
|2008-12-25 23:30:00|
|1931-01-07 16:30:00|
+-------------------------+
```
```scala
sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false)
```
```
+-------------------------------+
|timestamp |
+-------------------------------+
|2008-12-25 23:30:00.123|
|1931-01-07 16:29:59.877|
+-------------------------------+
```
```scala
sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false)
```
```
+------------------------------------+
|timestamp |
+------------------------------------+
|2008-12-25 23:30:00.123123|
|1931-01-07 16:29:59.876877|
+------------------------------------+
```
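For comparison, a minimal sketch of how epoch values could be turned into timestamps before these functions existed, assuming a spark-shell session where `sql` is available; `CAST` only covers whole seconds, and sub-second precision needed manual division:
```scala
// CAST interprets an integral value as epoch seconds.
sql("select CAST(1230219000 AS TIMESTAMP) as timestamp").show(false)
// Milliseconds/microseconds required explicit division, which is easy to get wrong.
sql("select CAST(1230219000123 / 1000.0 AS TIMESTAMP) as timestamp").show(false)
```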
### How was this patch tested?
Unit test.
Closes#28534 from TJX2014/master-SPARK-31710.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR allows missing hour fields when parsing date/timestamp strings, with 0 as the default value.
If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as 1970 is not a leap year, which means it would never parse Feb 29. We just pick it for backward compatibility.
### Why are the changes needed?
To keep backward compatibility with Spark 2.4.
### Does this PR introduce _any_ user-facing change?
Yes.
Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
| 1970-01-16 00:00:00|
+------------------------+
scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
| 1970-01-16|
+-------------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
| 2019-01-01 00:40:00|
+----------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
| 2019-01-01 10:10:10|
+----------------------------------------------+
```
in branch 3.0
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
| null|
+--------------------+
scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
| null|
+---------------+
scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
| 2019-01-01 00:00:00|
+------------------------------+
scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
| 2019-01-01 00:00:00|
+------------------------------------------+
```
After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.
### How was this patch tested?
new tests
Closes#28576 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add `withAllParquetReaders` to `ParquetTest`. The function allows running a block of code for all available Parquet readers.
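A minimal sketch of what such a helper can look like, assuming the existing `withSQLConf` test utility and the `spark.sql.parquet.enableVectorizedReader` config; the actual signature in `ParquetTest` may differ:
```scala
// Run the given block once with the vectorized Parquet reader enabled and once
// with the non-vectorized (parquet-mr) reader.
protected def withAllParquetReaders(code: => Unit): Unit = {
  Seq("true", "false").foreach { vectorized =>
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> vectorized) {
      code
    }
  }
}
```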
### Why are the changes needed?
1. It simplifies tests.
2. It allows testing all Parquet readers that could be available in projects based on Apache Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running affected test suites.
Closes#28598 from MaxGekk/add-withAllParquetReaders.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This reverts commit 92877c4ef2.
Closes#28602 from gengliangwang/revertSPARK-31765.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.
### Why are the changes needed?
Recently, a security issue which affects HtmlUnit was reported.
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is only used for tests, so the impact might not be large, but it's better to upgrade it just in case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing testcases.
Closes#28585 from sarutak/upgrade-htmlunit.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Currently, while implementing pagination using the existing pagination framework, a lot of code is copied, as pointed out [here](https://github.com/apache/spark/pull/28485#pullrequestreview-408881656).
I introduced some changes in `PagedTable`, which is the main trait for implementing pagination:
* Added a function for getting the table parameters.
* Added a function for the table header row, so that header rows are consistent across all tables (see the sketch below).
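A rough sketch of the shape of these additions; the method names and signatures below are assumptions for illustration, not the PR's actual API:
```scala
import scala.xml.Node

trait PagedTable[T] {
  // Parse the common pagination parameters (page number, page size, sort column,
  // sort order) from the request query string in one shared place.
  def getTableParameters(
      request: Map[String, String],
      tableTag: String,
      defaultSortColumn: String): (String, Boolean, Int)

  // Render the header row once, so every paginated table looks and behaves the same.
  def headerRow(
      headers: Seq[String],
      sortColumn: String,
      desc: Boolean,
      pageSize: Int): Seq[Node]
}
```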
### Why are the changes needed?
* A lot of code is copied every time pagination is implemented for a table.
* Code readability suffers because a lot of HTML is embedded in the code.
* Paginating other tables will be much easier now.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually. This is mainly refactoring work, no new functionality introduced. Existing test cases should pass.
Closes#28512 from iRakson/refactorPaginationFramework.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This change was made as a result of the conversation on https://issues.apache.org/jira/browse/SPARK-31354 and is intended to continue work from that ticket here.
This change fixes a memory leak where SparkSession listeners are never cleared off of the SparkContext listener bus.
Before running this PR, the following code:
```
SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
```
would result in a SparkContext with the following listeners on the listener bus:
```
[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb, <-First instance
org.apache.spark.sql.SparkSession$$anon$1fadb9a0] <- Second instance
```
After this PR, the execution of the same code above results in SparkContext with the following listeners on the listener bus:
```
[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb] <-One instance
```
## How was this patch tested?
* Unit test included as a part of the PR
Closes#28128 from vinooganesh/vinooganesh/SPARK-27958.
Lead-authored-by: Vinoo Ganesh <vinoo.ganesh@gmail.com>
Co-authored-by: Vinoo Ganesh <vganesh@palantir.com>
Co-authored-by: Vinoo Ganesh <vinoo@safegraph.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
- TimestampFormatter:
- `def format(ts: Timestamp): String`
- `def format(instant: Instant): String`
- DateFormatter:
- `def format(date: Date): String`
- `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.
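A small usage sketch of the new overloads; the factory method and its arguments are assumptions based on the existing formatter API:
```scala
import java.time.{Instant, ZoneId}
import org.apache.spark.sql.catalyst.util.TimestampFormatter

// With the new overload, a java.time.Instant can be formatted directly,
// without first converting it to microseconds and back.
val formatter = TimestampFormatter.getFractionFormatter(ZoneId.systemDefault())
val rendered: String = formatter.format(Instant.now())
```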
### Why are the changes needed?
To avoid the unnecessary overhead of converting Java date-time types to micros/days before formatting. Also, formatters had to convert input micros/days back to Java types to pass instances to the standard library API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.
Closes#28582 from MaxGekk/opt-format-old-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode.
Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`.
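A minimal sketch of what such a private mixin might look like; the method name below is an assumption, not necessarily the exact trait added by this PR:
```scala
import org.apache.spark.sql.connector.write.WriteBuilder

// Private mixin so the builtin v2 streaming sinks can opt in to the Update output
// mode without exposing a public, not-yet-designed API.
private[sql] trait SupportsStreamingUpdate extends WriteBuilder {
  def update(): WriteBuilder
}
```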
### Why are the changes needed?
In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911
It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility.
### Does this PR introduce _any_ user-facing change?
Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4
### How was this patch tested?
existing tests.
Closes#28523 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Eliminate the `UpCast` if its child's data type is already a decimal type.
### Why are the changes needed?
While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect the `Decimal`'s precision and scale, otherwise it will cause precision loss and look weird in some cases, e.g.:
```
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
.write.mode("overwrite")
.parquet(f.getAbsolutePath)
// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
```
```
[info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
```
### Does this PR introduce _any_ user-facing change?
Yes. The cases mentioned above (which caused precision loss) fail before this change but run successfully after it.
### How was this patch tested?
Added tests.
Closes#28572 from Ngone51/fix_encoder.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The QueryPlanningTracker in QueryExecution reports a planning time that also includes the optimization time. This happens because optimizedPlan in QueryExecution is lazy and is only initialized when first called. When df.queryExecution.executedPlan is called, the tracker starts recording the planning time and then calls the optimized plan. This causes the planning time measurement to start before optimization and therefore include the optimization time.
This PR fixes this behavior by introducing a method assertOptimized, similar to assertAnalyzed, that explicitly initializes the optimized plan. This method is called before measuring the time for sparkPlan and executedPlan. We call it before sparkPlan because that also counts as planning time.
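A rough sketch of the idea (simplified, not the exact PR code; the phase-measurement call shown is an assumption):
```scala
// Force the lazy optimizedPlan to materialize before any planning time is measured.
private def assertOptimized(): Unit = optimizedPlan

lazy val sparkPlan: SparkPlan = {
  assertOptimized()  // optimization is no longer attributed to the planning phase
  tracker.measurePhase(QueryPlanningTracker.PLANNING) {
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }
}
```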
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#28543 from dbaliafroozeh/AddAssertOptimized.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
The SQL Rest API exposes query execution metrics as a public API. This PR aims to apply the following improvements to the SQL Rest API by aligning it with the Spark UI.
**Proposed Improvements:**
1- Support physical operations and group metrics per physical operation, aligning with the Spark UI.
2- Support `wholeStageCodegenId` for Physical Operations
3- `nodeId` can be useful for grouping metrics and sorting physical operations (according to execution order) to differentiate same operators (if used multiple times during the same query execution) and their metrics.
4- Filter `empty` metrics by aligning with Spark UI - SQL Tab. Currently, Spark UI does not show empty metrics.
5- Remove line breakers(`\n`) from `metricValue`.
6- `planDescription` can be an `optional` HTTP parameter to avoid network cost for especially complex jobs that create big plans.
7- The `metrics` attribute needs to be exposed at the bottom, after `nodes`. This is especially useful for the user when the `nodes` array is large.
8- `edges` attribute is being exposed to show relationship between `nodes`.
9- Reversing the order of `metricDetails` aims to match the Spark UI by following the physical operators' execution order.
### Why are the changes needed?
The proposed improvements provide more useful (e.g. physical operations and metrics correlation, grouping) and clearer (e.g. filtering blank metrics, removing line breakers) results for the end user.
### Does this PR introduce any user-facing change?
Yes. Please find both current and improved versions of the results as attached for following SQL Rest Endpoint:
```
curl -X GET http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true
```
**Current version:**
https://issues.apache.org/jira/secure/attachment/12999821/current_version.json
**Improved version:**
https://issues.apache.org/jira/secure/attachment/13000621/improved_version.json
### Backward Compatibility
The SQL Rest API is first exposed in `Spark 3.0`, and `3.0.0-preview2` (released on 12/23/19) does not cover this API, so if this PR makes the 3.0 release it will not have any backward compatibility issue.
### How was this patch tested?
1. New Unit tests are added.
2. Also, patch has been tested manually through both **Spark Core** and **History Server** Rest APIs.
Closes#28208 from erenavsarogullari/SPARK-31440.
Authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This patch effectively reverts SPARK-30098 via below changes:
* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
* Removed tests which depend on the config
* Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)
### Why are the changes needed?
SPARK-30098 brought confusion and frustration around the CREATE TABLE DDL, and we agreed the change had a negative effect.
Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.
### Does this PR introduce _any_ user-facing change?
No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior going back to that of Spark 2.4.x, but I believe we don't guarantee compatibility in preview releases.
### How was this patch tested?
Existing UTs.
Closes#28517 from HeartSaVioR/revert-SPARK-30098.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Set default time zone and locale in the default constructor of `SparkFunSuite`:
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US`
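A minimal sketch of the idea, assuming the defaults are set in the constructor body; the exact placement and scalatest base class in `SparkFunSuite` may differ:
```scala
import java.util.{Locale, TimeZone}
import org.scalatest.FunSuite

abstract class SparkFunSuite extends FunSuite {
  // Deterministic defaults so date/time- and locale-sensitive tests behave the same
  // regardless of the machine the tests run on.
  TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
  Locale.setDefault(Locale.US)
}
```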
### Why are the changes needed?
1. To deduplicate code by moving the common time zone and locale settings to one place, `SparkFunSuite`.
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running all affected test suites
Closes#28548 from MaxGekk/timezone-settings-SparkFunSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Resolve the HAVING condition together with expanding the GROUPING SETS/CUBE/ROLLUP expressions in `ResolveGroupingAnalytics`:
  - Change the operation's resolving direction to top-down.
  - Try resolving the condition of the filter as though it is in the aggregate clause, by reusing the function in `ResolveAggregateFunctions`.
  - Push the aggregate expressions into the aggregate which contains the expanded operations.
- Use UnresolvedHaving for all HAVING clauses.
### Why are the changes needed?
Correctness bug fix. See the demo and analysis in SPARK-31663.
### Does this PR introduce _any_ user-facing change?
Yes, correctness bug fix for HAVING with GROUPING SETS.
### How was this patch tested?
New UTs added.
Closes#28501 from xuanyuanking/SPARK-31663.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`.
### Why are the changes needed?
When planning aggregates, the partial aggregate uses the agg funcs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105
For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348
This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy.
Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`.
### Does this PR introduce _any_ user-facing change?
Yes. Users who previously hit an error can now run the query successfully after this change.
### How was this patch tested?
Added a regression test.
Closes#28496 from Ngone51/spark-31620.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Pass Hadoop confs specified via Spark confs to URLStreamHandlerFactory.
### Why are the changes needed?
**BEFORE**
```
➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem
scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5793cd84
scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream22846025
scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._
scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem5a930c03
```
**AFTER**
```
➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem
scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5c24a636
scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream2ba8f528
scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._
scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS
scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem
```
The type of the FileSystem object created (see the last statement in the above snippets) is different in the two cases, which should not have been the case.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested locally.
Added Unit test
Closes#28516 from karuppayya/SPARK-31692.
Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
When reading/writing datetime values that before the rebase switch day, from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly do rebase or not.
### Why are the changes needed?
Rebasing and not rebasing have different behaviors, so we should let users decide explicitly. In most cases, users won't hit this exception, as it only affects ancient datetime values.
### Does this PR introduce _any_ user-facing change?
Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message to ask setting a config.
### How was this patch tested?
updated tests
Closes#28477 from cloud-fan/rebase.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Some expression aliases are not displayed correctly in the schema. This PR fixes them:
- `TimeWindow`
- `MaxBy`
- `MinBy`
- `UnaryMinus`
- `BitwiseCount`
This PR also fixes a typo issue, please look at b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)
Note:
1. `MaxBy` and `MinBy` extend `MaxMinBy`, and the latter adds a method `funcName` that is not needed. We can reuse `prettyName` to replace `funcName`.
2. Some Spark SQL functions have inelegant implementations. For example, `BitwiseCount` overrides the `sql` method as shown below:
`override def sql: String = s"bit_count(${child.sql})"`
I don't think it's elegant enough, because `Expression` gives the following definition:
```
def sql: String = {
val childrenSQL = children.map(_.sql).mkString(", ")
s"$prettyName($childrenSQL)"
}
```
By this definition, `BitwiseCount` should override the `prettyName` method instead.
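Given that definition, a sketch of the more idiomatic fix would simply be:
```scala
// Let Expression.sql build "bit_count(child)" from prettyName instead of
// hand-rolling the SQL string.
override def prettyName: String = "bit_count"
```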
### Why are the changes needed?
Improve the implementation of some expressions.
### Does this PR introduce any user-facing change?
'Yes'. This PR lets users see the correct aliases in the schema.
### How was this patch tested?
Jenkins test.
Closes#28164 from beliefer/elegant-pretty-name-for-function.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There is a race in the secure JDBC connection providers: multiple providers can read and/or write the exact same JVM security configuration at the same time. In this PR I've synchronised them on a shared object. Since reading and writing the configuration takes a couple of milliseconds, it won't cause performance degradation.
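A minimal sketch of the approach; the object and method names below are assumptions for illustration:
```scala
// A single JVM-wide lock that all secure JDBC connection providers synchronize on
// before touching the global JVM security (JAAS) configuration.
object SecurityConfigurationLock

def setAuthenticationConfigIfNeeded(): Unit = SecurityConfigurationLock.synchronized {
  // Read and, if needed, overwrite the JVM-global security configuration here;
  // the lock prevents two providers from interleaving their read-modify-write.
}
```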
### Why are the changes needed?
There is a race in secure JDBC connection providers.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit + integration tests.
Closes#28368 from gaborgsomogyi/SPARK-31575.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `TimestampType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianMicros()`.
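Conceptually, the change rebases each dictionary-decoded value before it is written into the column vector. The actual reader is Java; the snippet below is only an illustrative sketch with assumed variable names:
```scala
// For each row i whose value comes from the dictionary:
val julianMicros = dictionary.decodeToLong(dictionaryIds.getDictId(i))
// Reinterpret the microseconds from the hybrid Julian calendar into the
// Proleptic Gregorian calendar used by Spark 3.0.
val gregorianMicros = RebaseDateTime.rebaseJulianToGregorianMicros(julianMicros)
column.putLong(i, gregorianMicros)
```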
### Why are the changes needed?
This fixes a bug in loading timestamps before the cutover day from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala>
Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
.select($"tsS".cast("timestamp").as("ts")).repartition(1)
.write
.option("parquet.enable.dictionary", true)
.parquet(path)
```
Load the dates back:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```
Expected values **must be 1001-01-01 01:02:03.123** but not 1001-01-07 00:32:20.123.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```
### How was this patch tested?
Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to check reading dictionary-encoded timestamps.
Closes#28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.
### Why are the changes needed?
This fixes a bug in loading dates before the cutover day from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
.select($"dateS".cast("date").as("date")).repartition(1)
.write
.option("parquet.enable.dictionary", true)
.parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
Expected values **must be 1001-01-01** but not 1001-01-07.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```
### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading dictionary-encoded dates.
Closes#28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Show write commands on SQL UI of an AQE plan
### Why are the changes needed?
Currently the leaf node of an AQE plan is always an `AdaptiveSparkPlan`, which is not true when it is the child of a write command. Hence, the node of the write command, as well as its metrics, are not shown on the SQL UI.
#### Before
![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png)
#### After
![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add UT.
Closes#28474 from manuzhang/aqe-ui.
Lead-authored-by: manuzhang <owenzhang1990@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
In PR, I propose to modify two tests of `ParquetIOSuite`:
- SPARK-31159: rebasing timestamps in write
- SPARK-31159: rebasing dates in write
to check non-vectorized Parquet reader together with vectorized reader.
### Why are the changes needed?
To improve test coverage and make sure that non-vectorized reader behaves similar to the vectorized reader.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `ParquetIOSuite`:
```
$ ./build/sbt "test:testOnly *ParquetIOSuite"
```
Closes#28466 from MaxGekk/test-novec-rebase-ParquetIOSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The RuntimeReplaceable expressions are replaced at runtime, thus their original parameters are not going to be resolved to PrettyAttribute and will remain debug-style strings if we directly implement their `sql` methods with their parameters' `sql` methods.
This PR is raised with suggestions by maropu and cloud-fan in https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable expressions with toPrettySQL.
### Why are the changes needed?
Consistency of schema output between RuntimeReplaceable expressions and normal ones.
For example, `date_format` vs `to_timestamp`, before this PR, they output differently
#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>
select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>
select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")
struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>
```
### Does this PR introduce _any_ user-facing change?
Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example
### How was this patch tested?
regenerate all related tests
Closes#28420 from yaooqinn/SPARK-31615.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ ` by `LEGACY_PARQUET_REBASE_DATETIME_IN_READ ` in `ParquetIOSuite`.
### Why are the changes needed?
Avro config is not relevant to the parquet tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running `ParquetIOSuite` via
```
./build/sbt "test:testOnly *ParquetIOSuite"
```
Closes#28461 from MaxGekk/fix-conf-in-ParquetIOSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages.
### Why are the changes needed?
If there are subqueries that do not belong to any query stage of the main query, the main query could get its final physical plan and update the UI before those subqueries finish. As a result, the UI cannot reflect the changes from the subqueries, e.g. new nodes generated by subqueries.
Before:
<img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png">
After:
<img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png">
### Does this PR introduce _any_ user-facing change?
No(AQE feature hasn't been released).
### How was this patch tested?
Tested manually.
Closes#28460 from Ngone51/fix_aqe_ui.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.
### Why are the changes needed?
We added the nested predicate pushdown feature, which is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all-or-nothing and applies to all data sources.
In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added/Modified unit tests.
Closes#28366 from viirya/SPARK-31365.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Metadata-only queries should not include subquery in partition filters.
### Why are the changes needed?
Applying the `OptimizeMetadataOnlyQuery` rule again will throw the exception `Cannot evaluate expression: scalar-subquery`.
### Does this PR introduce any user-facing change?
Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters.
### How was this patch tested?
add UT
Closes#28383 from cxzl25/fix_SPARK-31590.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.
### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-01|
+----------+
```
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-11|
+----------+
```
After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
| d|
+----------+
|1582-01-01|
+----------+
```
### How was this patch tested?
Add a test to `JsonSuite`.
Closes#28453 from MaxGekk/json-rebase-legacy-days.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Skip timestamp rebasing after a global threshold when there is no difference between the Julian and Gregorian calendars. This avoids checking hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.
### Why are the changes needed?
The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`.
Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp 376 388 10 13.3 75.2 1.1X
Collect java.sql.Timestamp 1878 1937 64 2.7 375.6 0.2X
```
After:
```
================================================================================================
Conversion from/to external types
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp 249 264 24 20.1 49.8 1.7X
Collect java.sql.Timestamp 1503 1523 24 3.3 300.5 0.3X
```
Perf improvements on average:
1. From java.sql.Timestamp: ~34%
2. To java.sql.Timestamp: ~16%
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.
Closes#28441 from MaxGekk/opt-rebase-common-threshold.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
If we add a UT in hive/SQLQuerySuite or other SQL test suites and use a table named `test`, we may hit a TableAlreadyExistsException:
```
org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'
```
The reason is that some tests do not clean up their tables/views.
In this PR, I wrap these tests with `withTempView`.
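For reference, a usage sketch of the `withTempView` test helper (illustrative only):
```scala
// The helper drops the named temporary view after the body finishes, so later
// tests can safely reuse the same name.
withTempView("test") {
  sql("CREATE TEMPORARY VIEW test AS SELECT 1 AS a")
  checkAnswer(sql("SELECT a FROM test"), Row(1))
}
```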
### Why are the changes needed?
To fix the TableAlreadyExistsException issue when adding a UT that uses a table named `test` (or similar) in SQL test suites such as hive/SQLQuerySuite.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#28239 from turboFei/SPARK-31467.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of INT96 and TIMESTAMP_MICROS column types. Here are benchmark results for reading timestamps after 1582 year with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS, and rebasing on for INT96):
timestamp type | vectorized off (ns/row) | vectorized on (ns/row)
--|--|--
TIMESTAMP_MICROS| 160.1 | 50.2
INT96 | 215.6 | 117.8
TIMESTAMP_MILLIS | 159.9 | 60.6
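For reference, the Parquet timestamp type under test is controlled by an existing SQL config; a small usage sketch:
```scala
// Choose how timestamps are physically stored when writing Parquet files:
// "INT96", "TIMESTAMP_MILLIS" or "TIMESTAMP_MICROS" (the current default).
spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
```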
### Why are the changes needed?
To compare the default timestamp type `TIMESTAMP_MICROS` with other types, in case a user decides to switch to them.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the benchmarks via:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 |
Closes#28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/28194.
As discussed at https://github.com/apache/spark/pull/28194/files#r418418796.
This PR improves `ExpressionsSchemaSuite` so that it is easy to track the diff.
Although `ExpressionsSchemaSuite`, at line
b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L165)
only wants to compare the total size of the expected output with that of the newest output, the scalatest framework outputs extra information containing the entire content of both outputs.
This PR tries to avoid this issue.
After this PR, the exception looks like below:
```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info] 340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info] at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```
### Why are the changes needed?
Make the exception more concise and clear.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28430 from beliefer/improve-expressions-schema-suite.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.
### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes#28450 from MaxGekk/parquet-int96.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties. If they cannot be set or controlled by the user, then they shouldn't be displayed.
### Why are the changes needed?
The command currently shows incorrect table properties, i.e. only what exists in the Hive MetaStore, for V2 tables that may have table properties outside of the MetaStore.
### Does this PR introduce _any_ user-facing change?
Fixes a bug
### How was this patch tested?
Regression test
Closes#28434 from brkyvz/ddlCommands.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
The Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades the performance a lot, as it breaks vectorization even if the datetime values don't need to be rebased (which is very likely, as dates before 1582 are rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
The date type is 30% faster if the values don't need to rebase, and 20% faster if they do.
The timestamp type is 60% faster if the values don't need to rebase, with no difference if they do.
Closes#28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Allow customized timeouts for `runSparkSubmit`, which will make flaky tests more likely to pass by using a larger timeout value.
I was able to reproduce the test failure on my laptop, which took 1.5 - 2 minutes to finish the test. After increasing the timeout, the test now can pass locally.
### Why are the changes needed?
This allows slow tests to use a larger timeout, so they are more likely to succeed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The test was able to pass on my local env after the change.
Closes#28438 from tianshizz/SPARK-31267.
Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Changed the number of rows in benchmark cases from 3 to the actual number `N`.
- Regenerated benchmark results in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
### Why are the changes needed?
The changes are needed to have:
- Correct benchmark results
- Base line for other perf improvements that can be checked in the same environment.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the benchmark and checking its output.
Closes#28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Set `spark.ui.enabled` to `false` in `SqlBasedBenchmark.getSparkSession`. This disables UI in all SQL benchmarks by default.
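A minimal sketch of the change; the builder details are assumptions, only the `spark.ui.enabled` setting is the point:
```scala
override def getSparkSession: SparkSession = {
  SparkSession.builder()
    .master("local[1]")
    .appName(this.getClass.getCanonicalName)
    // Disable the UI so its overhead does not skew the Relative and Stdev columns.
    .config("spark.ui.enabled", "false")
    .getOrCreate()
}
```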
### Why are the changes needed?
UI overhead lowers numbers in the `Relative` column and impacts on `Stdev` in benchmark results.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Checked by running `DateTimeRebaseBenchmark`.
Closes#28432 from MaxGekk/ui-off-in-benchmarks.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType this is not the case.
Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
collect_set('value).over(win) as "stringSet",
collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```
We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and equality doesn't compare their contents:
Array(1, 2, 3) == Array(1, 2, 3) => false
As a result, duplicates are not removed from the HashSet.
The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transforming them back to the original type.
This transformation is only applied when the type is BinaryType; see the sketch below.
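A small sketch of the underlying issue and of the dedup idea; the exact conversion used in the PR may differ:
```scala
// Java array equality is reference equality, so equal contents still look distinct:
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)
assert(a != b)             // same contents, but the HashSet would keep both
assert(a.toSeq == b.toSeq) // wrapping the arrays compares contents

// Dedup idea: convert to a comparable wrapper, take distinct values, convert back.
val buffer = Seq(a, b)
val deduped: Seq[Array[Byte]] = buffer.map(_.toSeq).distinct.map(_.toArray)
assert(deduped.size == 1)
```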
### Why are the changes needed?
Fix the bug explained
### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.
### How was this patch tested?
Unit testing
Closes#28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add test cases covering all interval units: MICROSECOND MILLISECOND SECOND MINUTE HOUR DAY WEEK MONTH YEAR
### Why are the changes needed?
For test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Test only.
Closes#28418 from xuanyuanking/SPARK-28424.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Rephrase the API doc for `Column.as`
- Simplify the UTs
### Why are the changes needed?
Address comments in https://github.com/apache/spark/pull/28326
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New UT added.
Closes#28390 from xuanyuanking/SPARK-27340-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use this improvement.
We need stronger guarantees so that the aliases output by built-in functions are correct.
This PR extracts the SQL from the example of expressions, and output the SQL and its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.
### Why are the changes needed?
Ensure that the output alias is correct
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28194 from beliefer/check-expression-schema.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Because of ebc8fa50d0 and beec8d535f, the SQL output strings for date/timestamp - interval operations will have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`
This PR restores the behavior by adding a `RuntimeReplaceable` implementation for both of the operations to bring their pretty SQL strings back.
### Why are the changes needed?
restore the SQL string for datetime operations
### Does this PR introduce any user-facing change?
NO, we are restoring here
### How was this patch tested?
added unit tests
Closes#28402 from yaooqinn/SPARK-31586-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.
### Why are the changes needed?
To prevent regressions like the one introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`
Closes#28405 from MaxGekk/test-isInCollection.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.
Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
| 0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
| rand()|
+------------------+
|0.7137935639522275|
+------------------+
// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
| rand(1)|
+------------------+
|0.6363787615254752|
+------------------+
// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```
Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.
### Why are the changes needed?
To make output schema deterministic.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes#28392 from maropu/SPARK-31594.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`.
### Why are the changes needed?
It's a bug.
### Does this PR introduce any user-facing change?
Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key |value |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2 |
|view.query.out.col.0 |c1 |
|view.query.out.numCols |1 |
|p2 |v2 |
|view.catalogAndNamespace.part.0 |spark_catalog|
|p1 |v1 |
|view.catalogAndNamespace.part.1 |default |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```
### How was this patch tested?
Added tests.
Closes#28375 from imback82/show_tblproperties_followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933
I checked with both Presto and PostgreSQL: one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. Both add the exceeded days in the interval's time part to the total days of the operation which extracts the day from interval values.
```sql
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)
Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)
Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto>
```
```sql
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13
```
```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```
In the ANSI standard, the day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.
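With this change, Spark is expected to line up with the Presto/PostgreSQL behavior shown above (illustrative output, mirroring the queries already listed):
```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
14
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
13
```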
### Why are the changes needed?
This satisfies both the ANSI standard and common use cases in modern SQL platforms.
### Does this PR introduce any user-facing change?
No, this is new in 3.0.
### How was this patch tested?
Added more unit tests.
Closes#28396 from yaooqinn/SPARK-31597.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds the `-Phive` profile to the pre-build phase to build the hive module into the dev classpath.
Then it reflects the `HiveUtils` object to dump all configurations in the class.
### Why are the changes needed?
supply SQL configurations from hive module to doc
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Passing Jenkins.
Also verified locally:
![image](https://user-images.githubusercontent.com/8326978/80492333-6fae1200-8996-11ea-99fd-595ee18c67e5.png)
Closes#28394 from yaooqinn/SPARK-31596.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/28310, the operation of date +/- interval(m, d, 0) has been improved a lot.
According to the benchmark results, about 75% of the time cost is reduced because dates are no longer cast to timestamps back and forth.
In this PR, we add a benchmark for these operations, and timestamp +/- interval operations as accessories.
### Why are the changes needed?
Performance test coverage, since these operations are missing in the DateTimeBenchmark.
### Does this PR introduce any user-facing change?
No, just test
### How was this patch tested?
regenerated benchmark results
Closes#28369 from yaooqinn/SPARK-31527-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit 5631a96367.
Closes#28328
### Why are the changes needed?
The PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (by default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
| false|
+--------------+
```
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```
Closes#28388 from MaxGekk/fix-isInCollection-revert.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The implementation of TimeSub for the operation of subtracting an interval from a timestamp is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent.
Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239
Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer useful, so they are removed in this PR since they are touched here anyway.
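Conceptually (a sketch of the idea, not the exact Spark code), the subtraction is expressed as addition of the negated interval:
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd, UnaryMinus}

// Illustration only: timestamp - interval is rewritten as timestamp + (- interval),
// which makes a dedicated TimeSub expression redundant.
def timestampMinusInterval(ts: Expression, interval: Expression): Expression =
  TimeAdd(ts, UnaryMinus(interval))
```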
### Why are the changes needed?
remove redundant and useless code for easy maintenance
### Does this PR introduce any user-facing change?
Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)`
### How was this patch tested?
modified existing unit tests.
Closes#28381 from yaooqinn/SPARK-31586.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
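A conceptual sketch of such a node (field names and shape are my assumption, not the exact Spark code): it stays unresolved so that generic resolution rules leave the HAVING condition alone until the analyzer rewrites it into `Filter(havingCondition, Aggregate(...))`.
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, UnaryNode}

// Illustration only: a dedicated node for HAVING produced by the parser.
case class AggregateWithHaving(
    havingCondition: Expression,
    child: Aggregate) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
  // Kept unresolved so that rules like ResolveReferences do not resolve the
  // HAVING condition against the wrong (grouping) attributes prematurely.
  override lazy val resolved: Boolean = false
}
```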
### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.
It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.
See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
| b| fake|
+---+----------+
| 2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.
### How was this patch tested?
New UT added.
Closes#28294 from xuanyuanking/SPARK-31519.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of [#28022](https://github.com/apache/spark/pull/28022), to add the metric info of split task number for skewed optimization.
With this PR, we can see the number of splits for the skewed partitions as following:
![image](https://user-images.githubusercontent.com/11972570/80294583-ff886c00-879c-11ea-813c-2db302f99f04.png)
### Why are the changes needed?
make UI more friendly
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing ut
Closes#28109 from JkSelf/addSplitNumer.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.
This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` are not real conflicts and we shouldn't fail.
### Why are the changes needed?
correct the behavior
### Does this PR introduce any user-facing change?
Yes, it fixes the aforementioned behaviors.
### How was this patch tested?
new tests
Closes#28371 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Credit to LiangchangZ: this PR reuses the UT as well as the integration test in #24457. Thanks Liangchang for your solid work.
### What changes were proposed in this pull request?
Make metadata propagatable between Aliases.
### Why are the changes needed?
In Structured Streaming, we added an Alias for TimeWindow by default.
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L3272-L3273)
For some cases like a stream-stream join with watermark and window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/Column.scala (L1049-L1054)
and finally cause the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```
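A minimal sketch of the scenario (the sources, column names and join condition here are illustrative, not taken from the PR): aliasing the window column used to drop the event-time watermark metadata, so the outer join below failed with the exception above.
```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Two watermarked streams, each keyed by a 10-second window (illustrative rate sources).
val left = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 seconds")
  .select(window($"timestamp", "10 seconds") as "leftWindow", $"value" as "leftValue")

val right = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 seconds")
  .select(window($"timestamp", "10 seconds") as "rightWindow", $"value" as "rightValue")

// Before this fix, the `as` on the window column lost the watermark metadata,
// so this stream-stream left outer join was rejected by the analyzer.
val joined = left.join(right, expr("leftWindow = rightWindow"), "left_outer")
```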
### Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.
### How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.
Closes#28326 from xuanyuanking/SPARK-27340.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Remove all the extra whitespaces in the formatted explain.
### Why are the changes needed?
The number of extra whitespaces in the formatted explain becomes different between master and branch-3.0. This causes a problem that whenever we backport formatted-explain-related tests from master to branch-3.0, they will fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them as much as possible.
### Does this PR introduce any user-facing change?
No, formatted explain is newly added in Spark 3.0.
### How was this patch tested?
Updated sql query tests.
Closes#28315 from Ngone51/fix_extra_spaces.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0.
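Illustrative expected behavior (my examples, not from the PR): a year-month/day-granularity interval is still accepted, while one carrying a sub-day (microseconds) component is rejected.
```sql
-- Accepted: the interval has no time (microseconds) component
SELECT date'2020-01-01' + INTERVAL 1 MONTH 2 DAY;
-- Rejected after this change: the interval carries a sub-day component
SELECT date'2020-01-01' + INTERVAL 1 HOUR;
```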
### Why are the changes needed?
Better ANSI compliance
### Does this PR introduce any user-facing change?
No, this PR should target 3.0.0 in which this feature is newly added.
### How was this patch tested?
add more unit tests
Closes#28310 from yaooqinn/SPARK-31527.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Convert `java.time.LocalDate` to `java.sql.Date` in pushed down filters to the ORC datasource when the Java 8 time API is enabled.
Closes#28272
### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
Wrong value class java.time.LocalDate for DATE.EQUALS leaf
java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
```
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `OrcFilterSuite`.
Closes#28261 from MaxGekk/orc-date-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
```
Before this PR the result was `1`.
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new test case.
Closes#28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Remove console.log(), which seems unnecessary in the releases.
2. Replace double equals with triple equals.
3. Reuse the jQuery selector.
### Why are the changes needed?
For better code quality.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests + manual test.
Closes#28333 from gengliangwang/removeLog.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Fix flakiness by checking `1970/01/01` instead of `1970`.
The test was added by SPARK-27125 for 3.0.0.
### Why are the changes needed?
The `org.apache.spark.sql.execution.ui.AllExecutionsPageSuite` test `SPARK-27019: correctly display SQL page when event reordering happens` is flaky because it only checks that the `html` content does not contain `1970`. I will add a ticket to check and fix that.
In the specific failure https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121799/testReport, it failed because the `html`
```
...
<td sorttable_customkey="1587806019707">
...
```
contained `1970`.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
passing jenkins
Closes#28344 from yaooqinn/SPARK-31564.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to fix the `InSet.sql` method for the cases when input collection contains values of internal Catalyst's types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types, and wrapped by `Literal` to properly form SQL view of the input collection.
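A hedged sketch of that approach (mine, not the exact Spark code): convert each internal value back to its Scala form and wrap it in a `Literal`, so that `.sql` renders a proper SQL literal instead of a raw internal value.
```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

// Illustration only: render the IN list of an InSet-like expression as SQL.
def setToSQL(hset: Set[Any], elementType: DataType): String =
  hset.toSeq
    .map(v => Literal(CatalystTypeConverters.convertToScala(v, elementType)).sql)
    .mkString("(", ", ", ")")
```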
### Why are the changes needed?
The changes fixed the bug in `InSet.sql` that makes wrong assumption about types of collection elements. See more details in SPARK-31563.
### Does this PR introduce any user-facing change?
Highly likely, not.
### How was this patch tested?
Added a test to `ColumnExpressionSuite`
Closes#28343 from MaxGekk/fix-InSet-sql.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add V1/V2 tests for TextSuite and WholeTextFileSuite
### Why are the changes needed?
This is a missing part since #24207. We should have these tests for test coverage.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests.
Closes#28335 from gengliangwang/testV2Suite.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases in `dataTypeFor` are not fully handled in `arrayClassFor`.
For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18
scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18
scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
at newArrayEncoder(<console>:57)
... 53 elided
scala>
```
In this PR, we add the missing cases to `arrayClassFor`
### Why are the changes needed?
bugfix as described above
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add a test for array encoders
Closes#28324 from yaooqinn/SPARK-31552.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
SparkSessionBuilder should not propagate static SQL configurations to the existing active/default SparkSession.
This seems a long-standing bug.
```scala
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|spark.sql.warehou...|file:/Users/kenty...|
+--------------------+--------------------+
scala> spark.sql("set spark.sql.warehouse.dir=2");
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir;
at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
... 47 elided
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
getClass getOrCreate
scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate
20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession6403d574
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.warehou...| xyz|
+--------------------+-----+
scala>
```
### Why are the changes needed?
bugfix as shown in the previous section
### Does this PR introduce any user-facing change?
Yes, static SQL configurations with SparkSession.builder.config do not propagate to any existing or new SparkSession instances.
### How was this patch tested?
new ut.
Closes#28316 from yaooqinn/SPARK-31532.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to add a benchmark suite for nested predicate pushdown with parquet file:
Performance comparison: Nested predicate pushdown disabled vs enabled, with the following queries scenarios:
1. When predicates are pushed down, the Parquet reader is able to filter out all the row groups without loading them.
2. When predicates are pushed down, the Parquet reader only loads one of the row groups.
3. When predicates are pushed down, the Parquet reader can't filter out any row group, in order to see whether we introduce too much overhead when enabling nested predicate pushdown.
### Why are the changes needed?
No benchmark exists today for nested fields predicate pushdown performance evaluation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Benchmark runs and reporting result.
Closes#28319 from JiJiTang/SPARK-31364.
Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
`LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when matching a text field against a number of patterns. For example:
Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');
--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);
-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```
This PR adds support for these two operators.
More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html
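Illustrative expected behavior in Spark SQL once these operators are supported (my example, matching the semantics above):
```sql
-- LIKE ANY: true if at least one pattern matches
SELECT 'foo' LIKE ANY ('%foo%', '%bar%');   -- true
-- LIKE ALL: true only if every pattern matches
SELECT 'foo' LIKE ALL ('%foo%', '%bar%');   -- false
```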
### Why are the changes needed?
To smoothly migrate SQLs to Spark SQL.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27477 from wangyum/SPARK-30724.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Give users a more friendly warning message/migration guide for the deprecated Scala UDF.
### Why are the changes needed?
Users cannot distinguish the function signatures of typed and untyped Scala UDFs. Instead, we should tell users what to do directly.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins.
Closes#28311 from Ngone51/update_udf_doc.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds some log messages to log elapsed time for "compact" operation in FileStreamSourceLog and FileStreamSinkLog (added in CompactibleFileStreamLog) to help investigating the mysterious latency spike during the batch run.
### Why are the changes needed?
Tracking latency is a critical aspect of streaming query. While "compact" operation may bring nontrivial latency (it's even synchronous, adding all the latency to the batch run), it's not measured and end users have to guess.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A for UT. Manual test with streaming query using file source & file sink.
> grep "for compact batch" <driver log>
```
...
20/02/20 19:27:36 WARN FileStreamSinkLog: Compacting took 24473 ms (load: 14185 ms, write: 10288 ms) for compact batch 21359
20/02/20 19:27:39 WARN FileStreamSinkLog: Loaded 1068000 entries (397985432 bytes in memory), and wrote 1068000 entries for compact batch 21359
20/02/20 19:29:52 WARN FileStreamSourceLog: Compacting took 3777 ms (load: 1524 ms, write: 2253 ms) for compact batch 21369
20/02/20 19:29:52 WARN FileStreamSourceLog: Loaded 229477 entries (68970112 bytes in memory), and wrote 229477 entries for compact batch 21369
20/02/20 19:30:17 WARN FileStreamSinkLog: Compacting took 24183 ms (load: 12992 ms, write: 11191 ms) for compact batch 21369
20/02/20 19:30:20 WARN FileStreamSinkLog: Loaded 1068500 entries (398171880 bytes in memory), and wrote 1068500 entries for compact batch 21369
...
```
![Screen Shot 2020-02-21 at 12 34 22 PM](https://user-images.githubusercontent.com/1317309/75002142-c6830100-54a6-11ea-8da6-17afb056653b.png)
These messages explain why the operation duration peaks every 10 batches, which is the compact interval. Latency from addBatch increases heavily in each peak, which DOES NOT mean it takes more time to write outputs, but we have no way to know that if such messages are not presented.
NOTE: The output may be a bit different from the code, as it may be changed a bit during review phase.
Closes#27557 from HeartSaVioR/SPARK-30804.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days (see the sketch after this list).
3. Added an implicit conversion from `LocalDate` to `Expression` (`Literal`).
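A hedged sketch of the conversion in item 2 (mine, not the exact Spark code), using the internal `DateTimeUtils` helpers:
```scala
import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Illustration only: accept either date representation when converting to days since epoch.
def dateToDays(date: Any): Int = date match {
  case d: Date       => DateTimeUtils.fromJavaDate(d)
  case ld: LocalDate => DateTimeUtils.localDateToDays(ld)
}
```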
### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test to `ParquetFilterSuite`
Closes#28259 from MaxGekk/parquet-filter-java8-date-time.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;
- Added a new test suite named `ExpressionInfoSuite`
- To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
- Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
- Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
- Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)
### Why are the changes needed?
To improve test suites/coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28308 from maropu/SPARK-31526.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
SPARK-31476 has supported `extract('field', source)` as side-effect, so this PR intends to add some tests for the function in `SQLQueryTestSuite`.
### Why are the changes needed?
For better test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28276 from maropu/SPARK-31476-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.
This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues.
In this PR I've added DB2 support (other supported databases will come in later PRs).
What this PR contains:
* Added `DB2ConnectionProvider`
* Added `DB2ConnectionProviderSuite`
* Added `DB2KrbIntegrationSuite` docker integration test
* Changed DB2 JDBC driver to use the latest (test scope only)
* Changed test table data type to a type which is supported by all the databases
* Removed double connection creation on test side
* Increased connection timeout in docker tests because DB2 docker takes quite a time to start
### Why are the changes needed?
Missing JDBC kerberos support.
### Does this PR introduce any user-facing change?
Yes, now user is able to connect to DB2 using kerberos.
### How was this patch tested?
* Additional + existing unit tests
* Additional + existing integration tests
* Test on cluster manually
Closes#28215 from gaborgsomogyi/SPARK-31272.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
### What changes were proposed in this pull request?
Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listed below do not support these, except PostgreSQL and Redshift.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
https://prestodb.io/docs/current/functions/datetime.html
https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
This PR removes support for these extract fields from the extract function for date and timestamp values.
`isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms.
`isodow` is PostgreSQL specific but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.
For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems only support one week-numbering system and use either full names or abbreviations. Things in Spark become a little bit complicated:
1. Because of the existence of `isodow`, we would need to add an iso-prefix to `dayofweek` to make a pair for it too: [`dayofweek`, `isodayofweek`, `dow` and `isodow`].
2. Because `iso`-prefixed systems are rare and more systems choose the `iso`-suffixed way, we may instead end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`].
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and no use cases were found.
4. After a discussion in the community, we agreed that an underscore before `iso` looks much better, because `isodow` is new and there is no standard for `iso`-style names, so this keeps things simple and clear for end users if they are well documented too.
Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system.
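Illustrative usage (my example; 2009-07-26 is a Sunday, so the non-ISO fields return 1 while the ISO variants return 7):
```sql
spark-sql> SELECT extract(DAYOFWEEK FROM DATE'2009-07-26'), extract(DOW FROM DATE'2009-07-26');
1	1
spark-sql> SELECT extract(DAYOFWEEK_ISO FROM DATE'2009-07-26'), extract(DOW_ISO FROM DATE'2009-07-26');
7	7
```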
### Why are the changes needed?
Remove some nonstandard and uncommon features as we can add them back if necessary
### Does this PR introduce any user-facing change?
No, we should target this to 3.0.0 since these were added during 3.0.0.
### How was this patch tested?
Remove unused tests
Closes#28284 from yaooqinn/SPARK-31507.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, only the non-static public SQL configurations are dumped to the public doc; we'd better also add the static public ones, as the command `set -v` does.
This PR forces a call to StaticSQLConf to `buildStaticConf`.
### Why are the changes needed?
Fix missing SQL configurations in doc
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
add unit test and verify locally to see if public static SQL conf is in `docs/sql-config.html`
Closes#28274 from yaooqinn/SPARK-31498.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, and this was causing issues when the map was being iterated over in two threads by two different iterators.
- Removes the `safeIterator()` function. This is not needed anymore.
- Improves the documentation of a couple of methods w.r.t. thread-safety.
### Why are the changes needed?
It is unexpected that an iterator shares the object it returns with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#28286 from hvanhovell/SPARK-31511.
Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight` to use the standard SQL syntax.
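For reference, a quick illustration of the standard syntax (my example, not taken from the PR) that the generated SQL string should follow:
```sql
SELECT TRIM(BOTH 'x' FROM 'xxSparkxx');      -- trim:  'Spark'
SELECT TRIM(LEADING 'x' FROM 'xxSparkxx');   -- ltrim: 'Sparkxx'
SELECT TRIM(TRAILING 'x' FROM 'xxSparkxx');  -- rtrim: 'xxSpark'
```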
### Why are the changes needed?
The current implementation is wrong. It gives you a SQL string that returns a different result.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
new tests
Closes#28281 from cloud-fan/sql.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds a new parquet/avro file metadata: `org.apache.spark.legacyDatetime`. It indicates that the file was written with the "rebaseInWrite" config enabled, and spark need to do rebase when reading it.
This makes Spark be able to do rebase more smartly:
1. If we don't know which Spark version writes the file, do rebase if the "rebaseInRead" config is true.
2. If the file was written by Spark 2.4 and earlier, then do rebase.
3. If the file was written by Spark 3.0 and later, do rebase if the `org.apache.spark.legacyDatetime` metadata exists in the file (see the sketch below).
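A hedged sketch of that decision logic (the names here are mine, not Spark's API):
```scala
// Illustration only: decide whether to rebase datetime values when reading a file.
// `writerVersion` comes from the file metadata; the version comparison is naively simplified.
def needRebase(
    writerVersion: Option[String],      // e.g. Some("2.4.5"), Some("3.0.0"), or None
    hasLegacyDatetimeMetadata: Boolean, // is org.apache.spark.legacyDatetime present?
    rebaseInRead: Boolean): Boolean = writerVersion match {
  case None                 => rebaseInRead              // unknown writer: follow the config
  case Some(v) if v < "3.0" => true                      // Spark 2.4 and earlier: always rebase
  case _                    => hasLegacyDatetimeMetadata // Spark 3.0+: trust the new metadata
}
```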
### Why are the changes needed?
It's very easy to have mixed-calendar parquet/avro files: e.g. A user upgrades to Spark 3.0 and writes some parquet files to an existing directory. Then he realizes that the directory contains legacy datetime values before 1582. However, it's too late and he has to find out all the legacy files manually and read them separately.
To support mixed-calendar parquet/avro files, we need to decide to rebase or not based on the file metadata.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated test
Closes#28137 from cloud-fan/datetime.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `verboseStringWithOperatorId`, use `output` (it's `Seq[Attribute]`) instead of `producedAttributes` (it's `AttributeSet`) to generates `"Output"` for the leaf node in order to make `"Output"` determined.
### Why are the changes needed?
Currently, Formatted Explain uses `producedAttributes`, an `AttributeSet`, to generate `"Output"`. As a result, the field order within `"Output"` can differ from time to time. That means, for the same plan, it could have different explain outputs.
### Does this PR introduce any user-facing change?
Yes, users now see a deterministic field order within the formatted explain.
### How was this patch tested?
Added a regression test.
Closes#28282 from Ngone51/fix_output.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek`(2.3.0) and extracting `dayofweek`(2.4.0) result as of Sunday(1) to Saturday(7)
2. extracting `dow`(3.0.0) results as of Sunday(0) to Saturday(6)
3. extracting` isodow` (3.0.0) results as of Monday(1) to Sunday(7)
4. the function `weekday`(2.4.0) results as of Monday(0) to Sunday(6)
Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623
In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.
Also, add more documentation to the extract function to make the extract fields clearer to understand.
### Why are the changes needed?
Consistency insurance
### Does this PR introduce any user-facing change?
Yes, the doc is updated and extracting `dow` is now the same as `dayofweek`.
### How was this patch tested?
1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)
Closes#28248 from yaooqinn/SPARK-31474.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR makes sure that AQE does not update the UI if the current execution ID does not match the current query. This PR also includes a minor refactoring that moves `getOrCloneSessionWithAqeOff` from `QueryExecution` to `AdaptiveSparkPlanHelper` since that function is not used by `QueryExecution` any more.
### Why are the changes needed?
Without this fix, there could be a potential deadlock.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#28275 from maryannxue/aqe-ui-deadlock.
Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Adding missing unit tests in SQLMetricSuite to cover the code generated path.
**Additional tests were added in the following unit tests.**
Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, SortMergeJoin(left-anti) metrics
### Why are the changes needed?
The existing tests in SQLMetricSuite only cover the interpreted path.
It is necessary for the tests to cover code generated path as well since CodeGenerated path is often used in production.
The PR doesn't change test("Aggregate metrics") and test("ObjectHashAggregate metrics"). The test("Aggregate metrics") tests metrics when a HashAggregate is used. Enabling codegen forces the test to use ObjectHashAggregate rather than the regular HashAggregate. ObjectHashAggregate has a test of its own. Therefore, I feel enabling codegen for these two tests is not necessary.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
I added debug statements in the code to make sure both Code generated and Interpreted paths are being exercised.
I further used Intellij debugger to ensure that the newly added unit tests are in fact exercising both code generated and interpreted paths.
Closes#28173 from sririshindra/SPARK-31389.
Authored-by: rishi <spothireddi@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.
### Why are the changes needed?
launching tasks that do nothing is a waste.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
updated tests
Closes#28226 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003
- add a migration guide
- add an end-to-end test case.
### Why are the changes needed?
The original PR made the major behavior change in the user-facing RESET command.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new end-to-end test
Closes#28265 from gatorsmile/spark-31234followup.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR adds a timeout for the Future of a BroadcastQueryStageExec to make sure it can have the same timeout behavior as a non-AQE broadcast exchange.
### Why are the changes needed?
This is to make the broadcast timeout behavior in AQE consistent with that in non-AQE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#28250 from maryannxue/aqe-broadcast-timeout.
Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
- Add benchmark cases for **parallelizing** `java.time.LocalDate` and `java.time.Instant` column values.
- Add benchmark cases for **collecting** `java.time.LocalDate` and `java.time.Instant` column values.
### Why are the changes needed?
- To detect perf regression in the future
- To compare parallelization/collection of Java 8 date-time types with Java 7 date-time types `java.sql.Date` & `java.sql.Timestamp`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the modified benchmarks in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |
Closes#28263 from MaxGekk/java8-datetime-collect-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
#26700 removed the ability to drop a row whose nested column value is null.
For example, for the following `df`:
```
val schema = new StructType()
.add("c1", new StructType()
.add("c1-1", StringType)
.add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
| c1|
+--------+
| [, a2]|
|[b1, b2]|
| null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
| [, a2]|
|[b1, b2]|
| null|
+--------+
```
### Why are the changes needed?
This seems like a regression.
### Does this PR introduce any user-facing change?
Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
| c1|
+--------+
|[b1, b2]|
+--------+
```
Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.
### How was this patch tested?
Updated existing tests.
Closes#28266 from imback82/SPARK-31256.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>