ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Jungtaek Lim (HeartSaVioR)	5a258b0b67	[SPARK-30915][SS] CompactibleFileStreamLog: Avoid reading the metadata log file when finding the latest batch ID ### What changes were proposed in this pull request? This patch adds the new method `getLatestBatchId()` in CompactibleFileStreamLog in complement of getLatest() which doesn't read the content of the latest batch metadata log file, and apply to both FileStreamSource and FileStreamSink to avoid unnecessary latency on reading log file. ### Why are the changes needed? Once compacted metadata log file becomes huge, writing outputs for the compact + 1 batch is also affected due to unnecessarily reading the compacted metadata log file. This unnecessary latency can be simply avoided. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT. Also manually tested under query which has huge metadata log on file stream sink: > before applying the patch ![Screen Shot 2020-02-21 at 4 20 19 PM](https://user-images.githubusercontent.com/1317309/75016223-d3ffb180-54cd-11ea-9063-49405943049d.png) > after applying the patch ![Screen Shot 2020-02-21 at 4 06 18 PM](https://user-images.githubusercontent.com/1317309/75016220-d235ee00-54cd-11ea-81a7-7c03a43c4db4.png) Peaks are compact batches - please compare the next batch after compact batches, especially the area of "light brown". Closes #27664 from HeartSaVioR/SPARK-30915. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-05-22 16:46:17 -07:00
TJX2014	2115c55efe	[SPARK-31710][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions ### What changes were proposed in this pull request? Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS` A test is added. Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds) ### Why are the changes needed? People will have convenient way to get timestamps from seconds,milliseconds and microseconds. ### Does this PR introduce _any_ user-facing change? Yes, people will have the following ways to get timestamp: ```scala sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false) ``` ``` +-------------------------+ \|timestamp \| +-------------------------+ \|2008-12-25 23:30:00\| \|1931-01-07 16:30:00\| +-------------------------+ ``` ```scala sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false) ``` ``` +-------------------------------+ \|timestamp \| +-------------------------------+ \|2008-12-25 23:30:00.123\| \|1931-01-07 16:29:59.877\| +-------------------------------+ ``` ```scala sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false) ``` ``` +------------------------------------+ \|timestamp \| +------------------------------------+ \|2008-12-25 23:30:00.123123\| \|1931-01-07 16:29:59.876877\| +------------------------------------+ ``` ### How was this patch tested? Unit test. Closes #28534 from TJX2014/master-SPARK-31710. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-22 14:16:30 +00:00
Wenchen Fan	ce4da29ec3	[SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string ### What changes were proposed in this pull request? This PR allows missing hour fields when parsing date/timestamp string, with 0 as the default value. If the year field is missing, this PR still fail the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as it is not a leap year, which means that it would never parse Feb 29. We just pick it for backward compatibility. ### Why are the changes needed? To keep backward compatibility with Spark 2.4. ### Does this PR introduce _any_ user-facing change? Yes. Spark 2.4: ``` scala> sql("select to_timestamp('16', 'dd')").show +------------------------+ \|to_timestamp('16', 'dd')\| +------------------------+ \| 1970-01-16 00:00:00\| +------------------------+ scala> sql("select to_date('16', 'dd')").show +-------------------+ \|to_date('16', 'dd')\| +-------------------+ \| 1970-01-16\| +-------------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +----------------------------------+ \|to_timestamp('2019 40', 'yyyy mm')\| +----------------------------------+ \| 2019-01-01 00:40:00\| +----------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +----------------------------------------------+ \|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')\| +----------------------------------------------+ \| 2019-01-01 10:10:10\| +----------------------------------------------+ ``` in branch 3.0 ``` scala> sql("select to_timestamp('16', 'dd')").show +--------------------+ \|to_timestamp(16, dd)\| +--------------------+ \| null\| +--------------------+ scala> sql("select to_date('16', 'dd')").show +---------------+ \|to_date(16, dd)\| +---------------+ \| null\| +---------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +------------------------------+ \|to_timestamp(2019 40, yyyy mm)\| +------------------------------+ \| 2019-01-01 00:00:00\| +------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +------------------------------------------+ \|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)\| +------------------------------------------+ \| 2019-01-01 00:00:00\| +------------------------------------------+ ``` After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled. ### How was this patch tested? new tests Closes #28576 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 16:10:08 +09:00
Max Gekk	60118a2426	[SPARK-31785][SQL][TESTS] Add a helper function to test all parquet readers ### What changes were proposed in this pull request? Add `withAllParquetReaders` to `ParquetTest`. The function allow to run a block of code for all available Parquet readers. ### Why are the changes needed? 1. It simplifies tests 2. Allow to test all parquet readers that could be available in projects based on Apache Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running affected test suites. Closes #28598 from MaxGekk/add-withAllParquetReaders. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 09:53:35 +09:00
Gengliang Wang	db5e5fce68	Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0" This reverts commit `92877c4ef2`. Closes #28602 from gengliangwang/revertSPARK-31765. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-21 16:00:58 -07:00
Kousuke Saruta	92877c4ef2	[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0 ### What changes were proposed in this pull request? This PR upgrades HtmlUnit. Selenium and Jetty also upgraded because of dependency. ### Why are the changes needed? Recently, a security issue which affects HtmlUnit is reported. https://nvd.nist.gov/vuln/detail/CVE-2020-5529 According to the report, arbitrary code can be run by malicious users. HtmlUnit is used for test so the impact might not be large but it's better to upgrade it just in case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing testcases. Closes #28585 from sarutak/upgrade-htmlunit. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-21 11:43:25 -07:00
iRakson	f1495c5bc0	[SPARK-31688][WEBUI] Refactor Pagination framework ### What changes were proposed in this pull request? Currently while implementing pagination using the existing pagination framework, a lot of code is being copied as pointed out [here](https://github.com/apache/spark/pull/28485#pullrequestreview-408881656). I introduced some changes in `PagedTable` which is the main trait for implementing the pagination. * Added function for getting table parameters. * Added a function for table header row. This will help in maintaining consistency across the tables. All the header rows across tables will be consistent now. ### Why are the changes needed? * A lot of code is copied every time pagination is implemented for any table. * Code readability is not great as lot of HTML is embedded. * Paginating other tables will be a lot easier now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. This is mainly refactoring work, no new functionality introduced. Existing test cases should pass. Closes #28512 from iRakson/refactorPaginationFramework. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-21 13:00:00 -05:00
Vinoo Ganesh	dae79888dc	[SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener ## What changes were proposed in this pull request? This change was made as a result of the conversation on https://issues.apache.org/jira/browse/SPARK-31354 and is intended to continue work from that ticket here. This change fixes a memory leak where SparkSession listeners are never cleared off of the SparkContext listener bus. Before running this PR, the following code: ``` SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() ``` would result in a SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb, <-First instance org.apache.spark.sql.SparkSession$$anon$1fadb9a0] <- Second instance ``` After this PR, the execution of the same code above results in SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb] <-One instance ``` ## How was this patch tested? * Unit test included as a part of the PR Closes #28128 from vinooganesh/vinooganesh/SPARK-27958. Lead-authored-by: Vinoo Ganesh <vinoo.ganesh@gmail.com> Co-authored-by: Vinoo Ganesh <vganesh@palantir.com> Co-authored-by: Vinoo Ganesh <vinoo@safegraph.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-21 16:06:28 +00:00
Max Gekk	5d673319af	[SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString ### What changes were proposed in this pull request? 1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings: - TimestampFormatter: - `def format(ts: Timestamp): String` - `def format(instant: Instant): String` - DateFormatter: - `def format(date: Date): String` - `def format(localDate: LocalDate): String` 2. Re-use the added methods from `HiveResult.toHiveString` 3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests for toHiveString and new tests in `TimestampFormatterSuite`. Closes #28582 from MaxGekk/opt-format-old-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-21 04:01:19 +00:00
Wenchen Fan	34414acfa3	[SPARK-31706][SQL] add back the support of streaming update mode ### What changes were proposed in this pull request? This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode. Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`. ### Why are the changes needed? In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911 It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility. ### Does this PR introduce _any_ user-facing change? Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4 ### How was this patch tested? existing tests. Closes #28523 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-20 03:45:13 +00:00
yi.wu	0fd98abd85	[SPARK-31750][SQL] Eliminate UpCast if child's dataType is DecimalType ### What changes were proposed in this pull request? Eliminate the `UpCast` if it's child data type is already decimal type. ### Why are the changes needed? While deserializing internal `Decimal` value to external `BigDecimal`(Java/Scala) value, Spark should also respect `Decimal`'s precision and scale, otherwise it will cause precision lost and look weird in some cases, e.g.: ``` sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d") .write.mode("overwrite") .parquet(f.getAbsolutePath) // can fail spark.read.parquet(f.getAbsolutePath).as[BigDecimal] ``` ``` [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18). [info] The type path of the target object is: [info] - root class: "scala.math.BigDecimal" [info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) ``` ### Does this PR introduce _any_ user-facing change? Yes, for cases(cause precision lost) mentioned above will fail before this change but run successfully after this change. ### How was this patch tested? Added tests. Closes #28572 from Ngone51/fix_encoder. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-20 11:00:58 +09:00
Ali Afroozeh	b9cc31cd95	[SPARK-31721][SQL] Assert optimized is initialized before tracking the planning time ### What changes were proposed in this pull request? The QueryPlanningTracker in QueryExeuction reports the planning time that also includes the optimization time. This happens because the optimizedPlan in QueryExecution is lazy and only will initialize when first called. When df.queryExecution.executedPlan is called, the the tracker starts recording the planning time, and then calls the optimized plan. This causes the planning time to start before optimization and also include the planning time. This PR fixes this behavior by introducing a method assertOptimized, similar to assertAnalyzed that explicitly initializes the optimized plan. This method is called before measuring the time for sparkPlan and executedPlan. We call it before sparkPlan because that also counts as planning time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #28543 from dbaliafroozeh/AddAssertOptimized. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-05-19 11:10:49 +02:00
Eren Avsarogullari	ab4cf49a1c	[SPARK-31440][SQL] Improve SQL Rest API ### What changes were proposed in this pull request? SQL Rest API exposes query execution metrics as Public API. This PR aims to apply following improvements on SQL Rest API by aligning Spark-UI. Proposed Improvements: 1- Support Physical Operations and group metrics per physical operation by aligning Spark UI. 2- Support `wholeStageCodegenId` for Physical Operations 3- `nodeId` can be useful for grouping metrics and sorting physical operations (according to execution order) to differentiate same operators (if used multiple times during the same query execution) and their metrics. 4- Filter `empty` metrics by aligning with Spark UI - SQL Tab. Currently, Spark UI does not show empty metrics. 5- Remove line breakers(`\n`) from `metricValue`. 6- `planDescription` can be `optional` Http parameter to avoid network cost where there is specially complex jobs creating big-plans. 7- `metrics` attribute needs to be exposed at the bottom order as `nodes`. Specially, this can be useful for the user where `nodes` array size is high. 8- `edges` attribute is being exposed to show relationship between `nodes`. 9- Reverse order on `metricDetails` aims to match with Spark UI by supporting Physical Operators' execution order. ### Why are the changes needed? Proposed improvements provides more useful (e.g: physical operations and metrics correlation, grouping) and clear (e.g: filtering blank metrics, removing line breakers) result for the end-user. ### Does this PR introduce any user-facing change? Yes. Please find both current and improved versions of the results as attached for following SQL Rest Endpoint: ``` curl -X GET http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true ``` Current version: https://issues.apache.org/jira/secure/attachment/12999821/current_version.json Improved version: https://issues.apache.org/jira/secure/attachment/13000621/improved_version.json ### Backward Compatibility SQL Rest API will be started to expose with `Spark 3.0` and `3.0.0-preview2` (released on 12/23/19) does not cover this API so if PR can catch 3.0 release, this will not have any backward compatibility issue. ### How was this patch tested? 1. New Unit tests are added. 2. Also, patch has been tested manually through both Spark Core and History Server Rest APIs. Closes #28208 from erenavsarogullari/SPARK-31440. Authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-05-18 23:21:32 -07:00
Jungtaek Lim (HeartSaVioR)	d2bec5e265	[SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax ### What changes were proposed in this pull request? This patch effectively reverts SPARK-30098 via below changes: * Removed the config * Removed the changes done in parser rule * Removed the usage of config in tests * Removed tests which depend on the config * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098 * Reflect the change into docs (migration doc, create table syntax) ### Why are the changes needed? SPARK-30098 brought confusion and frustration on using create table DDL query, and we agreed about the bad effect on the change. Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details. ### Does this PR introduce _any_ user-facing change? No, compared to Spark 2.4.x. End users tried to experiment with Spark 3.0.0 previews will see the change that the behavior is going back to Spark 2.4.x, but I believe we won't guarantee compatibility in preview releases. ### How was this patch tested? Existing UTs. Closes #28517 from HeartSaVioR/revert-SPARK-30098. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-17 02:27:23 +00:00
Max Gekk	5539ecfdac	[SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default ### What changes were proposed in this pull request? Set default time zone and locale in the default constructor of `SparkFunSuite`: - Default time zone to `America/Los_Angeles` - Default locale to `Locale.US` ### Why are the changes needed? 1. To deduplicate code by moving common time zone and locale settings to one place SparkFunSuite 2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running all affected test suites Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-17 02:26:00 +00:00
Yuanjian Li	86bd37f37e	[SPARK-31663][SQL] Grouping sets with having clause returns the wrong result ### What changes were proposed in this pull request? - Resolve the havingcondition with expanding the GROUPING SETS/CUBE/ROLLUP expressions together in `ResolveGroupingAnalytics`: - Change the operations resolving directions to top-down. - Try resolving the condition of the filter as though it is in the aggregate clause by reusing the function in `ResolveAggregateFunctions` - Push the aggregate expressions into the aggregate which contains the expanded operations. - Use UnresolvedHaving for all having clause. ### Why are the changes needed? Correctness bug fix. See the demo and analysis in SPARK-31663. ### Does this PR introduce _any_ user-facing change? Yes, correctness bug fix for HAVING with GROUPING SETS. ### How was this patch tested? New UTs added. Closes #28501 from xuanyuanking/SPARK-31663. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-16 04:37:18 +00:00
yi.wu	d8b001fa87	[SPARK-31620][SQL] Fix reference binding failure in case of an final agg contains subquery ### What changes were proposed in this pull request? Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`. ### Why are the changes needed? When planning aggregates, the partial aggregate uses agg fucs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105 For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348 This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`. ### Does this PR introduce _any_ user-facing change? Yes, user hit error previously but run query successfully after this change. ### How was this patch tested? Added a regression test. Closes #28496 from Ngone51/spark-31620. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-15 15:36:28 +00:00
Karuppayya Rajendran	72601460ad	[SPARK-31692][SQL] Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory ### What changes were proposed in this pull request? Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory ### Why are the changes needed? BEFORE ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5793cd84 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream22846025 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem5a930c03 ``` AFTER ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5c24a636 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream2ba8f528 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = LocalFS scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem ``` The type of FileSystem object created(you can check the last statement in the above snippets) in the above two cases are different, which should not have been the case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested locally. Added Unit test Closes #28516 from karuppayya/SPARK-31692. Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-13 23:18:38 -07:00
Wenchen Fan	fd2d55c991	[SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files ### What changes were proposed in this pull request? When reading/writing datetime values that before the rebase switch day, from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly do rebase or not. ### Why are the changes needed? Rebase or not rebase have different behaviors and we should let users decide it explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values. ### Does this PR introduce _any_ user-facing change? Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message to ask setting a config. ### How was this patch tested? updated tests Closes #28477 from cloud-fan/rebase. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-14 12:32:40 +09:00
beliefer	a89006aba0	[SPARK-31393][SQL] Show the correct alias in schema for expression ### What changes were proposed in this pull request? Some alias of expression can not display correctly in schema. This PR will fix them. - `TimeWindow` - `MaxBy` - `MinBy` - `UnaryMinus` - `BitwiseCount` This PR also fix a typo issue, please look at `b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)` Note: 1. `MaxBy` and `MinBy` extends `MaxMinBy` and the latter add a method `funcName` not needed. We can reuse `prettyName` to replace `funcName`. 2. Spark SQL exists some function no elegant implementation.For example: `BitwiseCount` override the sql method show below: `override def sql: String = s"bit_count(${child.sql})"` I don't think it's elegant enough. Because `Expression` gives the following definitions. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, `BitwiseCount` should override `prettyName` method. ### Why are the changes needed? Improve the implement of some expression. ### Does this PR introduce any user-facing change? 'Yes'. This PR will let user see the correct alias in schema. ### How was this patch tested? Jenkins test. Closes #28164 from beliefer/elegant-pretty-name-for-function. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-12 10:25:04 +09:00
Gabor Somogyi	5a5af46a94	[SPARK-31575][SQL] Synchronise global JVM security configuration modification ### What changes were proposed in this pull request? There is a race in secure JDBC connection providers. Namely multiple providers can read and/or write the exact same JVM security configuration at the same time. In this PR I've synchronised them with an object class. Since the configuration read and write takes couple of milliseconds it won't cause performance degradation. ### Why are the changes needed? There is a race in secure JDBC connection providers. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit + integration tests. Closes #28368 from gaborgsomogyi/SPARK-31575. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-05-11 09:10:58 -05:00
Max Gekk	5d5866be12	[SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns ### What changes were proposed in this pull request? Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle especially `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianMicros()`. ### Why are the changes needed? This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding: ```scala spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS") .select($"tsS".cast("timestamp").as("ts")).repartition(1) .write .option("parquet.enable.dictionary", true) .parquet(path) ``` Load the dates back: ```scala scala> spark.read.parquet(path).show(false) +-----------------------+ \|ts \| +-----------------------+ \|1001-01-07 00:32:20.123\| ... \|1001-01-07 00:32:20.123\| +-----------------------+ ``` Expected values must be 1001-01-01 01:02:03.123 but not 1001-01-07 00:32:20.123. ### Does this PR introduce _any_ user-facing change? Yes. After the changes: ```scala scala> spark.read.parquet(path).show(false) +-----------------------+ \|ts \| +-----------------------+ \|1001-01-01 01:02:03.123\| ... \|1001-01-01 01:02:03.123\| +-----------------------+ ``` ### How was this patch tested? Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to checked reading dictionary encoded dates. Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-11 04:58:08 +00:00
Max Gekk	ce63bef1da	[SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns ### What changes were proposed in this pull request? Modified the `decodeDictionaryIds()` method `VectorizedColumnReader` to handle especially the `DateType` when passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianDays()`. ### Why are the changes needed? This fixes the bug of loading dates before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding: ```scala spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")).repartition(1) .write .option("parquet.enable.dictionary", true) .parquet(path) ``` Load the dates back: ```scala spark.read.parquet(path).show(false) +----------+ \|date \| +----------+ \|1001-01-07\| ... \|1001-01-07\| +----------+ ``` Expected values must be 1000-01-01 but not 1001-01-07. ### Does this PR introduce _any_ user-facing change? Yes. After the changes: ```scala spark.read.parquet(path).show(false) +----------+ \|date \| +----------+ \|1001-01-01\| ... \|1001-01-01\| +----------+ ``` ### How was this patch tested? Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to checked reading dictionary encoded dates. Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-10 13:31:26 +09:00
manuzhang	77c690a725	[SPARK-31658][SQL] Fix SQL UI not showing write commands of AQE plan ### What changes were proposed in this pull request? Show write commands on SQL UI of an AQE plan ### Why are the changes needed? Currently the leaf node of an AQE plan is always a `AdaptiveSparkPlan` which is not true when it's a child of a write command. Hence, the node of the write command as well as its metrics are not shown on the SQL UI. #### Before ![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png) #### After ![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add UT. Closes #28474 from manuzhang/aqe-ui. Lead-authored-by: manuzhang <owenzhang1990@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-05-08 10:24:13 -07:00
Max Gekk	272d229005	[SPARK-31361][SQL][TESTS][FOLLOWUP] Check non-vectorized Parquet reader while date/timestamp rebasing ### What changes were proposed in this pull request? In PR, I propose to modify two tests of `ParquetIOSuite`: - SPARK-31159: rebasing timestamps in write - SPARK-31159: rebasing dates in write to check non-vectorized Parquet reader together with vectorized reader. ### Why are the changes needed? To improve test coverage and make sure that non-vectorized reader behaves similar to the vectorized reader. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `PaquetIOSuite`: ``` $ ./build/sbt "test:testOnly *ParquetIOSuite" ``` Closes #28466 from MaxGekk/test-novec-rebase-ParquetIOSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-07 07:52:29 +00:00
Kent Yao	b31ae7bb0b	[SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions ### What changes were proposed in this pull request? The RuntimeReplaceable ones are runtime replaceable, thus, their original parameters are not going to be resolved to PrettyAttribute and remain debug style string if we directly implement their `sql` methods with their parameters' `sql` methods. This PR is raised with suggestions by maropu and cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable ones with toPettySQL ### Why are the changes needed? Consistency of schema output between RuntimeReplaceable expressions and normal ones. For example, `date_format` vs `to_timestamp`, before this PR, they output differently #### Before ```sql select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu') struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string> select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS") struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp> ``` #### After ```sql select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu') struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string> select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS") struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp> ```` ### Does this PR introduce _any_ user-facing change? Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example ### How was this patch tested? regenerate all related tests Closes #28420 from yaooqinn/SPARK-31615. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-07 14:40:26 +09:00
Max Gekk	3d38bc2605	[SPARK-31361][SQL][FOLLOWUP] Use LEGACY_PARQUET_REBASE_DATETIME_IN_READ instead of avro config in ParquetIOSuite ### What changes were proposed in this pull request? Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ ` by `LEGACY_PARQUET_REBASE_DATETIME_IN_READ ` in `ParquetIOSuite`. ### Why are the changes needed? Avro config is not relevant to the parquet tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite` via ``` ./build/sbt "test:testOnly *ParquetIOSuite" ``` Closes #28461 from MaxGekk/fix-conf-in-ParquetIOSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-07 09:46:42 +09:00
yi.wu	b16ea8e1ab	[SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries ### What changes were proposed in this pull request? Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages. ### Why are the changes needed? If there're subqueries do not belong to any query stages of the main query, the main query could get final physical plan and update UI before those subqueries finished. As a result, the UI can not reflect the change from the subqueries, e.g. new nodes generated from subqueries. Before: <img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png"> After: <img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png"> ### Does this PR introduce _any_ user-facing change? No(AQE feature hasn't been released). ### How was this patch tested? Tested manually. Closes #28460 from Ngone51/fix_aqe_ui. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-06 12:52:53 +00:00
Liang-Chi Hsieh	4952f1a03c	[SPARK-31365][SQL] Enable nested predicate pushdown per data sources ### What changes were proposed in this pull request? This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown. ### Why are the changes needed? We added nested predicate pushdown feature that is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all or nothing config, and applies on all data sources. In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added/Modified unit tests. Closes #28366 from viirya/SPARK-31365. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-06 04:50:06 +00:00
sychen	588966d696	[SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters ### What changes were proposed in this pull request? Metadata-only queries should not include subquery in partition filters. ### Why are the changes needed? Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`. ### Does this PR introduce any user-facing change? Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters. ### How was this patch tested? add UT Closes #28383 from cxzl25/fix_SPARK-31590. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-06 10:56:19 +09:00
Max Gekk	bd26429931	[SPARK-31641][SQL] Fix days conversions by JSON legacy parser ### What changes were proposed in this pull request? Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date. ### Why are the changes needed? The changes fix a bug and restore compatibility with Spark 2.4 in which: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ \| d\| +----------+ \|1582-01-01\| +----------+ ``` ### Does this PR introduce _any_ user-facing change? Yes. Before: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ \| d\| +----------+ \|1582-01-11\| +----------+ ``` After: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ \| d\| +----------+ \|1582-01-01\| +----------+ ``` ### How was this patch tested? Add a test to `JsonSuite`. Closes #28453 from MaxGekk/json-rebase-legacy-days. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-05 14:15:31 +00:00
Max Gekk	bef5828e12	[SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold ### What changes were proposed in this pull request? Skip timestamps rebasing after a global threshold when there is no difference between Julian and Gregorian calendars. This allows to avoid checking hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`. ### Why are the changes needed? The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`. Before (see the PR's results https://github.com/apache/spark/pull/28440): ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 376 388 10 13.3 75.2 1.1X Collect java.sql.Timestamp 1878 1937 64 2.7 375.6 0.2X ``` After: ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 249 264 24 20.1 49.8 1.7X Collect java.sql.Timestamp 1503 1523 24 3.3 300.5 0.3X ``` Perf improvements in average of: 1. From java.sql.Timestamp is ~ 34% 2. To java.sql.Timestamps is ~16% ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`. Closes #28441 from MaxGekk/opt-rebase-common-threshold. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-05 14:11:53 +00:00
turbofei	8d1f7d2a4a	[SPARK-31467][SQL][TEST] Refactor the sql tests to prevent TableAlreadyExistsException ### What changes were proposed in this pull request? If we add UT in hive/SQLQuerySuite or other sql test suites and use table named `test`. We may meet TableAlreadyExistsException. ``` org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default' ``` The reason is that, there is some tests that does not clean up the tables/views. In this PR, I add `withTempViews` for these tests. ### Why are the changes needed? To fix the TableAlreadyExistsException issue when adding an UT, which uses table named `test` or others, in some sql test suites, such as hive/SQLQuerySuite. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existed UT. Closes #28239 from turboFei/SPARK-31467. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-05 15:14:33 +09:00
Max Gekk	735771e7b4	[SPARK-31623][SQL][TESTS] Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write ### What changes were proposed in this pull request? Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of INT96 and TIMESTAMP_MICROS column types. Here are benchmark results for reading timestamps after 1582 year with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS, and rebasing on for INT96): timestamp type \| vectorized off (ns/row) \| vectorized on (ns/row) --\|--\|-- TIMESTAMP_MICROS\| 160.1 \| 50.2 INT96 \| 215.6 \| 117.8 TIMESTAMP_MILLIS \| 159.9 \| 60.6 ### Why are the changes needed? To compare default timestamp type `TIMESTAMP_MICROS` with other types in the case if an user decides to switch on them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmarks via: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-05 05:40:15 +00:00
beliefer	b9494206a5	[SPARK-31372][SQL][TEST][FOLLOW-UP] Improve ExpressionsSchemaSuite so that easy to track the diff ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/28194. As discussed at https://github.com/apache/spark/pull/28194/files#r418418796. This PR will improve `ExpressionsSchemaSuite` so that easy to track the diff. Although `ExpressionsSchemaSuite` at line `b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L165)` just want to compare the total size between expected output size and the newest output size, the scalatest framework will output the extra information contains all the content of expected output and newest output. This PR will try to avoid this issue. After this PR, the exception looks like below: ``` [info] - Check schemas for expression examples * FAILED * (7 seconds, 336 milliseconds) [info] 340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167) ``` ### Why are the changes needed? Make the exception more concise and clear. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #28430 from beliefer/improve-expressions-schema-suite. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-05 10:04:16 +09:00
Max Gekk	372ccba063	[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default ### What changes were proposed in this pull request? This reverts commit `43a73e387c`. It sets `INT96` as the timestamp type while saving timestamps to parquet files. ### Why are the changes needed? To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites. Closes #28450 from MaxGekk/parquet-int96. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-04 17:27:02 -07:00
Burak Yavuz	02a319d7e1	[SPARK-31624] Fix SHOW TBLPROPERTIES for V2 tables that leverage the session catalog ## What changes were proposed in this pull request? SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that, by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties. Henceforth, if they cannot be set or controlled by the user, then they shouldn't be displayed as such. ### Why are the changes needed? Shows the incorrect table properties, i.e. only what exists in the Hive MetaStore for V2 tables that may have table properties outside of the MetaStore. ### Does this PR introduce _any_ user-facing change? Fixes a bug ### How was this patch tested? Regression test Closes #28434 from brkyvz/ddlCommands. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-04 12:22:29 +00:00
Wenchen Fan	f72220b8ab	[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase ### What changes were proposed in this pull request? Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly. ### Why are the changes needed? Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare). ### Does this PR introduce any user-facing change? no ### How was this patch tested? Run part of the `DateTimeRebaseBenchmark` locally. The results: before this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X [info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X [info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X [info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X [info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X [info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X [info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X ``` After this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X [info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X [info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X [info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X [info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X [info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X [info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X ``` Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase. Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase. Closes #28406 from cloud-fan/perf. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-04 15:30:10 +09:00
Tianshi Zhu	a222644e1d	[SPARK-31267][SQL] Flaky test: WholeStageCodegenSparkSubmitSuite.Generated code on driver should not embed platform-specific constant ### What changes were proposed in this pull request? Allow customized timeouts for `runSparkSubmit`, which will make flaky tests more likely to pass by using a larger timeout value. I was able to reproduce the test failure on my laptop, which took 1.5 - 2 minutes to finish the test. After increasing the timeout, the test now can pass locally. ### Why are the changes needed? This allows slow tests to use a larger timeout, so they are more likely to succeed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The test was able to pass on my local env after the change. Closes #28438 from tianshizz/SPARK-31267. Authored-by: Tianshi Zhu <zhutianshirea@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-04 14:50:38 +09:00
Max Gekk	2fb85f6b68	[SPARK-31527][SQL][TESTS][FOLLOWUP] Fix the number of rows in `DateTimeBenchmark` ### What changes were proposed in this pull request? - Changed to the number of rows in benchmark cases from 3 to the actual number `N`. - Regenerated benchmark results in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 \| ### Why are the changes needed? The changes are needed to have: - Correct benchmark results - Base line for other perf improvements that can be checked in the same environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmark and checking its output. Closes #28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-04 09:39:50 +09:00
Max Gekk	13dddee9a8	[MINOR][SQL][TESTS] Disable UI in SQL benchmarks by default ### What changes were proposed in this pull request? Set `spark.ui.enabled` to `false` in `SqlBasedBenchmark.getSparkSession`. This disables UI in all SQL benchmarks by default. ### Why are the changes needed? UI overhead lowers numbers in the `Relative` column and impacts on `Stdev` in benchmark results. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Checked by running `DateTimeRebaseBenchmark`. Closes #28432 from MaxGekk/ui-off-in-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-02 17:54:36 +09:00
Pablo Langa	4fecc20f6e	[SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements ### What changes were proposed in this pull request? The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinayType this is not the case. Example: ```scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window case class R(id: String, value: String, bytes: Array[Byte]) def makeR(id: String, value: String) = R(id, value, value.getBytes) val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF() // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected). df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false) // The same problem is displayed when using window functions. val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) val result = df.select( collect_set('value).over(win) as "stringSet", collect_set('bytes).over(win) as "bytesSet" ) .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize") .show() ``` We use a HashSet buffer to accumulate the results, the problem is that arrays equality in Scala don't behave as expected, arrays ara just plain java arrays and the equality don't compare the content of the arrays Array(1, 2, 3) == Array(1, 2, 3) => False The result is that duplicates are not removed in the hashset The solution proposed is that in the last stage, when we have all the data in the Hashset buffer, we delete duplicates changing the type of the elements and then transform it to the original type. This transformation is only applied when we have a BinaryType ### Why are the changes needed? Fix the bug explained ### Does this PR introduce any user-facing change? Yes. Now `collect_set()` correctly deduplicates array of byte. ### How was this patch tested? Unit testing Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-01 22:09:04 +09:00
Yuanjian Li	aec8b69435	[SPARK-28424][TESTS][FOLLOW-UP] Add test cases for all interval units ### What changes were proposed in this pull request? Add test cases covering all interval units: MICROSECOND MILLISECOND SECOND MINUTE HOUR DAY WEEK MONTH YEAR ### Why are the changes needed? For test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test only. Closes #28418 from xuanyuanking/SPARK-28424. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-01 10:32:37 +09:00
Yuanjian Li	7195a18bf2	[SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests ### What changes were proposed in this pull request? - Rephrase the API doc for `Column.as` - Simplify the UTs ### Why are the changes needed? Address comments in https://github.com/apache/spark/pull/28326 ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #28390 from xuanyuanking/SPARK-27340-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-30 06:24:00 +00:00
beliefer	1d1bb79bc6	[SPARK-31372][SQL][TEST] Display expression schema for double check ### What changes were proposed in this pull request? Although SPARK-30184 Implement a helper method for aliasing functions, developers always forget to using this improvement. We need to add more powerful guarantees so that aliases outputed by built-in functions are correct. This PR extracts the SQL from the example of expressions, and output the SQL and its schema into one golden file. By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them. ### Why are the changes needed? Ensure that the output alias is correct ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #28194 from beliefer/check-expression-schema. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-30 03:58:04 +00:00
Kent Yao	9241f8282f	[SPARK-31586][SQL][FOLLOWUP] Restore SQL string for datetime - interval operations ### What changes were proposed in this pull request? Because of `ebc8fa50d0` and `beec8d535f`, the SQL output strings for date/timestamp - interval operation will have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....` This PR restore this behavior by adding one `RuntimeReplaceable `implementation for both of the operations to have their pretty SQL strings back. ### Why are the changes needed? restore the SQL string for datetime operations ### Does this PR introduce any user-facing change? NO, we are restoring here ### How was this patch tested? added unit tests Closes #28402 from yaooqinn/SPARK-31586-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-30 03:31:29 +00:00
Max Gekk	91648654da	[SPARK-31553][SQL][TESTS][FOLLOWUP] Tests for collection elem types of `isInCollection` ### What changes were proposed in this pull request? - Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`. - Test different switch thresholds in the `isInCollection: Scala Collection` test. ### Why are the changes needed? To prevent regressions like introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing and new tests in `ColumnExpressionSuite` Closes #28405 from MaxGekk/test-isInCollection. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-30 03:20:10 +00:00
Takeshi Yamamuro	97f2c03d3b	[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema ### What changes were proposed in this pull request? This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic. Before this PR (a column name changes run-by-run): ``` scala> sql("select rand()").show() +-------------------------+ \|rand(7986133828002692830)\| +-------------------------+ \| 0.9524061403696937\| +-------------------------+ ``` After this PR (a column name fixed): ``` scala> sql("select rand()").show() +------------------+ \| rand()\| +------------------+ \|0.7137935639522275\| +------------------+ // If a seed given, it is still shown in a column name // (the same with the current behaviour) scala> sql("select rand(1)").show() +------------------+ \| rand(1)\| +------------------+ \|0.6363787615254752\| +------------------+ // We can still check a seed in explain output: scala> sql("select rand()").explain() == Physical Plan == (1) Project [rand(-2282124938778456838) AS rand()#0] +- (1) Scan OneRowRelation[] ``` Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests. ### Why are the changes needed? To make output schema deterministic. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #28392 from maropu/SPARK-31594. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-29 00:14:50 -07:00
Terry Kim	36803031e8	[SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views ### What changes were proposed in this pull request? This PR addresses two things: - `SHOW TBLPROPERTIES` should supports view (a regression introduced by #26921) - `SHOW TBLPROPERTIES` on a temporary view should return empty result (2.4 behavior instead of throwing `AnalysisException`. ### Why are the changes needed? It's a bug. ### Does this PR introduce any user-facing change? Yes, now `SHOW TBLPROPERTIES` works on views: ``` scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES view").show(truncate=false) +---------------------------------+-------------+ \|key \|value \| +---------------------------------+-------------+ \|view.catalogAndNamespace.numParts\|2 \| \|view.query.out.col.0 \|c1 \| \|view.query.out.numCols \|1 \| \|p2 \|v2 \| \|view.catalogAndNamespace.part.0 \|spark_catalog\| \|p1 \|v1 \| \|view.catalogAndNamespace.part.1 \|default \| +---------------------------------+-------------+ ``` And for a temporary view: ``` scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false) +---+-----+ \|key\|value\| +---+-----+ +---+-----+ ``` ### How was this patch tested? Added tests. Closes #28375 from imback82/show_tblproperties_followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-29 07:06:45 +00:00
Kent Yao	ea525fe8c0	[SPARK-31597][SQL] extracting day from intervals should be interval.days + days in interval.microsecond ### What changes were proposed in this pull request? With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933 I Checked with both Presto and PostgresSQL, one is implemented intervals with ANSI style year-month/day-time, and the other is mixed and Non-ANSI. They both add the exceeded days in interval time part to the total days of the operation which extracts day from interval values. ```sql presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp))); _col0 ------- 14 (1 row) Query 20200428_135239_00000_ahn7x, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0:01 [0 rows, 0B] [0 rows/s, 0B/s] presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp))); _col0 ------- 13 (1 row) Query 20200428_135246_00001_ahn7x, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0:00 [0 rows, 0B] [0 rows/s, 0B/s] presto> ``` ```sql postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp))); date_part ----------- 14 (1 row) postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp))); date_part ----------- 13 ``` ``` spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp))); 0 spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp))); 0 ``` In ANSI standard, the day is exact 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value. ### Why are the changes needed? Both satisfy the ANSI standard and common use cases in modern SQL platforms ### Does this PR introduce any user-facing change? No, it new in 3.0 ### How was this patch tested? add more uts Closes #28396 from yaooqinn/SPARK-31597. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-29 06:56:33 +00:00

1 2 3 4 5 ...

6801 commits