ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Maxim Gekk	820bb9985a	[SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time ### What changes were proposed in this pull request? 1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis`, which returns the earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is and setting only the DST offset. I don't set `ZONE_OFFSET` because time zone resolution works differently in the Java 8 and Java 7 time APIs. If I set the standard zone offset too, this could change the behavior, and rebasing would not give the same result as Spark 2.4. 2. Fix `rebaseJulianToGregorianMicros()` by changing the resulting zoned date-time if `DST_OFFSET` is zero, which means the input date-time has passed an autumn daylight saving cutover. In that case, I take the later of the 2 overlapped timestamps. Otherwise, I return the zoned date-time without any modification because that is equal to calling the `withEarlierOffsetAtOverlap()` method, so we can optimize the case. ### Why are the changes needed? This fixes the bug of losing DST offset info when rebasing timestamps via local date-time. 
For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see <img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png"> Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets. 
- Run the following code to check that there is no difference between rebased and original micros for modern timestamps: ```scala test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } ``` ``` local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes counter = 15 ``` The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds. 
- By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 1.8.0_242-8u242/11.0.6+10 \| Closes #28101 from MaxGekk/fix-local-date-overlap. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-03 04:35:31 +00:00
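The overlap described in SPARK-31328 can be reproduced with the plain `java.time` API. The following is a minimal sketch, not the code from the patch: the object `DstOverlapSketch` and its members are hypothetical names used only for illustration. It shows that one local date-time maps to two instants, and that a round trip through the local date-time has to restore the offset to return the original instant.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZonedDateTime}

// Hypothetical illustration; not part of Spark's DateTimeUtils.
object DstOverlapSketch {
  val zone: ZoneId = ZoneId.of("America/Los_Angeles")

  // 01:00 on 2019-11-03 occurs twice in Los Angeles (clocks go back at 02:00).
  val local: LocalDateTime = LocalDateTime.of(2019, 11, 3, 1, 0)

  // atZone() resolves an overlap to the earlier offset (-07:00, summer time).
  val earlier: ZonedDateTime = local.atZone(zone)

  // The same local date-time with the later offset (-08:00, standard time).
  val later: ZonedDateTime = earlier.withLaterOffsetAtOverlap()

  // Round trip via a local date-time that keeps the original offset.
  def roundTrip(instant: Instant): Instant = {
    val zoned = instant.atZone(zone)
    // Going through the local date-time alone always picks the earlier offset.
    val naive = zoned.toLocalDateTime.atZone(zone)
    // Restore the offset that the input instant actually had.
    if (zoned.getOffset == naive.getOffset) naive.toInstant
    else naive.withLaterOffsetAtOverlap().toInstant
  }

  def main(args: Array[String]): Unit = {
    println(s"earlier = $earlier -> ${earlier.toInstant}")
    println(s"later   = $later -> ${later.toInstant}")
  }
}
```

Without the offset check in `roundTrip`, the instant `2019-11-03T09:00:00Z` would come back as `2019-11-03T08:00:00Z`, which is the symptom the commit message describes.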
Takeshi Yamamuro	d98df7626b	[SPARK-31325][SQL][WEB UI] Control a plan explain mode in the events of SQL listeners via SQLConf ### What changes were proposed in this pull request? This PR intends to add a new SQL config for controlling a plan explain mode in the events of (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`) SQL listeners. In the current master, the output of `QueryExecution.toString` (this is equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of a SQL tab in a Spark web UI will be changed as follows; Before this PR: ![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png) After this PR: ![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png) ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`. ### How was this patch tested? Added unit tests. Closes #28097 from maropu/SPARK-31325. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-04-02 21:09:16 -07:00
Wenchen Fan	2c39502e84	[SPARK-31253][SQL][FOLLOWUP] Add metrics to AQE shuffle reader ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/28022, to address four issues: 1. Add an assert in `CustomShuffleReaderExec` to make sure the partition specs are either all `PartialMapperPartitionSpec` or none. 2. Do not use `lazy val` for `partitionDataSizeMetrics` and `skewedPartitionMetrics`, as they will be merged into `metrics`, and `lazy val` will be serialized. 3. Mark `metrics` as `transient`, as it's only used on the driver side. 4. Move `FileUtils.byteCountToDisplaySize` into `logDebug`, to save some calculation if the log level is above debug. ### Why are the changes needed? followup improvement ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #28103 from cloud-fan/ui. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-04-02 16:02:47 -07:00
beliefer	a9260d0349	[SPARK-31315][SQL] SQLQueryTestSuite: Display the total compile time for generated java code ### What changes were proposed in this pull request? After some investigation, I found that `SQLQueryTestSuite` spends a lot of time compiling the generated Java code. Take `group-by.sql` as an example. I first added some debug logging to `SQLQueryTestSuite`, see `92b6af740c/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L402)` The execution command is as follows: `build/sbt "~sql/test-only *SQLQueryTestSuite -- -z group-by.sql"` The output is shown below: ``` 00:56:06.192 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=true. run time: 20604 00:56:13.719 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY. run time: 7526 00:56:18.786 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN. run time: 5066 ``` The log gives the following run times: Config \| Run time(ms) -- \| -- spark.sql.codegen.wholeStage=true \| 20604 spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY \| 7526 spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN \| 5066 We should display the total compile time for generated Java code. This PR adds the following to `SQLQueryTestSuite`'s output. ``` === Metrics of Whole Codegen === Total compile time: 80.564516529 seconds ``` Note: At first, I wanted to use `CodegenMetrics.METRIC_COMPILATION_TIME` to do this. After many experiments, I found that `CodegenMetrics.METRIC_COMPILATION_TIME` is only effective for a single test case, and cannot cover the whole life cycle of `SQLQueryTestSuite`. 
The type of `CodegenMetrics.METRIC_COMPILATION_TIME` is `Histogram`, which preserves only 1028 elements. `Histogram` is a metric that calculates the distribution of a value. ### Why are the changes needed? Display the total compile time for generated Java code. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #28081 from beliefer/output-codegen-compile-time. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-02 09:13:22 +00:00
Kent Yao	1ce584f6b7	[SPARK-31321][SQL] Remove SaveMode check in v2 FileWriteBuilder ### What changes were proposed in this pull request? The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`. In https://github.com/apache/spark/pull/25876, we removed save mode for DSV2 from DataFrameWriter. As a result, the `mode` method is never called, which makes `validateInputs` fail deterministically because `mode` is never set. ### Why are the changes needed? Remove dead code. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests. Closes #28090 from yaooqinn/SPARK-31321. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-02 08:34:36 +00:00
Mukul Murthy	34abbb677d	[SPARK-31324][SS] Include stream ID in the termination timeout error message ### What changes were proposed in this pull request? This PR (SPARK-31324) aims to include stream ID in the error thrown when a stream does not stop() in time. ### Why are the changes needed? https://github.com/apache/spark/pull/26771/ added a conf to set a requested timeout for stopping a stream, after which the stop() method throws. From seeing this in a production use case with several streams running, it's helpful to include which stream failed to stop in the error message. ### Does this PR introduce any user-facing change? If a stream times out when terminating, the error message now includes the stream ID. Before: `Stream Execution thread failed to stop within 2000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.` After: `Stream Execution thread for stream [id = 8513769d-b9d2-4902-9b36-3668bd022245, runId = 21ed8c35-9bfe-423f-853d-c022d91818bc] failed to stop within 2000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.` ### How was this patch tested? Updated existing unit test Closes #28095 from mukulmurthy/31324-id. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-02 12:37:58 +09:00
Wenchen Fan	09f036a14c	[SPARK-31322][SQL] rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries ### What changes were proposed in this pull request? rename `QueryPlan.collectInPlanAndSubqueries` to `collectWithSubqueries` ### Why are the changes needed? The old name is too verbose. `QueryPlan` is internal but it's the core of catalyst and we'd better make the API name clearer before we release it. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #28092 from cloud-fan/rename. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-01 12:04:40 -07:00
Max Gekk	91af87d34e	[SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource ### What changes were proposed in this pull request? In the PR, I propose to add new benchmarks to `DateTimeRebaseBenchmark` for saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methods `caseName()` and `getPath()`. Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see #28067 as an example. ### Why are the changes needed? To have the base line for future optimizations of `fromJavaDate()`/`toJavaDate()` and `toJavaTimestamp()`/`fromJavaTimestamp()` in `DateTimeUtils`. The methods are used while saving/loading dates/timestamps by ORC datasource. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 1.8.0_242-8u242/11.0.6+10 \| Closes #28076 from MaxGekk/rebase-benchmark-orc. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-01 07:02:26 +00:00
Maxim Gekk	c5323d2e8d	[SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write ### What changes were proposed in this pull request? In the PR, I propose to replace the following SQL configs: 1. `spark.sql.legacy.parquet.rebaseDateTime.enabled` by - `spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled` (`false` by default). The config enables rebasing dates/timestamps while saving to Parquet files. If it is set to `true`, dates/timestamps are converted to local date-time in Proleptic Gregorian calendar, date-time fields are extracted, and used in building new local date-time in the hybrid calendar (Julian + Gregorian). The resulted local date-time is converted to days or microseconds since the epoch. - `spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled` (`false` by default). The config enables rebasing of dates/timestamps in reading from Parquet files. 2. `spark.sql.legacy.avro.rebaseDateTime.enabled` by - `spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled` (`false` by default). It enables dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar via local date/timestamps. - `spark.sql.legacy.avro.rebaseDateTimeInRead.enabled` (`false` by default). It enables rebasing dates/timestamps from the hybrid calendar to Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z. ### Why are the changes needed? This allows to load dates/timestamps saved by Spark 2.4, and save to Parquet/Avro files without rebasing. And the reverse use case - load data saved by Spark 3.0, and save it in the form which is compatible with Spark 2.4. ### Does this PR introduce any user-facing change? Yes, users have to use new SQL configs. Old SQL configs are removed by the PR. 
### How was this patch tested? By existing test suites `AvroV1Suite`, `AvroV2Suite` and `ParquetIOSuite`. Closes #28082 from MaxGekk/split-rebase-configs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-01 04:56:05 +00:00
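The rebasing procedure described in SPARK-31318 (take the local date in the source calendar, reinterpret the same fields in the target calendar, and count days since the epoch) can be sketched with JDK classes only. This is an illustrative sketch under stated assumptions, not Spark's implementation: `RebaseViaLocalDateSketch` and `gregorianToHybridDays` are hypothetical names, only the days case in the write direction is covered, and years before 1 are rejected for simplicity.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical illustration of rebasing via local date fields.
object RebaseViaLocalDateSketch {
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Proleptic Gregorian days since 1970-01-01 -> hybrid (Julian+Gregorian) days.
  def gregorianToHybridDays(days: Int): Int = {
    // 1. Interpret the input in the Proleptic Gregorian calendar (java.time).
    val localDate = LocalDate.ofEpochDay(days.toLong)
    require(localDate.getYear >= 1, "this sketch only handles years >= 1")

    // 2. Build the same year/month/day in the hybrid calendar.
    //    java.util.GregorianCalendar switches from Julian to Gregorian
    //    on 1582-10-15 by default.
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()
    cal.set(localDate.getYear, localDate.getMonthValue - 1, localDate.getDayOfMonth)

    // 3. Count days since the epoch in the hybrid calendar.
    Math.floorDiv(cal.getTimeInMillis, MillisPerDay).toInt
  }

  def main(args: Array[String]): Unit = {
    // 0001-01-01 in Proleptic Gregorian is day -719162.
    println(gregorianToHybridDays(-719162))
  }
}
```

Dates on or after 1582-10-15 come back unchanged, which is why the new configs matter only for old dates and timestamps.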
Wenchen Fan	34c7ec8e0c	[SPARK-31253][SQL] Add metrics to AQE shuffle reader ### What changes were proposed in this pull request? Add SQL metrics to the AQE shuffle reader (`CustomShuffleReaderExec`) ### Why are the changes needed? to be more UI friendly ### Does this PR introduce any user-facing change? No ### How was this patch tested? new test Closes #28022 from cloud-fan/metrics. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-31 13:03:52 -07:00
yi.wu	590b9a0132	[SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF ### What changes were proposed in this pull request? Added a Java UDF suggestion to the error message of untyped Scala UDF. ### Why are the changes needed? To help users migrate their use cases from the deprecated untyped Scala UDF to other supported UDFs. ### Does this PR introduce any user-facing change? No. It hasn't been released. ### How was this patch tested? Pass Jenkins. Closes #28070 from Ngone51/spark_31010. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-31 17:35:26 +00:00
Wenchen Fan	8b01473e8b	[SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2) ### What changes were proposed in this pull request? Create statement plans in `DataFrameWriter(V2)`, like the SQL API. ### Why are the changes needed? It's better to leave all the resolution work to the analyzer. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27992 from cloud-fan/statement. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-31 23:19:46 +08:00
Maxim Gekk	bb0b416f0b	[SPARK-31297][SQL] Speed up dates rebasing ### What changes were proposed in this pull request? In the PR, I propose to replace the current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` with a new one based on the fact that the difference between the Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars changes only 14 times over the entire supported range of valid dates `[0001-01-01, 9999-12-31]`: \| date \| Proleptic Greg. days \| Hybrid (Julian+Greg) days \| diff\| \| ---- \| ----\|----\|----\| \|0001-01-01\|-719162\|-719164\|-2\| \|0100-03-01\|-682944\|-682945\|-1\| \|0200-03-01\|-646420\|-646420\|0\| \|0300-03-01\|-609896\|-609895\|1\| \|0500-03-01\|-536847\|-536845\|2\| \|0600-03-01\|-500323\|-500320\|3\| \|0700-03-01\|-463799\|-463795\|4\| \|0900-03-01\|-390750\|-390745\|5\| \|1000-03-01\|-354226\|-354220\|6\| \|1100-03-01\|-317702\|-317695\|7\| \|1300-03-01\|-244653\|-244645\|8\| \|1400-03-01\|-208129\|-208120\|9\| \|1500-03-01\|-171605\|-171595\|10\| \|1582-10-15\|-141427\|-141427\|0\| For the given days since the epoch, the proposed implementation finds the range of days to which the input belongs, and adds the diff in days between the calendars to the input. The result is the rebased days since the epoch in the target calendar. For example, suppose we need to rebase -650000 days from the Proleptic Gregorian calendar to the hybrid calendar. The input falls into the bucket [-682944, -646420), and the diff associated with that range is -1. To get the rebased days in the Julian calendar, we add -1 to -650000, and the result is -650001. ### Why are the changes needed? To make dates rebasing faster. ### Does this PR introduce any user-facing change? No, the results should be the same for the valid range of the `DATE` type `[0001-01-01, 9999-12-31]`. ### How was this patch tested? 
- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that results of old and new implementation (optimized version) are the same for all supported dates. - Re-run `DateTimeRebaseBenchmark` on: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK8/11 \| Closes #28067 from MaxGekk/optimize-rebasing. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-31 17:38:47 +08:00
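The bucket lookup described in SPARK-31297 can be sketched directly from the table in the commit message. This is a simplified illustration, not the code merged into `DateTimeUtils`: the names `RebaseLookupSketch`, `switchDays`, `diffs`, and `gregorianToHybridDays` are hypothetical, a linear scan stands in for whatever search the real implementation uses, and inputs before 0001-01-01 are rejected.

```scala
// Hypothetical illustration of the lookup-table approach.
object RebaseLookupSketch {
  // Proleptic Gregorian days since the epoch at which the difference between
  // the calendars changes (second column of the table in the commit message).
  private val switchDays: Array[Int] = Array(
    -719162, -682944, -646420, -609896, -536847, -500323, -463799,
    -390750, -354226, -317702, -244653, -208129, -171605, -141427)

  // Difference (hybrid days minus Proleptic Gregorian days) valid from the
  // corresponding switch day until the next one (last column of the table).
  private val diffs: Array[Int] =
    Array(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0)

  def gregorianToHybridDays(days: Int): Int = {
    require(days >= switchDays(0), "input is before 0001-01-01")
    // Find the last bucket whose start is <= days.
    var i = switchDays.length - 1
    while (days < switchDays(i)) i -= 1
    days + diffs(i)
  }

  def main(args: Array[String]): Unit = {
    // The worked example from the commit message.
    println(gregorianToHybridDays(-650000))
  }
}
```

With only 14 buckets, the lookup replaces calendar arithmetic with a few integer comparisons and one addition.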
Ben Ryves	fa37856710	[SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound ### What changes were proposed in this pull request? A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`. ### Why are the changes needed? `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](`a1dbcd13a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala (L71)`)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0). ### Does this PR introduce any user-facing change? Only documentation changes. ### How was this patch tested? Documentation changes only. Closes #28071 from Smeb/master. Authored-by: Ben Ryves <benjamin.ryves@getyourguide.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-31 15:16:17 +09:00
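Why the exclusive upper bound matters can be shown with a small sketch. Note the swap: it uses `scala.util.Random.nextDouble()`, which is also documented to return values in `[0.0, 1.0)`, rather than Spark's `rand()`; `HalfOpenIntervalSketch` and its methods are hypothetical names.

```scala
import scala.util.Random

// Hypothetical illustration; uses scala.util.Random, not Spark's rand().
object HalfOpenIntervalSketch {
  // Draws n uniform samples from [0.0, 1.0) with a fixed seed.
  def samples(n: Int, seed: Long): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextDouble())
  }

  // Because u < 1.0, the result is always a valid index in [0, size).
  // If 1.0 were a possible value, this could return `size` and go out of bounds.
  def randomIndex(u: Double, size: Int): Int = (u * size).toInt

  def main(args: Array[String]): Unit = {
    val xs = samples(5, seed = 42L)
    println(xs.map(randomIndex(_, 10)))
  }
}
```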
Dongjoon Hyun	cda2e30e77	Revert "[SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery" This reverts commit `f376d24ea1`.	2020-03-30 19:14:14 -07:00
Maxim Gekk	a1dbcd13a3	[SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource ### What changes were proposed in this pull request? In the PR, I propose to add a new benchmark `DateTimeRebaseBenchmark` which measures the performance of rebasing dates/timestamps between the hybrid calendar (Julian+Gregorian) and the Proleptic Gregorian calendar: 1. In write, it separately saves dates and timestamps before and after the year 1582, with and without rebasing. 2. In read, it loads the previously saved parquet files with the vectorized reader and with the regular reader. Here is the summary of benchmarking: - Saving timestamps is ~6 times slower - Loading timestamps w/ vectorized off is ~4 times slower - Loading timestamps w/ vectorized on is ~10 times slower ### Why are the changes needed? To know the impact of date-time rebasing introduced by #27915, #27953, #27807. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK8/11 \| Closes #28057 from MaxGekk/rebase-bechmark. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-30 16:46:31 +08:00
Oleksii Kachaiev	22bb6b0fdd	[SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax ### What changes were proposed in this pull request? `DataFrameStatFunctions` now works correctly with fully qualified column name (Table.Column syntax) by properly resolving the name instead of relying on field names from schema, notably: * `approxQuantile` * `freqItems` * `cov` * `corr` (other functions from `DataFrameStatFunctions` already work correctly). See code examples below. ### Why are the changes needed? With current implementation some stat functions are impossible to use when joining datasets with similar column names. ### Does this PR introduce any user-facing change? Yes. Before the change, the following code would fail with `AnalysisException`. ```scala scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int] scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2") df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int] scala> val dfx = df2.crossJoin(df1) dfx: org.apache.spark.sql.DataFrame = [num: int, num: int] scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0) res0: Array[Double] = Array(1.0) scala> dfx.stat.corr("table1.num", "table2.num") res1: Double = 1.0 scala> dfx.stat.cov("table1.num", "table2.num") res2: Double = 11.0 scala> dfx.stat.freqItems(Array("table1.num", "table2.num")) res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, table2.num_freqItems: array<int>] ``` ### How was this patch tested? Corresponding unit tests are added to `DataFrameStatSuite.scala` (marked as "SPARK-30532"). Closes #27916 from kachayev/fix-spark-30532. Authored-by: Oleksii Kachaiev <kachayev@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-30 13:20:57 +08:00
Maxim Gekk	d2ff5c5bfb	[SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp ### What changes were proposed in this pull request? In the PR, I propose to update the doc for the `timeZone` option in JSON/CSV datasources and for the `tz` parameter of the `from_utc_timestamp()`/`to_utc_timestamp()` functions, and to restrict the format of the config's values to 2 forms: 1. Geographical regions, such as `America/Los_Angeles`. 2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`. ### Why are the changes needed? Other formats such as three-letter time zone IDs are ambiguous, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` or `China Standard Time`. Such formats have already been deprecated in the JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html). ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`, and manual testing. Closes #28051 from MaxGekk/doc-time-zone-option. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-30 12:20:11 +08:00
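The two time zone ID forms named in SPARK-31286 behave differently, which a short `java.time` sketch can illustrate. It is an illustration only and does not show how Spark parses the `timeZone` option; `ZoneIdFormsSketch` and `offsetAt` are hypothetical names.

```scala
import java.time.{Instant, ZoneId, ZoneOffset}

// Hypothetical illustration of the two supported time zone ID forms.
object ZoneIdFormsSketch {
  // Form 1: geographical region. The offset depends on the instant (DST rules).
  val region: ZoneId = ZoneId.of("America/Los_Angeles")

  // Form 2: fixed offset. Always the same offset from UTC.
  val fixed: ZoneId = ZoneId.of("-08:00")

  // Resolves the UTC offset that a zone has at a given instant.
  def offsetAt(zone: ZoneId, instant: Instant): ZoneOffset =
    zone.getRules.getOffset(instant)

  def main(args: Array[String]): Unit = {
    val winter = Instant.parse("2020-01-15T00:00:00Z")
    val summer = Instant.parse("2020-07-15T00:00:00Z")
    println(s"region: ${offsetAt(region, winter)} / ${offsetAt(region, summer)}")
    println(s"fixed:  ${offsetAt(fixed, winter)} / ${offsetAt(fixed, summer)}")
  }
}
```

A region ID follows daylight saving rules, while a fixed offset does not, so the two forms are not interchangeable even when they agree for part of the year.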
Kent Yao	f376d24ea1	[SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery ### What changes were proposed in this pull request? ```sql scala> spark.sql(" select * from values(1), (2) t(key) where key in (select 1 as key where 1=0)").queryExecution res15: org.apache.spark.sql.execution.QueryExecution = == Parsed Logical Plan == 'Project [] +- 'Filter 'key IN (list#39 []) : +- Project [1 AS key#38] : +- Filter (1 = 0) : +- OneRowRelation +- 'SubqueryAlias t +- 'UnresolvedInlineTable [key], [List(1), List(2)] == Analyzed Logical Plan == key: int Project [key#40] +- Filter key#40 IN (list#39 []) : +- Project [1 AS key#38] : +- Filter (1 = 0) : +- OneRowRelation +- SubqueryAlias t +- LocalRelation [key#40] == Optimized Logical Plan == Join LeftSemi, (key#40 = key#38) :- LocalRelation [key#40] +- LocalRelation <empty>, [key#38] == Physical Plan == (1) BroadcastHashJoin [key#40], [key#38], LeftSemi, BuildRight :- *(1) LocalTableScan [key#40] +- Br... ``` `LocalRelation <empty>` should be able to propagate after subqueries are lifted up to joins. ### Why are the changes needed? To optimize the query. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Added new tests. Closes #28043 from yaooqinn/SPARK-31280. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-29 11:32:22 -07:00
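The idea behind SPARK-31280 is that a left semi join whose right side is an empty relation can itself be replaced by an empty relation. A toy sketch follows. It uses a made-up plan algebra, not Catalyst's classes or optimizer rules: `PropagateEmptySketch`, `ToyRelation`, `ToyLeftSemiJoin`, and `propagate` are hypothetical names.

```scala
// Hypothetical toy plan algebra; not Catalyst's LogicalPlan.
object PropagateEmptySketch {
  sealed trait Plan
  final case class ToyRelation(rows: Seq[Int]) extends Plan
  final case class ToyLeftSemiJoin(left: Plan, right: Plan) extends Plan

  private def isEmpty(plan: Plan): Boolean = plan match {
    case ToyRelation(rows) => rows.isEmpty
    case _                 => false
  }

  // Bottom-up rewrite: a left semi join returns no rows if either side is empty.
  def propagate(plan: Plan): Plan = plan match {
    case ToyLeftSemiJoin(left, right) =>
      val newLeft = propagate(left)
      val newRight = propagate(right)
      if (isEmpty(newLeft) || isEmpty(newRight)) ToyRelation(Nil)
      else ToyLeftSemiJoin(newLeft, newRight)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the optimized plan above: a semi join against an empty relation.
    val plan = ToyLeftSemiJoin(ToyRelation(Seq(1, 2)), ToyRelation(Nil))
    println(propagate(plan))
  }
}
```

In the plan shown in the commit message, the empty relation only appears after the subquery has been rewritten into a join, so the rewrite has to run after that step to take effect.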
gatorsmile	3884455780	[SPARK-31087] [SQL] Add Back Multiple Removed APIs ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - functions.toDegrees/toRadians - functions.approxCountDistinct - functions.monotonicallyIncreasingId - Column.!== - Dataset.explode - Dataset.registerTempTable - SQLContext.getOrCreate, setActive, clearActive, constructors Below is the other removed APIs in the original PR, but not added back in this PR [https://issues.apache.org/jira/browse/SPARK-25908]: - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove unused Python StorageLevel constants - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? Added a new test suite for these APIs. Author: gatorsmile <gatorsmile@gmail.com> Author: yi.wu <yi.wu@databricks.com> Closes #27821 from gatorsmile/addAPIBackV2.	2020-03-28 22:05:16 -07:00
Zhenhua Wang	791d2ba346	[SPARK-31261][SQL] Avoid npe when reading bad csv input with `columnNameCorruptRecord` specified ### What changes were proposed in this pull request? SPARK-25387 avoids npe for bad csv input, but when reading bad csv input with `columnNameCorruptRecord` specified, `getCurrentInput` is called and it still throws npe. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add a test. Closes #28029 from wzhfy/corrupt_column_npe. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-29 13:30:14 +09:00
Kengo Seki	0b237bd615	[SPARK-31292][CORE][SQL] Replace toSet.toSeq with distinct for readability ### What changes were proposed in this pull request? This PR replaces the method calls of `toSet.toSeq` with `distinct`. ### Why are the changes needed? `toSet.toSeq` is intended to make its elements unique but a bit verbose. Using `distinct` instead is easier to understand and improves readability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested with the existing unit tests and found no problem. Closes #28062 from sekikn/SPARK-31292. Authored-by: Kengo Seki <sekikn@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-29 08:48:08 +09:00
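A small sketch of the replacement described in SPARK-31292, using only the Scala standard library. The object and value names are made up for illustration. Both expressions remove duplicates; `distinct` additionally keeps the elements in first-occurrence order, while the order produced by `toSet.toSeq` is unspecified.

```scala
// Hypothetical example values; not taken from the Spark code base.
object DistinctSketch {
  val xs: Seq[Int] = Seq(3, 1, 3, 2, 1, 5, 4, 5)

  // Old style: round-trip through a Set; element order is unspecified.
  val viaSet: Seq[Int] = xs.toSet.toSeq

  // New style: states the intent directly and keeps first-occurrence order.
  val viaDistinct: Seq[Int] = xs.distinct
}
```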
Dongjoon Hyun	d025ddbaa7	[SPARK-31238][SPARK-31284][TEST][FOLLOWUP] Fix readResourceOrcFile to create a local file from resource ### What changes were proposed in this pull request? This PR aims to copy a test resource file to a local file in `OrcTest` suite before reading it. ### Why are the changes needed? SPARK-31238 and SPARK-31284 added test cases to access the resource file in `sql/core` module from `sql/hive` module. In the Maven test environment, this causes a failure. ``` - SPARK-31238: compatibility with Spark 2.4 in reading dates * FAILED * java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test-data/before_1582_date_v2_4.snappy.orc ``` ``` - SPARK-31284: compatibility with Spark 2.4 in reading timestamps * FAILED * java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2-hive-2.3/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test-data/before_1582_ts_v2_4.snappy.orc ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with Maven. Closes #28059 from dongjoon-hyun/SPARK-31238. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-27 18:44:53 -07:00
Wenchen Fan	c4e98c065c	[SPARK-31271][UI] fix web ui for driver side SQL metrics ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/23551, we changed the metrics type of driver-side SQL metrics to size/time etc. which comes with max/min/median info. This doesn't make sense for driver side SQL metrics as they have only one value. It makes the web UI hard to read: ![image](https://user-images.githubusercontent.com/3182036/77653892-42db9900-6fab-11ea-8e7f-92f763fa32ff.png) This PR updates the SQL metrics UI to only display max/min/median if there are more than one metrics values: ![image](https://user-images.githubusercontent.com/3182036/77653975-5f77d100-6fab-11ea-849e-64c935377c8e.png) ### Why are the changes needed? Makes the UI easier to read ### Does this PR introduce any user-facing change? no ### How was this patch tested? manual test Closes #28037 from cloud-fan/ui. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-27 15:45:35 -07:00
Liang-Chi Hsieh	aa8776bb59	[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project ### What changes were proposed in this pull request? This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it. ### Why are the changes needed? In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27517 from viirya/SPARK-29721-2. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-27 10:47:21 -07:00
Maxim Gekk	fc2a974e03	[SPARK-31284][SQL][TESTS] Check rebasing of timestamps in ORC datasource ### What changes were proposed in this pull request? In the PR, I propose 2 tests to check that rebasing of timestamps from/to the hybrid calendar (Julian + Gregorian) to/from Proleptic Gregorian calendar works correctly. 1. The test `compatibility with Spark 2.4 in reading timestamps` loads an ORC file saved by Spark 2.4.5 via: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> df.write.orc("/Users/maxim/tmp/before_1582/2_4_5_ts_orc") scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_ts_orc").show(false) +--------------------------+ \|ts \| +--------------------------+ \|1001-01-01 01:02:03.123456\| +--------------------------+ ``` 2. The test `rebasing timestamps in write` is a round-trip test. Since the previous test confirms correct rebasing of timestamps in read, this test should pass only if rebasing also works correctly in write. ### Why are the changes needed? To guarantee that rebasing works correctly for timestamps in ORC datasource. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `OrcSourceSuite` for Hive 1.2 and 2.3 via the commands: ``` $ build/sbt -Phive-2.3 "test:testOnly OrcSourceSuite" ``` and ``` $ build/sbt -Phive-1.2 "test:testOnly OrcSourceSuite" ``` Closes #28047 from MaxGekk/rebase-ts-orc-test. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-27 09:06:59 -07:00
Maxim Gekk	9f0c010a5c	[SPARK-31277][SQL][TESTS] Migrate `DateTimeTestUtils` from `TimeZone` to `ZoneId` ### What changes were proposed in this pull request? In the PR, I propose to change types of `DateTimeTestUtils` values and functions by replacing `java.util.TimeZone` to `java.time.ZoneId`. In particular: 1. Type of `ALL_TIMEZONES` is changed to `Seq[ZoneId]`. 2. Remove `val outstandingTimezones: Seq[TimeZone]`. 3. Change the type of the time zone parameter in `withDefaultTimeZone` to `ZoneId`. 4. Modify affected test suites. ### Why are the changes needed? Currently, Spark SQL's date-time expressions and functions have been already ported on Java 8 time API but tests still use old time APIs. In particular, `DateTimeTestUtils` exposes functions that accept only TimeZone instances. This is inconvenient, and CPU consuming because need to convert TimeZone instances to ZoneId instances via strings (zone ids). ### Does this PR introduce any user-facing change? No ### How was this patch tested? By affected test suites executed by jenkins builds. Closes #28033 from MaxGekk/with-default-time-zone. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 21:14:25 +08:00
Kent Yao	5945d46c11	[SPARK-31225][SQL] Override sql method of OuterReference ### What changes were proposed in this pull request? OuterReference is a LeafExpression, so its children is Nil, which makes its SQL representation always be outer(). This makes our explain-command and error msg unclear when OuterReference exists. e.g. ```scala org.apache.spark.sql.AnalysisException: Aggregate/Window/Generate expressions are not valid in where clause of the query. Expression in where clause: [(in.`value` = max(outer()))] Invalid expressions: [max(outer())];; ``` This PR overrides its `sql` method with its `prettyName` and its single argument `e`'s `sql` method ### Why are the changes needed? improve err message ### Does this PR introduce any user-facing change? yes, the err msg caused by OuterReference has changed ### How was this patch tested? modified ut results Closes #27985 from yaooqinn/SPARK-31225. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 15:21:19 +08:00
gatorsmile	b9eafcb526	[SPARK-31088][SQL] Add back HiveContext and createExternalTable ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - HiveContext - createExternalTable APIs ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? add a new test suite for createExternalTable APIs. Closes #27815 from gatorsmile/addAPIsBack. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-26 23:51:15 -07:00
gatorsmile	b7e4cc775b	[SPARK-31086][SQL] Add Back the Deprecated SQLContext methods ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - SQLContext.applySchema - SQLContext.parquetFile - SQLContext.jsonFile - SQLContext.jsonRDD - SQLContext.load - SQLContext.jdbc ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? The existing tests. Closes #27839 from gatorsmile/addAPIBackV3. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-26 23:49:24 -07:00
DB Tsai	cb0db21373	[SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL][TEST-HIVE1.2] Nested Column Predicate Pushdown for Parquet ### What changes were proposed in this pull request? 1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions. 2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet. ### Why are the changes needed? Better performance for handling nested predicate pushdown. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New tests are added. Closes #27728 from dbtsai/SPARK-17636. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 14:28:57 +08:00
Kousuke Saruta	bc37fdc771	[SPARK-31275][WEBUI] Improve the metrics format in ExecutionPage for StageId ### What changes were proposed in this pull request? In ExecutionPage, the metrics format for stageId, attemptId and taskId is currently displayed as `(stageId (attemptId): taskId)`. I changed this format to `(stageId.attemptId taskId)`. ### Why are the changes needed? As cloud-fan suggested [here](https://github.com/apache/spark/pull/27927#discussion_r398591519), `stageId.attemptId` is more standard in Spark. ### Does this PR introduce any user-facing change? Yes. Before applying this change, we can see the UI as follows. ![with-checked](https://user-images.githubusercontent.com/4736016/77682421-42a6c200-6fda-11ea-92e4-e9f4554adb71.png) And after this change is applied, we can see the UI as follows. ![fix-merics-format-with-checked](https://user-images.githubusercontent.com/4736016/77682493-61a55400-6fda-11ea-801f-91a67da698fd.png) ### How was this patch tested? Modified `SQLMetricsSuite` and manual test. Closes #28039 from sarutak/improve-metrics-format. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 13:35:28 +08:00
Terry Kim	a97d3b9f4f	[SPARK-31204][SQL] HiveResult compatibility for DatasourceV2 command ### What changes were proposed in this pull request? `HiveResult` performs some conversions for commands to be compatible with Hive output, e.g.: ``` // If it is a describe command for a Hive table, we want to have the output format be similar with Hive. case ExecutedCommandExec(_: DescribeCommandBase) => ... // SHOW TABLES in Hive only output table names, while ours output database, table name, isTemp. case command @ ExecutedCommandExec(s: ShowTablesCommand) if !s.isExtended => ``` This conversion is needed for DatasourceV2 commands as well and this PR proposes to add the conversion for v2 commands `SHOW TABLES` and `DESCRIBE TABLE`. ### Why are the changes needed? This is a bug where conversion is not applied to v2 commands. ### Does this PR introduce any user-facing change? Yes, now the outputs for v2 commands `SHOW TABLES` and `DESCRIBE TABLE` are compatible with HIVE output. For example, with a table created as: ``` CREATE TABLE testcat.ns.tbl (id bigint COMMENT 'col1') USING foo ``` The output of `SHOW TABLES` has changed from ``` ns table ``` to ``` table ``` And the output of `DESCRIBE TABLE` has changed from ``` id bigint col1 # Partitioning Not partitioned ``` to ``` id bigint col1 # Partitioning Not partitioned ``` ### How was this patch tested? Added unit tests. Closes #28004 from imback82/hive_result. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 12:48:14 +08:00
Kent Yao	8be16907c2	[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir ### What changes were proposed in this pull request? In Spark CLI, we create a hive `CliSessionState` and it does not load the `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other spark-hive integration apps. Also, the warehouse directory is not correctly picked. If the `default` database does not exist, the `CliSessionState` will create one during the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which will always be `/user/hive/warehouse`. This PR fixes the CliSuite failure with the hive-1.2 profile in https://github.com/apache/spark/pull/27933. In https://github.com/apache/spark/pull/27933, we fix the issue in JIRA by deciding the warehouse dir using all properties from spark conf and Hadoop conf, but properties from `--hiveconf` are not included; they are applied to the `CliSessionState` instance after it is initialized. When this command-line option key is `hive.metastore.warehouse.dir`, the actual warehouse dir is overridden. Because the logic in Hive for creating the non-existing default database changed, that test passed with `Hive 2.3.6` but failed with `1.2`. So in this PR, Hadoop/Hive configurations are ordered by: `spark.hive.xxx > spark.hadoop.xxx > --hiveconf xxx > hive-site.xml` through `SharedState.loadHiveConfFile` before sessionState starts ### Why are the changes needed? Bugfix for Spark SQL CLI to pick right confs ### Does this PR introduce any user-facing change? yes, 1. the non-existent default database will be created in the location specified by the users via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or the default value of `spark.sql.warehouse.dir` if none of them is specified. 2. configurations from `hive-site.xml` will not override command-line options or the properties defined with `spark.hadoop(hive).` prefix in spark conf. ### How was this patch tested? add cli ut Closes #27969 from yaooqinn/SPARK-31170-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-27 12:05:45 +08:00
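The precedence order stated in SPARK-31170 (`spark.hive.xxx > spark.hadoop.xxx > --hiveconf xxx > hive-site.xml`) can be modelled as a right-biased merge of maps. This is a minimal sketch under the assumption that each source has already been parsed into a `Map` with prefixes stripped; the object, method, and parameter names are hypothetical and do not come from Spark.

```scala
// Hypothetical model of the precedence order; not Spark's implementation.
object HiveConfPrecedenceSketch {
  def merge(
      hiveSite: Map[String, String],     // lowest priority: hive-site.xml
      hiveconf: Map[String, String],     // --hiveconf command-line options
      sparkHadoop: Map[String, String],  // spark.hadoop.* with prefix stripped
      sparkHive: Map[String, String]     // highest priority: spark.hive.*
  ): Map[String, String] =
    // With ++, entries from the right-hand map win on key collisions,
    // so sources are listed from lowest to highest priority.
    hiveSite ++ hiveconf ++ sparkHadoop ++ sparkHive
}
```

For example, a `hive.metastore.warehouse.dir` value given via `--hiveconf` wins over the one in `hive-site.xml`, but loses to one set through `spark.hive.*`.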
beliefer	9e0fee933e	[SPARK-31262][SQL][TESTS] Fix bug tests imported bracketed comments ### What changes were proposed in this pull request? This PR related to https://github.com/apache/spark/pull/27481. If test case A uses `--IMPORT` to import test case B contains bracketed comments, the output can't display bracketed comments in golden files well. The content of `nested-comments.sql` show below: ``` -- This test case just used to test imported bracketed comments. -- the first case of bracketed comment --QUERY-DELIMITER-START /* This is the first example of bracketed comment. SELECT 'ommented out content' AS first; / SELECT 'selected content' AS first; --QUERY-DELIMITER-END ``` The test case `comments.sql` imports `nested-comments.sql` below: `--IMPORT nested-comments.sql` Before this PR, the output will be: ``` -- !query / This is the first example of bracketed comment. SELECT 'ommented out content' AS first -- !query schema struct<> -- !query output org.apache.spark.sql.catalyst.parser.ParseException mismatched input '/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', ' ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == /* This is the first example of bracketed comment. 
^^^ SELECT 'ommented out content' AS first -- !query / SELECT 'selected content' AS first -- !query schema struct<> -- !query output org.apache.spark.sql.catalyst.parser.ParseException extraneous input '/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == / ^^^ SELECT 'selected content' AS first ``` After this PR, the output will be: ``` -- !query / This is the first example of bracketed comment. SELECT 'ommented out content' AS first; */ SELECT 'selected content' AS first -- !query schema struct<first:string> -- !query output selected content ``` ### Why are the changes needed? Golden files can't display the bracketed comments in imported test cases. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? New UT. Closes #28018 from beliefer/fix-bug-tests-imported-bracketed-comments. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-27 08:09:17 +09:00
Maxim Gekk	d72ec85741	[SPARK-31238][SQL] Rebase dates to/from Julian calendar in write/read for ORC datasource ### What changes were proposed in this pull request? This PR (SPARK-31238) makes the following changes. 1. Modified ORC Vectorized Reader, in particular, OrcColumnVector v1.2 and v2.3. After the changes, it uses `DateTimeUtils.rebaseJulianToGregorianDays()` added by https://github.com/apache/spark/pull/27915 . The method performs rebasing days from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. It builds a local date in the original calendar, extracts date fields `year`, `month` and `day` from the local date, and builds another local date in the target calendar. After that, it calculates days from the epoch `1970-01-01` for the resulting local date. 2. Introduced rebasing dates while saving ORC files, in particular, I modified `OrcShimUtils.getDateWritable` v1.2 and v2.3, and returned `DaysWritable` instead of Hive's `DateWritable`. The `DaysWritable` class was added by the PR https://github.com/apache/spark/pull/27890 (and fixed by https://github.com/apache/spark/pull/27962). I moved `DaysWritable` from `sql/hive` to `sql/core` to re-use it in ORC datasource. ### Why are the changes needed? For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions, and get the same result. ### Does this PR introduce any user-facing change? Yes. Before the changes, loading the date `1200-01-01` saved by Spark 2.4.5 returns the following: ```scala scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false) +----------+ \|dt \| +----------+ \|1200-01-08\| +----------+ ``` After the changes ```scala scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false) +----------+ \|dt \| +----------+ \|1200-01-01\| +----------+ ``` ### How was this patch tested? - By running `OrcSourceSuite` and `HiveOrcSourceSuite`. 
- Add new test `SPARK-31238: compatibility with Spark 2.4 in reading dates` to `OrcSuite` which reads an ORC file saved by Spark 2.4.5 via the commands: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> sql("select cast('1200-01-01' as date) dt").write.mode("overwrite").orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc") scala> spark.read.orc("/Users/maxim/tmp/before_1582/2_4_5_date_orc").show(false) +----------+ \|dt \| +----------+ \|1200-01-01\| +----------+ ``` - Add round trip test `SPARK-31238: rebasing dates in write`. The test `SPARK-31238: compatibility with Spark 2.4 in reading dates` confirms rebasing in read. So, we can check rebasing in write. Closes #28016 from MaxGekk/rebase-date-orc. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-26 13:14:28 -07:00
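The rebasing steps described in SPARK-31238 (build a local date in the hybrid calendar, extract `year`, `month` and `day`, rebuild the date in the Proleptic Gregorian calendar, count days from `1970-01-01`) can be sketched with the JDK alone. This is not Spark's `DateTimeUtils` code: the object name is hypothetical, the calendar is pinned to UTC, and only dates in the AD era are handled.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical sketch of the described steps; not the Spark implementation.
object RebaseDaysSketch {
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  def rebaseJulianToGregorianDays(days: Int): Int = {
    // java.util.GregorianCalendar is a hybrid calendar: Julian before the
    // 1582-10-15 cutover, Gregorian afterwards.
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.setTimeInMillis(days.toLong * MillisPerDay)
    // Assumption of this sketch: BC dates are out of scope.
    require(cal.get(Calendar.ERA) == GregorianCalendar.AD, "only AD dates are handled")
    // Re-interpret the same year/month/day fields in the Proleptic Gregorian
    // calendar used by java.time (Calendar months are 0-based).
    val local = LocalDate.of(
      cal.get(Calendar.YEAR),
      cal.get(Calendar.MONTH) + 1,
      cal.get(Calendar.DAY_OF_MONTH))
    local.toEpochDay.toInt
  }
}
```

In the 13th century the Julian calendar lags the Proleptic Gregorian one by 7 days, which matches the `1200-01-08` versus `1200-01-01` difference shown in the commit message; dates after the cutover pass through unchanged.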
Wenchen Fan	05498af72e	[SPARK-31201][SQL] Add an individual config for skewed partition threshold ### What changes were proposed in this pull request? Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so. Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold. This PR adds a new config for the threshold and set default value as 256 MB. ### Why are the changes needed? Avoid skew join handling that may introduce a perf regression. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27967 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 22:57:01 +09:00
yi.wu	8b798c1bc5	[SPARK-31242][SQL][TEST] mergeSparkConf in WithTestConf should also respect spark.sql.legacy.sessionInitWithConfigDefaults ### What changes were proposed in this pull request? Make `mergeSparkConf` in `WithTestConf` respects `spark.sql.legacy.sessionInitWithConfigDefaults`. ### Why are the changes needed? Without the fix, conf specified by `withSQLConf` can be reverted to original value in a cloned SparkSession. For example, you will fail test below without the fix: ``` withSQLConf(SQLConf.CODEGEN_FALLBACK.key -> "true") { val cloned = spark.cloneSession() SparkSession.setActiveSession(cloned) assert(SQLConf.get.getConf(SQLConf.CODEGEN_FALLBACK) === true) } ``` So we should fix it just as #24540 did before. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #28014 from Ngone51/sparksession_clone. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 18:52:56 +08:00
Maxim Gekk	600319dcb9	[SPARK-31254][SQL] Use the current session time zone in `HiveResult.toHiveString` ### What changes were proposed in this pull request? In the PR, I propose to define `timestampFormatter`, `dateFormatter` and `zoneId` as methods of the `HiveResult` object. This should guarantee that the formatters pick the current session time zone in `toHiveString()` ### Why are the changes needed? Currently, date/timestamp formatters in `HiveResult.toHiveString` are initialized once on instantiation of the `HiveResult` object, and pick up the session time zone. If the session's time zone is changed, the formatters still use the previous one. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? By existing test suites, in particular, by `HiveResultSuite` Closes #28024 from MaxGekk/hive-result-datetime-formatters. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 17:45:29 +08:00
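The `val` versus method distinction behind SPARK-31254 can be reproduced outside Spark. In this sketch the mutable `sessionTimeZone` field is a hypothetical stand-in for the session time zone config, and `FormatterSketch` is not the real `HiveResult` object.

```scala
import java.time.ZoneId
import java.time.format.DateTimeFormatter

// Hypothetical stand-in for an object that formats timestamps per session.
object FormatterSketch {
  // Stand-in for the session time zone setting (an assumption of this sketch).
  @volatile var sessionTimeZone: String = "UTC"

  private def build(): DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of(sessionTimeZone))

  // A val is evaluated once, when the object is initialized, so it keeps
  // whatever time zone was configured at that moment.
  val cachedFormatter: DateTimeFormatter = build()

  // A def is re-evaluated on every call, so it sees the current setting.
  def currentFormatter: DateTimeFormatter = build()
}
```

After changing `sessionTimeZone`, `cachedFormatter` keeps formatting in the old zone while `currentFormatter` follows the new one, at the cost of building a formatter on each call.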
HyukjinKwon	3bd10ce007	[SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type ### What changes were proposed in this pull request? This PR targets for non-nullable null type not to coerce to nullable type in complex types. Non-nullable fields in struct, elements in an array and entries in map can mean empty array, struct and map. They are empty so it does not need to force the nullability when we find common types. This PR also reverts and supersedes `d7b97a1d0d` ### Why are the changes needed? To make type coercion coherent and consistent. Currently, we correctly keep the nullability even between non-nullable fields: ```scala import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ spark.range(1).select(array(lit(1)).cast(ArrayType(IntegerType, false))).printSchema() spark.range(1).select(array(lit(1)).cast(ArrayType(DoubleType, false))).printSchema() ``` ```scala spark.range(1).selectExpr("concat(array(1), array(1)) as arr").printSchema() ``` ### Does this PR introduce any user-facing change? Yes. 
```scala import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ spark.range(1).select(array().cast(ArrayType(IntegerType, false))).printSchema() ``` ```scala spark.range(1).selectExpr("concat(array(), array(1)) as arr").printSchema() ``` Before: ``` org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<null> to array<int>;; 'Project [cast(array() as array<int>) AS array()#68] +- Range (0, 1, step=1, splits=Some(12)) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:149) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:330) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) ``` ``` root \|-- arr: array (nullable = false) \| \|-- element: integer (containsNull = true) ``` After: ``` root \|-- array(): array (nullable = false) \| \|-- element: integer (containsNull = false) ``` ``` root \|-- arr: array (nullable = false) \| \|-- element: integer (containsNull = false) ``` ### How was this patch tested? Unittests were added and manually tested. Closes #27991 from HyukjinKwon/SPARK-31227. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 15:42:54 +08:00
Kent Yao	44bd36ad7b	[SPARK-31234][SQL] ResetCommand should reset config to sc.conf only ### What changes were proposed in this pull request? Currently, ResetCommand clears all configurations, including sql configs, static sql configs and spark context level configs. for example: ```sql spark-sql> set xyz=abc; xyz abc spark-sql> set; spark.app.id local-1585055396930 spark.app.name SparkSQL::10.242.189.214 spark.driver.host 10.242.189.214 spark.driver.port 65094 spark.executor.id driver spark.jars spark.master local[*] spark.sql.catalogImplementation hive spark.sql.hive.version 1.2.1 spark.submit.deployMode client xyz abc spark-sql> reset; spark-sql> set; spark-sql> set spark.sql.hive.version; spark.sql.hive.version 1.2.1 spark-sql> set spark.app.id; spark.app.id <undefined> ``` In this PR, we restore spark confs to RuntimeConfig after it is cleared ### Why are the changes needed? The reset command overreaches by also clearing configs which are static. ### Does this PR introduce any user-facing change? yes, the ResetCommand does not change static configs now ### How was this patch tested? add ut Closes #28003 from yaooqinn/SPARK-31234. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 15:03:16 +08:00
Maxim Gekk	cec9604eae	[SPARK-31237][SQL][TESTS] Replace 3-letter time zones by zone offsets ### What changes were proposed in this pull request? In the PR, I propose to add a few `ZoneId` constant values to the `DateTimeTestUtils` object, and reuse the constants in tests. Proposed the following constants: - PST = -08:00 - UTC = +00:00 - CEST = +02:00 - CET = +01:00 - JST = +09:00 - MIT = -09:30 - LA = America/Los_Angeles ### Why are the changes needed? All proposed constant values (except `LA`) are initialized by zone offsets according to their definitions. This will allow to avoid: - Using of 3-letter time zones that have been already deprecated in JDK, see _Three-letter time zone IDs_ in https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html - Incorrect mapping of 3-letter time zones to zone offsets, see SPARK-31237. For example, `PST` is mapped to `America/Los_Angeles` instead of the `-08:00` zone offset. Also this should improve stability and maintainability of test suites. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running affected test suites. Closes #28001 from MaxGekk/replace-pst. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 13:36:00 +08:00
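The difference between a fixed-offset constant and a region ID, which motivates SPARK-31237, can be checked with `java.time` alone. The constant names follow the commit message, but the object `ZoneConstantsSketch` and the helper `offsetAt` are hypothetical and are not the actual `DateTimeTestUtils` code.

```scala
import java.time.{Instant, ZoneId, ZoneOffset}

// Hypothetical sketch of a few of the proposed constants.
object ZoneConstantsSketch {
  // Fixed offsets, defined by their values rather than by 3-letter names.
  val PST: ZoneId = ZoneOffset.of("-08:00")
  val MIT: ZoneId = ZoneOffset.of("-09:30")
  // A region ID, whose offset depends on daylight saving time.
  val LA: ZoneId = ZoneId.of("America/Los_Angeles")

  // Offset from UTC that a zone applies at the given instant.
  def offsetAt(zone: ZoneId, instant: Instant): ZoneOffset =
    zone.getRules.getOffset(instant)
}
```

At a summer instant such as `2020-07-01T00:00:00Z`, `LA` resolves to `-07:00` while the fixed `PST` constant stays at `-08:00`, so a test that silently maps `PST` to the region would observe a different local time.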
Wenchen Fan	4f274a4de9	[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables ### What changes were proposed in this pull request? Spark introduced the CHAR type for Hive compatibility, but it only works for Hive tables. The CHAR type has never been documented and is treated as the STRING type for non-Hive tables. However, this leads to confusing behavior. Apache Spark 3.0.0-preview2: ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 2 ``` Apache Spark 2.4.5: ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 3 ``` According to the SQL standard, `CHAR(3)` should guarantee that all values are of length 3. Because `CHAR(3)` is treated as STRING, Spark doesn't guarantee this. This PR forbids the CHAR type in non-Hive tables because it is not supported correctly. ### Why are the changes needed? To avoid confusing or wrong behavior. ### Does this PR introduce any user-facing change? Yes, users can no longer create or alter non-Hive tables with the CHAR type. ### How was this patch tested? New tests. Closes #27902 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-25 09:25:55 -07:00
samsetegne	44431d4b1a	[SPARK-30822][SQL] Remove semicolon at the end of a sql query # What changes were proposed in this pull request? This change proposes ignoring a terminating semicolon from queries submitted by the user (if included) instead of raising a parse exception. # Why are the changes needed? When a user submits a directly executable SQL statement terminated with a semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. SQL-92 describes a direct SQL statement as having the format of `<directly executable statement> <semicolon>` and the majority of SQL implementations either require the semicolon as a statement terminator, or make it optional (meaning not raising an exception when it's included, seemingly in recognition that it's a common behavior). # Does this PR introduce any user-facing change? No # How was this patch tested? Unit test added to `PlanParserSuite` ``` sbt> project catalyst sbt> testOnly *PlanParserSuite [info] - case insensitive (565 milliseconds) [info] - explain (9 milliseconds) [info] - set operations (41 milliseconds) [info] - common table expressions (31 milliseconds) [info] - simple select query (47 milliseconds) [info] - hive-style single-FROM statement (11 milliseconds) [info] - multi select query (32 milliseconds) [info] - query organization (41 milliseconds) [info] - insert into (12 milliseconds) [info] - aggregation (24 milliseconds) [info] - limit (11 milliseconds) [info] - window spec (11 milliseconds) [info] - lateral view (17 milliseconds) [info] - joins (62 milliseconds) [info] - sampled relations (11 milliseconds) [info] - sub-query (11 milliseconds) [info] - scalar sub-query (9 milliseconds) [info] - table reference (2 milliseconds) [info] - table valued function (8 milliseconds) [info] - SPARK-20311 range(N) as alias (2 milliseconds) [info] - SPARK-20841 Support table column aliases in FROM clause (3 milliseconds) [info] - SPARK-20962 Support subquery 
column aliases in FROM clause (4 milliseconds) [info] - SPARK-20963 Support aliases for join relations in FROM clause (3 milliseconds) [info] - inline table (23 milliseconds) [info] - simple select query with !> and !< (5 milliseconds) [info] - select hint syntax (34 milliseconds) [info] - SPARK-20854: select hint syntax with expressions (12 milliseconds) [info] - SPARK-20854: multiple hints (4 milliseconds) [info] - TRIM function (16 milliseconds) [info] - OVERLAY function (16 milliseconds) [info] - precedence of set operations (18 milliseconds) [info] - create/alter view as insert into table (4 milliseconds) [info] - Invalid insert constructs in the query (10 milliseconds) [info] - relation in v2 catalog (3 milliseconds) [info] - CTE with column alias (2 milliseconds) [info] - statement containing terminal semicolons (3 milliseconds) [info] ScalaTest [info] Run completed in 3 seconds, 129 milliseconds. [info] Total number of tests run: 36 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. 
[info] Passed: Total 36, Failed 0, Errors 0, Passed 36 ``` ### Current behavior: #### scala ```scala scala> val df = sql("select 1") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ scala> val df = sql("select 1;") // org.apache.spark.sql.catalyst.parser.ParseException: // extraneous input ';' expecting <EOF>(line 1, pos 8) // == SQL == // select 1; // --------^^^ // at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) // at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) // at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52) // at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) // at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) // at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) // at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) // ... 
47 elided ``` #### pyspark ```python df = spark.sql('select 1') df.show() #+---+ #\| 1\| #+---+ #\| 1\| #+---+ df = spark.sql('select 1;') # Traceback (most recent call last): # File "<stdin>", line 1, in <module> # File "/Users/ssetegne/spark/python/pyspark/sql/session.py", line 646, in sql # return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) # File "/Users/ssetegne/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in # __call__ # File "/Users/ssetegne/spark/python/pyspark/sql/utils.py", line 102, in deco # raise converted # pyspark.sql.utils.ParseException: # extraneous input ';' expecting <EOF>(line 1, pos 8) # == SQL == # select 1; # --------^^^ ``` ### Behavior after proposed fix: #### scala ```scala scala> val df = sql("select 1") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ scala> val df = sql("select 1;") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ ``` #### pyspark ```python df = spark.sql('select 1') df.show() #+---+ #\| 1 \| #+---+ #\| 1 \| #+---+ df = spark.sql('select 1;') df.show() #+---+ #\| 1 \| #+---+ #\| 1 \| #+---+ ``` Closes #27567 from samsetegne/semicolon. Authored-by: samsetegne <samuelsetegne@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-25 15:00:15 +08:00
Kousuke Saruta	999c9ed10c	[SPARK-31081][UI][SQL] Make display of stageId/stageAttemptId/taskId of sql metrics toggleable ### What changes were proposed in this pull request? This is another solution for `SPARK-31081` and #27849. I added a checkbox that toggles the display of stageId/taskId in the SQL DAG page. Mainly, I implemented the toggleable text in boxes with the HTML label feature provided by `dagre-d3`. The additional metrics are enclosed in `<span>` elements, which control the appearance of the text. The exception is the additional metrics in clusters. We can use an HTML label for a cluster, but the layout would break, so I chose a normal text label for clusters. Because of that, this solution contains some slightly tricky code in `spark-sql-viz.js` to manipulate the metric text and generate DOM elements. ### Why are the changes needed? Metrics became harder to read after #26843, and users may not be interested in the extra info (stageId/stageAttemptId/taskId) when they are not debugging. #27849 controls the appearance with a new configuration property, but providing a checkbox is more flexible. ### Does this PR introduce any user-facing change? Yes. [Additional metrics shown] ![with-checked](https://user-images.githubusercontent.com/4736016/77244214-0f6cd780-6c56-11ea-9275-a30758dd5339.png) [Additional metrics hidden] ![without-chedked](https://user-images.githubusercontent.com/4736016/77244219-14ca2200-6c56-11ea-9874-33a466085fce.png) ### How was this patch tested? Tested manually with a simple DataFrame operation. * The appearance of additional metrics in the boxes is controlled by the newly added checkbox. * No errors were found with the JS debugger. * The checked/unchecked state is preserved after reloading. Closes #27927 from sarutak/SPARK-31081. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-03-24 13:37:13 -07:00
yi.wu	f6ff7d0cf8	[SPARK-30127][SQL] Support case class parameter for typed Scala UDF ### What changes were proposed in this pull request? Support a case class parameter for typed Scala UDFs, e.g. ``` case class TestData(key: Int, value: String) val f = (d: TestData) => d.key * d.value.toInt val myUdf = udf(f) val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2") checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil) ``` ### Why are the changes needed? Currently, a Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if a user wants to apply an operation to one column and the column is of struct type: the data must be accessed through a Row object instead of a domain object, as in Dataset operations. It would be great if UDFs could work on the types that are supported by Dataset, e.g. case classes. Here is a benchmark result comparing a case class with a Row: ```scala // case class: 58ms 65ms 59ms 64ms 61ms // row: 59ms 64ms 73ms 84ms 69ms val f1 = (d: TestData) => s"${d.key}, ${d.value}" val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}" val udf1 = udf(f1) // set spark.sql.legacy.allowUntypedScalaUDF=true val udf2 = udf(f2, StringType) val df = spark.range(100000).selectExpr("cast (id as int) as id") .select(struct('id, lit("str")).as("col")) df.cache().collect() // warmup to exclude some extra influence df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save() df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save() var start = System.currentTimeMillis() df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save() println(System.currentTimeMillis() - start) start = System.currentTimeMillis() df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save() println(System.currentTimeMillis() - start) ``` ### Does this PR introduce any user-facing change? Yes. Users can now use a typed Scala UDF with a case class as the input parameter. 
### How was this patch tested? Added unit tests. Closes #27937 from Ngone51/udf_caseclass_support. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 23:03:57 +08:00
Wenchen Fan	1d0f54951e	[SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime. However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding. ### Why are the changes needed? avoid breaking changes ### Does this PR introduce any user-facing change? Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4 ### How was this patch tested? new tests. Closes #27965 from cloud-fan/string. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 12:07:22 +08:00
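The rule described for SPARK-31205 (accept a string literal only when it is a valid integer, and fail otherwise) can be sketched outside Spark as follows. This is an assumption-laden illustration: `dateAdd` is a hypothetical helper, not Spark's implementation, and it fails at call time rather than at query compilation.

```java
import java.time.LocalDate;

// Illustrative only: a hypothetical helper, not Spark's date_add.
final class DateAddSketch {
    // Accepts the day count as a string and rejects anything that is not
    // a valid integer instead of silently casting it.
    static LocalDate dateAdd(String date, String days) {
        int n;
        try {
            n = Integer.parseInt(days);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a valid integer: " + days, e);
        }
        return LocalDate.parse(date).plusDays(n);
    }

    public static void main(String[] args) {
        System.out.println(dateAdd("2011-11-11", "1")); // 2011-11-12
        try {
            dateAdd("2011-11-11", "1.5");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```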
Maxim Gekk	aa3a7429f4	[SPARK-31159][SQL][FOLLOWUP] Move checking of the `rebaseDateTime` flag out of the loop in `VectorizedColumnReader` ### What changes were proposed in this pull request? In the PR, I propose to refactor reading of timestamps of the `TIMESTAMP_MILLIS` logical type from Parquet files in `VectorizedColumnReader`, and move checking of the `rebaseDateTime` flag out of the internal loop. ### Why are the changes needed? To avoid any additional overhead of the checking the SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` introduced by the PR https://github.com/apache/spark/pull/27915. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the test suite `ParquetIOSuite`. Closes #27973 from MaxGekk/rebase-parquet-datetime-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-23 23:02:48 +09:00
LantaoJin	929b794e25	[SPARK-30494][SQL] Fix cached data leakage during replacing an existing view ### What changes were proposed in this pull request? The cached RDD for the plan "select 1" stays in memory until the session closes. This cached data cannot be used because the view temp1 has been replaced by another plan. It's a memory leak. It can be reproduced with the commands below: ``` Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("create or replace temporary view temp1 as select 1") scala> spark.sql("cache table temp1") scala> spark.sql("create or replace temporary view temp1 as select 1, 2") scala> spark.sql("cache table temp1") scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined) scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined) ``` ### Why are the changes needed? To fix the memory leak, especially in long-running mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a unit test. Closes #27185 from LantaoJin/SPARK-30494. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-22 22:22:13 -07:00
Kent Yao	88ae6c4481	[SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document ### What changes were proposed in this pull request? Fix errors and missing parts for datetime pattern document 1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own. 2. Some pattern letters are missing 3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N') 4. the second fraction pattern different logic for parsing and formatting ### Why are the changes needed? fix and improve doc ### Does this PR introduce any user-facing change? yes, new and updated doc ### How was this patch tested? pass Jenkins viewed locally with `jekyll serve` ![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png) Closes #27956 from yaooqinn/SPARK-31189. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 21:59:26 +08:00
Dongjoon Hyun	f1cc86792f	[SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases ### What changes were proposed in this pull request? A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side. The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR. ### Why are the changes needed? This makes our test case more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be one-liner with no test case changes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 12:28:57 +08:00
Takeshi Yamamuro	ca499e9409	[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution ### What changes were proposed in this pull request? This pr fixed code to respect a database name for broadcast table hint resolution. Currently, spark ignores a database name in multi-part names; ``` scala> sql("CREATE DATABASE testDb") scala> spark.range(10).write.saveAsTable("testDb.t") // without this patch scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain == Physical Plan == (2) Project [id#24L] +- (2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) : +- (1) Range (0, 10, step=1, splits=4) +- (2) Project [id#26L] +- (2) Filter isnotnull(id#26L) +- (2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint> // with this patch scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain == Physical Plan == (2) Project [id#3L] +- (2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight :- (2) Range (0, 10, step=1, splits=4) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])) +- (1) Project [id#5L] +- (1) Filter isnotnull(id#5L) +- (1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint> ``` This PR comes from https://github.com/apache/spark/pull/22198 ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #27935 from maropu/SPARK-25121-2. 
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-19 20:11:04 -07:00
Dongjoon Hyun	c6a6d5e006	Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir" This reverts commit `5bc0d76591`. Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-19 16:08:51 -07:00
Kris Mok	a1776288f4	[SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId ### What changes were proposed in this pull request? Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or SQL `EXPLAIN CODEGEN` statement. The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending. After this change, the following query: ```scala spark.range(10).agg(sum('id)).queryExecution.debug.codegen ``` will always dump the generated code in a natural, stable order. A version of this example with shorter output is: ``` spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println) (1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L]) +- (1) Range (0, 10, step=1, splits=16) (2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L]) +- Exchange SinglePartition, true, [id=#30] +- (1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L]) +- *(1) Range (0, 10, step=1, splits=16) ``` The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant. ### Why are the changes needed? Minor improvement to aid WSCG debugging. ### Does this PR introduce any user-facing change? No user-facing change for end-users; minor change for developers who debug WSCG generated code. ### How was this patch tested? Manually tested the output; all other tests still pass. Closes #27955 from rednaxelafx/codegen. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-19 20:53:01 +09:00
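The ordering change in SPARK-31187 amounts to sorting the dumped stages by their stage id before printing. A minimal sketch of that idea, assuming a hypothetical `Stage` holder (the real dump works on Spark plan nodes, which are not modelled here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: Stage is a hypothetical stand-in for one dumped subtree.
final class CodegenDumpSketch {
    static final class Stage {
        final int codegenStageId;
        final String code;

        Stage(int codegenStageId, String code) {
            this.codegenStageId = codegenStageId;
            this.code = code;
        }
    }

    // Returns a copy sorted by stage id, ascending; the input is left untouched.
    static List<Stage> sortedById(List<Stage> stages) {
        List<Stage> copy = new ArrayList<>(stages);
        copy.sort(Comparator.comparingInt((Stage s) -> s.codegenStageId));
        return copy;
    }

    public static void main(String[] args) {
        List<Stage> dump = Arrays.asList(
            new Stage(2, "/* final aggregate */"),
            new Stage(1, "/* partial aggregate */"));
        for (Stage s : sortedById(dump)) {
            System.out.println("(" + s.codegenStageId + ") " + s.code);
        }
    }
}
```

With at most a few dozen stages per query, as the message notes, the cost of the sort is negligible.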
Maxim Gekk	bb295d80e3	[SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet ### What changes were proposed in this pull request? The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to: - -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet. - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value. According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config: ``` spark.sql.legacy.parquet.rebaseDateTime.enabled ``` which is set to `false` by default which means the rebasing is not performed by default. The details of the implementation: 1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). 
After that it calculates the number of microseconds since the epoch using the resulting local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does the reverse conversion. 2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in Proleptic Gregorian calendar, interprets the resulting date as a local date in the hybrid calendar and gets the number of days since the epoch from the resulting local date. The conversion is performed via the `UTC` time zone because the conversion is independent of time zones, and `UTC` is selected to avoid rounding issues when casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` function does the reverse conversion. 3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on. 4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on. 5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`. 6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`. 7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`. ### Why are the changes needed? - For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. 
Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions. - It fixes the bug of incorrect saving/loading timestamps of the `INT96` type ### Does this PR introduce any user-facing change? Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently: ```scala scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false) +--------------------------+ \|ts \| +--------------------------+ \|1001-01-07 11:32:20.123456\| +--------------------------+ ``` After the changes: ```scala scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false) +--------------------------+ \|ts \| +--------------------------+ \|1001-01-01 01:02:03.123456\| +--------------------------+ ``` ### How was this patch tested? 1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. 
The test reads back parquet files saved by Spark 2.4.5 via: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date") scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96") ``` 2. Manually check the write code path. 
Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes): ```bash $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false) +----------+--------------------------+ \|d \|ts \| +----------+--------------------------+ \|1001-01-01\|1001-01-01 01:02:03.123456\| +----------+--------------------------+ ``` Read the saved date/timestamp by Spark 2.4.5: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false) +----------+--------------------------+ \|d \|ts \| +----------+--------------------------+ \|1001-01-01\|1001-01-01 01:02:03.123456\| +----------+--------------------------+ ``` Closes #27915 from MaxGekk/rebase-parquet-datetime. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-19 12:49:51 +08:00
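The calendar mismatch behind SPARK-31159 can be observed with the JDK alone: `java.time.LocalDate` uses the Proleptic Gregorian calendar, while `java.util.GregorianCalendar` is a hybrid that switches to Julian rules before 1582-10-15. The sketch below is not the code from `DateTimeUtils`; the class and method names are made up, and it only computes the day number of one local date in each calendar, in `UTC` as the message describes.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustrative only: names are hypothetical, not Spark's DateTimeUtils.
final class RebaseDaysSketch {
    static final long MILLIS_PER_DAY = 86_400_000L;

    // Days since 1970-01-01, reading the fields in the Proleptic Gregorian calendar.
    static long prolepticGregorianDays(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }

    // Days since 1970-01-01, reading the same fields in the hybrid
    // Julian + Gregorian calendar. UTC keeps midnight on a day boundary.
    static long hybridCalendarDays(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long gregorian = prolepticGregorianDays(1001, 1, 1);
        long hybrid = hybridCalendarDays(1001, 1, 1);
        System.out.println("Proleptic Gregorian days: " + gregorian);
        System.out.println("Hybrid calendar days:     " + hybrid);
        System.out.println("Difference in days:       " + (hybrid - gregorian));
    }
}
```

For dates on or after 1582-10-15 the two methods agree, which is why the rebasing only matters for older values.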
Burak Yavuz	4237251861	[SPARK-31178][SQL] Prevent V2 exec nodes from executing multiple times ### What changes were proposed in this pull request? This PR prevents the execution of V2 DataSource exec nodes multiple times when `collect()` is called on them. For V1 DataSources, commands would be executed as a RunnableCommand, which would cache the result as part of the `ExecutedCommandExec` node. We extend `V2CommandExec` for all the data writing commands so that they only get executed once as well. ### Why are the changes needed? Calling `collect()` on a SQL command that inserts data or creates a table gets executed multiple times otherwise. ### Does this PR introduce any user-facing change? Fixes a bug ### How was this patch tested? Unit tests Closes #27941 from brkyvz/doubleInsert. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-03-18 18:07:24 -07:00
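The run-once behaviour described for SPARK-31178 is a form of result caching: the side effect happens on the first call and later calls reuse the stored result. A minimal sketch of that pattern, assuming a hypothetical `CachedCommand` wrapper (it does not model `V2CommandExec` itself):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative only: CachedCommand is a hypothetical stand-in for a command node.
final class RunOnceSketch {
    static final class CachedCommand {
        private final Supplier<List<String>> run;
        private List<String> result; // null until the command has run

        CachedCommand(Supplier<List<String>> run) {
            this.run = run;
        }

        // Runs the side-effecting command on first use, then returns the cached rows.
        synchronized List<String> collect() {
            if (result == null) {
                result = run.get();
            }
            return result;
        }
    }

    public static void main(String[] args) {
        AtomicInteger inserts = new AtomicInteger();
        CachedCommand cmd = new CachedCommand(() -> {
            inserts.incrementAndGet(); // stands in for writing data
            return Collections.<String>emptyList();
        });
        cmd.collect();
        cmd.collect();
        System.out.println(inserts.get()); // 1
    }
}
```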
Wenchen Fan	8643e5d9c5	[SPARK-31171][SQL][FOLLOWUP] update document ### What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/27936 to update document. ### Why are the changes needed? correct document ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #27950 from cloud-fan/null. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-19 07:29:31 +09:00
Kent Yao	3d695954e5	[SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text ### What changes were proposed in this pull request? The pattern `''` means a literal `'`: ```sql select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS"); 5377-02-14 06:27:19'000000519 ``` `0946a9514f` missed this case and this PR adds it back. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a unit test. Closes #27949 from yaooqinn/SPARK-31150-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-19 07:27:06 +09:00
Kent Yao	57fcc49306	[SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern character ### What changes were proposed in this pull request? The meaning of 'u' was day number of the week in SimpleDateFormat; it was changed to year in DateTimeFormatter. We now keep the old meaning of 'u' by substituting 'u' with 'e' internally and using DateTimeFormatter to parse the pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week, e.g. ```sql select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu'); select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee'); select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee'); ``` Because of the substitution, they all silently become `.... eeee`. Users may be confused about their meanings, so we should mark 'e' and 'c' as illegal pattern characters to keep the same behavior as before. This PR moves the method `convertIncompatiblePattern` from `DatetimeUtils` to the `DateTimeFormatterHelper` object, since it is quite specific to the `DateTimeFormatterHelper` class, and adds the check for the 'e' and 'c' characters to this method. In addition, `convertIncompatiblePattern` has a bug that loses the last `'` if the pattern ends with it; this PR fixes that too, e.g. ```sql spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'"); 20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")] java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'"); NULL ``` ### Why are the changes needed? To avoid ambiguity, and to fix a bug. ### Does this PR introduce any user-facing change? No, these are not exposed yet. ### How was this patch tested? Added a unit test. Closes #27939 from yaooqinn/SPARK-31176. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-18 20:19:50 +08:00
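The change in the meaning of 'u' that motivates SPARK-31176 can be checked directly against the two JDK formatters. This is a small sketch with made-up helper names; it uses `Locale.US` and `UTC` so that the output does not depend on the machine it runs on.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustrative only: helper names are hypothetical.
final class PatternLetterSketch {
    // Formats with the legacy SimpleDateFormat, where 'u' is the day number of the week.
    static String legacyFormat(String pattern, String isoInstant) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(Date.from(Instant.parse(isoInstant)));
    }

    // Formats with DateTimeFormatter, where 'u' is the year and 'e' is the day of week.
    static String java8Format(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        LocalDate sunday = LocalDate.of(2019, 10, 6);
        System.out.println(legacyFormat("u", "2019-10-06T12:00:00Z")); // 7
        System.out.println(java8Format("uuuu", sunday));               // 2019
        System.out.println(java8Format("eeee", sunday));               // Sunday
    }
}
```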
Kent Yao	f1d27cdd91	[SPARK-31119][SQL] Add interval value support for extract expression as extract source ### What changes were proposed in this pull request? ``` <extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren> <extract source> ::= <datetime value expression> \| <interval value expression> ``` We now only support datetime values as extract source for `extract` expression but it's alternative function `date_part` supports both datetime and interval. This pr adds interval value support for `extract` expression as extract source ### Why are the changes needed? For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions. ### Does this PR introduce any user-facing change? yes, in the `extract(abc from xyz)` expression, the `xyz` can be intervals ### How was this patch tested? add unit tests Closes #27876 from yaooqinn/SPARK-31119. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-18 12:29:39 +08:00
manuzhang	4e4e08f372	[SPARK-31047][SQL] Improve file listing for ViewFileSystem ### What changes were proposed in this pull request? Use `listLocatedStatus` when `InMemoryFileIndex` is listing files from a `ViewFileSystem`, which should delegate to that of `DistributedFileSystem`. ### Why are the changes needed? When `ViewFileSystem` is used to manage several `DistributedFileSystem` instances, the change will improve the performance of file listing, especially when there are many files. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27801 from manuzhang/spark-31047. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-17 14:23:28 -07:00
Wenchen Fan	dc5ebc2d5b	[SPARK-31171][SQL] size(null) should return null under ansi mode ### What changes were proposed in this pull request? Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config. ### Why are the changes needed? In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes. However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode. ### Does this PR introduce any user-facing change? No as ANSI mode is off by default. ### How was this patch tested? new tests Closes #27936 from cloud-fan/null. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-17 11:48:54 -07:00
Kent Yao	5bc0d76591	[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir ### What changes were proposed in this pull request? In Spark CLI, we create a hive `CliSessionState` and it does not load the `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other spark-hive integration apps. Also, the warehouse directory is not correctly picked. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which will always be `/user/hive/warehouse`. ### Why are the changes needed? fix bug for Spark SQL cli to pick right confs ### Does this PR introduce any user-facing change? yes, the nonexistent default database will be created in the location specified by the users via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or the default value of `spark.sql.warehouse.dir` if neither of them is specified. ### How was this patch tested? add cli ut Closes #27933 from yaooqinn/SPARK-31170. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-17 23:03:18 +08:00
Kent Yao	0946a9514f	[SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp ### What changes were proposed in this pull request? This PR is to support parsing timestamp values with variable length second fraction parts. e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7 ```sql select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values ('2019-10-06 10:11:12.'), ('2019-10-06 10:11:12.0'), ('2019-10-06 10:11:12.1'), ('2019-10-06 10:11:12.12'), ('2019-10-06 10:11:12.123UTC'), ('2019-10-06 10:11:12.1234'), ('2019-10-06 10:11:12.12345CST'), ('2019-10-06 10:11:12.123456PST') t(v) 2019-10-06 03:11:12.123 2019-10-06 08:11:12.12345 2019-10-06 10:11:12 2019-10-06 10:11:12 2019-10-06 10:11:12.1 2019-10-06 10:11:12.12 2019-10-06 10:11:12.1234 2019-10-06 10:11:12.123456 select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') NULL ``` Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated. ### Why are the changes needed? improve timestamp parsing and more compatible with 2.4.x ### Does this PR introduce any user-facing change? no, the related changes are newly added ### How was this patch tested? add uts Closes #27906 from yaooqinn/SPARK-31150. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-17 21:53:46 +08:00
Takeshi Yamamuro	124b4ce2e6	[MINOR][SQL] Update the DataFrameWriter.bucketBy comment ### What changes were proposed in this pull request? This PR intends to update the `DataFrameWriter.bucketBy` comment for clearly describing that the bucketBy scheme follows a Spark "specific" one. I saw the questions about the current bucketing compatibility with Hive in [SPARK-31162](https://issues.apache.org/jira/browse/SPARK-31162?focusedCommentId=17060408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17060408) and [SPARK-17495](https://issues.apache.org/jira/browse/SPARK-17495?focusedCommentId=17059847&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17059847) from users and IMHO the comment is a bit confusing to users about the compatibility ### Why are the changes needed? To make users understood smoothly. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27930 from maropu/UpdateBucketByComment. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-17 00:52:45 -07:00
Wenchen Fan	30d95356f1	[SPARK-31134][SQL] optimize skew join after shuffle partitions are coalesced ### What changes were proposed in this pull request? Run the `OptimizeSkewedJoin` rule after the `CoalesceShufflePartitions` rule. ### Why are the changes needed? Remove duplicated coalescing code in `OptimizeSkewedJoin`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #27893 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-17 00:23:16 -07:00
Zhenhua Wang	1369a973cd	[SPARK-31164][SQL] Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns ### What changes were proposed in this pull request? For a bucketed table, when deciding output partitioning, if the output doesn't contain all bucket columns, the result is `UnknownPartitioning`. But when generating rdd, current Spark uses `createBucketedReadRDD` because it doesn't check if the output contains all bucket columns. So the rdd and its output partitioning are inconsistent. ### Why are the changes needed? To fix a bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Modified existing tests. Closes #27924 from wzhfy/inconsistent_rdd_partitioning. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Zhenhua Wang <wzh_zju@163.com>	2020-03-17 14:20:16 +08:00
Wenchen Fan	d7b97a1d0d	[SPARK-31166][SQL] UNION map<null, null> and other maps should not fail ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps. The reason is, `TypeCoercion` rules and `Cast` think it's illegal to cast null type map key to other types, as it makes the key nullable, but it's actually legal. This PR fixes it. ### Why are the changes needed? To avoid breaking queries. ### Does this PR introduce any user-facing change? Yes, now some queries that work in 2.x can work in 3.0 as well. ### How was this patch tested? new test Closes #27926 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-17 12:01:29 +08:00
zero323	01f20394ac	[SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions ### What changes were proposed in this pull request? - Adds following overloaded variants to Scala `o.a.s.sql.functions`: - `percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column` - `percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column` - `percentile_approx(e: Column, percentage: Double, accuracy: Long): Column` - `percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column` - `percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column` (primarily for Python interop). - `percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column` - Adds `percentile_approx` to `pyspark.sql.functions`. - Adds `percentile_approx` function to SparkR. ### Why are the changes needed? Currently we support `percentile_approx` only in SQL expression. It is inconvenient and makes this function relatively unknown. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit tests for SparkR an PySpark. As for now there are no additional tests in Scala API ‒ `ApproximatePercentile` is well tested and Python (including docstrings) and R tests provide additional tests, so it seems unnecessary. Closes #27278 from zero323/SPARK-30569. Lead-authored-by: zero323 <mszymkiewicz@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-17 10:44:21 +09:00
yi.wu	cb26f636b0	[SPARK-31163][SQL] TruncateTableCommand with acl/permission should handle non-existed path ### What changes were proposed in this pull request? This fix #26956 Wrap try-catch on `fs.getFileStatus(path)` within acl/permission in case of the path doesn't exist. ### Why are the changes needed? `truncate table` may fail to re-create path in case of interruption or something else. As a result, next time we `truncate table` on the same table with acl/permission, it will fail due to `FileNotFoundException`. And it also brings behavior change compares to previous Spark version, which could still `truncate table` successfully even if the path doesn't exist. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #27923 from Ngone51/fix_truncate. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-16 11:45:25 -07:00
HyukjinKwon	6704103499	[SPARK-31146][SQL] Leverage the helper method for aliasing in built-in SQL expressions ### What changes were proposed in this pull request? This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable. - `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default. - When the naming is different in its implementation, it has to be overwritten for the expression specifically. E.g., `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`, `FormatString` and `XPathDouble`. This PR fixes the aliases of the functions below: \| class \| alias \| \|--------------------------\|------------------\| \|`Rand` \|`random` \| \|`Ceil` \|`ceiling` \| \|`Remainder` \|`mod` \| \|`Pow` \|`pow` \| \|`Signum` \|`sign` \| \|`Chr` \|`char` \| \|`Length` \|`char_length` \| \|`Length` \|`character_length`\| \|`FormatString` \|`printf` \| \|`Substring` \|`substr` \| \|`Upper` \|`ucase` \| \|`XPathDouble` \|`xpath_number` \| \|`DayOfMonth` \|`day` \| \|`CurrentTimestamp` \|`now` \| \|`Size` \|`cardinality` \| \|`Sha1` \|`sha` \| \|`CallMethodViaReflection` \|`java_method` \| Note: `EqualTo`, `=` and `==` aliases were excluded because it's unable to leverage this helper method. It should fix the parser. Note: this PR also excludes some instances such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive` that needs an explicit name overwritten to make the scope of this PR smaller. ### Why are the changes needed? To respect expression name. ### Does this PR introduce any user-facing change? Yes, it will change the output column name. ### How was this patch tested? Manually tested, and unittests were added. Closes #27901 from HyukjinKwon/31146. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-16 11:22:34 -07:00
Tae-kyeom, Kim	e736c62764	[SPARK-31116][SQL] Fix nested schema case-sensitivity in ParquetRowConverter ### What changes were proposed in this pull request? This PR (SPARK-31116) add caseSensitive parameter to ParquetRowConverter so that it handle materialize parquet properly with respect to case sensitivity ### Why are the changes needed? From spark 3.0.0, below statement throws IllegalArgumentException in caseInsensitive mode because of explicit field index searching in ParquetRowConverter. As we already constructed parquet requested schema and catalyst requested schema during schema clipping in ParquetReadSupport, just follow these behavior. ```scala val path = "/some/temp/path" spark .range(1L) .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") .write.parquet(path) val caseInsensitiveSchema = new StructType() .add( "StructColumn", new StructType() .add("LowerCase", LongType) .add("camelcase", LongType)) spark.read.schema(caseInsensitiveSchema).parquet(path).show() ``` ### Does this PR introduce any user-facing change? No. The changes are only in unreleased branches (`master` and `branch-3.0`). ### How was this patch tested? Passed new test cases that check parquet column selection with respect to schemas and case sensitivities Closes #27888 from kimtkyeom/parquet_row_converter_case_sensitivity. Authored-by: Tae-kyeom, Kim <kimtkyeom@devsisters.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-16 10:31:56 -07:00
jiake	21c02ee5d0	[SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution ### What changes were proposed in this pull request? This PR will add the user guide for AQE and the detailed configurations about the three mainly features in AQE. ### Why are the changes needed? Add the detailed configurations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? only add doc no need ut. Closes #27616 from JkSelf/aqeuserguide. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-16 23:33:56 +08:00
LantaoJin	08bdc9c9b2	[SPARK-31068][SQL] Avoid IllegalArgumentException in broadcast exchange ### What changes were proposed in this pull request? Fix the IllegalArgumentException in broadcast exchange when numRows is over 341 million but less than 512 million. Since the maximum number of keys that `BytesToBytesMap` supports is 1 << 29, and only 70% of the slots can be used before growing in `HashedRelation`, the row-count limit here must stay below 341 million ((1 << 29) / 1.5 = 357913941) instead of 512 million. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually test. Closes #27828 from LantaoJin/SPARK-31068. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Alan Jin <jinlantao@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-15 20:20:23 -05:00
Shixiong Zhu	1ddf44dfca	[SPARK-31144][SQL] Wrap Error with QueryExecutionException to notify QueryExecutionListener ### What changes were proposed in this pull request? This PR manually reverts changes in #25292 and then wraps java.lang.Error with `QueryExecutionException` to notify `QueryExecutionListener` to send it to `QueryExecutionListener.onFailure` which only accepts `Exception`. The bug fix PR for 2.4 is #27904. It needs a separate PR because the touched codes were changed a lot. ### Why are the changes needed? Avoid API changes and fix a bug. ### Does this PR introduce any user-facing change? Yes. Reverting an API change happening in 3.0. QueryExecutionListener APIs will be the same as 2.4. ### How was this patch tested? The new added test. Closes #27907 from zsxwing/SPARK-31144. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-13 15:55:29 -07:00
Dale Clarke	2a4fed0443	[SPARK-30654][WEBUI] Bootstrap4 WebUI upgrade ### What changes were proposed in this pull request? Spark's Web UI is using an older version of Bootstrap (v. 2.3.2) for the portal pages. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs: https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889 https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700 https://snyk.io/vuln/npm:bootstrap:20180529 https://snyk.io/vuln/npm:bootstrap:20160627 I haven't validated each CVE, but it would be nice to resolve any potential issues and get on a supported release. The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the UI for functionality and appearance. This is a fairly large change so I'm sure additional testing and fixes will be needed. ### How was this patch tested? This has been manually tested, but there is a ton of functionality and there are many pages and detail pages so it is very possible bugs introduced from the upgrade were missed. Additional testing and feedback is welcomed. If it appears a whole page was missed let me know and I'll take a pass at addressing that page/section. Closes #27370 from clarkead/bootstrap4-core-upgrade. Authored-by: Dale Clarke <a.dale.clarke@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-03-13 15:24:48 -07:00
Kousuke Saruta	680981587d	[SPARK-31004][WEBUI][SS] Show message for empty Streaming Queries instead of empty timelines and histograms ### What changes were proposed in this pull request? `StreamingQueryStatisticsPage` shows a message "No visualization information available because there is no batches" instead of showing empty timelines and histograms for empty streaming queries. [Before this change applied] ![before-fix-for-empty-streaming-query](https://user-images.githubusercontent.com/4736016/75642391-b32e1d80-5c7e-11ea-9c07-e2f0f1b5b4f9.png) [After this change applied] ![after-fix-for-empty-streaming-query2](https://user-images.githubusercontent.com/4736016/75694583-1904be80-5cec-11ea-9b13-dc7078775188.png) ### Why are the changes needed? Empty charts are ugly and a little bit confusing. It's better to clearly say "No visualization information available". Also, this change fixes a JS error shown in the capture above. This error occurs because `drawTimeline` in `streaming-page.js` is called even though `formattedDate` will be `undefined` for empty streaming queries. ### Does this PR introduce any user-facing change? Yes. screen captures are shown above. ### How was this patch tested? Manually tested by creating an empty streaming query like as follows. ``` val df = spark.readStream.format("socket").options(Map("host"->"<non-existing hostname>", "port"->"...")).load df.writeStream.format("console").start ``` This streaming query will fail because of `non-existing hostname` and has no batches. Closes #27755 from sarutak/fix-for-empty-batches. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-03-13 12:58:49 -07:00
Gabor Somogyi	231e65092f	[SPARK-30874][SQL] Support Postgres Kerberos login in JDBC connector ### What changes were proposed in this pull request? When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it. This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues. In this PR I've added Postgres support (other supported databases will come in later PRs). What this PR contains: * Added `keytab` and `principal` JDBC options * Added `ConnectionProvider` trait and its implementations: * `BasicConnectionProvider` => unsecure connection * `PostgresConnectionProvider` => postgres secure connection * Added `ConnectionProvider` tests * Added `PostgresKrbIntegrationSuite` docker integration test * Created `SecurityUtils` to concentrate re-usable security related functionalities * Documentation ### Why are the changes needed? Missing JDBC kerberos support. ### Does this PR introduce any user-facing change? Yes, 2 additional JDBC options added: * keytab * principal If both are provided, Spark does kerberos authentication. ### How was this patch tested? To demonstrate the functionality with a standalone application I've created this repository: https://github.com/gaborgsomogyi/docker-kerberos * Additional + existing unit tests * Additional docker integration test * Test on cluster manually * `SKIP_API=1 jekyll build` Closes #27637 from gaborgsomogyi/SPARK-30874. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@apache.org>	2020-03-12 19:04:35 -07:00
Wenchen Fan	b27b3c91f1	[SPARK-31090][SPARK-25457] Revert "IntegralDivide returns data type of the operands" ### What changes were proposed in this pull request? This reverts commit `47d6e80a2e`. ### Why are the changes needed? There is no standard requiring that `div` must return the type of the operand, and always returning long type looks fine. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change. ### Does this PR introduce any user-facing change? Yes, change the behavior of `div` back to be the same as 2.4. ### How was this patch tested? N/A Closes #27835 from cloud-fan/revert2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-13 10:47:36 +09:00
Kent Yao	fbc9dc7e9d	[SPARK-31129][SQL][TESTS] Fix IntervalBenchmark and DateTimeBenchmark ### What changes were proposed in this pull request? This PR aims to recover `IntervalBenchmark` and `DateTimeBenchmark`, which broke because intervals were banned as output. ### Why are the changes needed? This PR recovers the benchmark suite. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually, re-run the benchmark. Closes #27885 from yaooqinn/SPARK-31111-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-12 12:59:29 -07:00
Wenchen Fan	77c49cb702	[SPARK-31124][SQL] change the default value of minPartitionNum in AQE ### What changes were proposed in this pull request? AQE has a perf regression when using the default settings: if we coalesce the shuffle partitions into one or few partitions, we may leave many CPU cores idle and the perf is worse than with AQE off (which leverages all CPU cores). Technically, this is not a bad thing. If there are many queries running at the same time, it's better to coalesce shuffle partitions into fewer partitions. However, the default settings of AQE should try to avoid any perf regression as possible as we can. This PR changes the default value of minPartitionNum when coalescing shuffle partitions, to be `SparkContext.defaultParallelism`, so that AQE can leverage all the CPU cores. ### Why are the changes needed? avoid AQE perf regression ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #27879 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-12 21:28:24 +08:00
yi.wu	feb9b9e771	[SPARK-31010][SQL][FOLLOW-UP] Give an example for typed Scala UDF in error message ### What changes were proposed in this pull request? In the error message, adding an example for typed Scala UDF. ### Why are the changes needed? Help user to know how to migrate to typed Scala UDF. ### Does this PR introduce any user-facing change? No, it's a new error message in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #27884 from Ngone51/spark_31010_followup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-12 21:16:02 +09:00
Maxim Gekk	3d3e366aa8	[SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time ### What changes were proposed in this pull request? In the PR, I propose to change conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before cutover day `1582-10-15` of Gregorian calendar. I propose to construct local date-time from microseconds/days since the epoch. Take each date-time component `year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`, and construct java.sql.Timestamp/Date using the extracted components. ### Why are the changes needed? This will rebase underlying time/date offset in the way that collected java.sql.Timestamp/Date values will have the same local time-date component as the original values in Gregorian calendar. Here is the example which demonstrates the issue: ```sql scala> sql("select date '1100-10-10'").collect() res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) ``` ### Does this PR introduce any user-facing change? Yes, after the changes: ```sql scala> sql("select date '1100-10-10'").collect() res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10]) ``` ### How was this patch tested? By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #27807 from MaxGekk/rebase-timestamp-before-1582. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 20:53:56 +08:00
Kent Yao	2b46662bd0	[SPARK-31111][SQL][TESTS] Fix interval output issue in ExtractBenchmark ### What changes were proposed in this pull request? fix the error caused by interval output in ExtractBenchmark ### Why are the changes needed? fix a bug in the test ```scala [info] Running case: cast to interval [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot use interval type in the table schema.;; [error] OverwriteByExpression RelationV2[] noop-table, true, true [error] +- Project [(subtractdates(cast(cast(id#0L as timestamp) as date), -719162) + subtracttimestamps(cast(id#0L as timestamp), -30610249419876544)) AS ((CAST(CAST(id AS TIMESTAMP) AS DATE) - DATE '0001-01-01') + (CAST(id AS TIMESTAMP) - TIMESTAMP '1000-01-01 01:02:03.123456'))#2] [error] +- Range (1262304000, 1272304000, step=1, splits=Some(1)) [error] [error] at org.apache.spark.sql.catalyst.util.TypeUtils$.failWithIntervalType(TypeUtils.scala:106) [error] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$25(CheckAnalysis.scala:389) [error] at org.a ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? re-run benchmark Closes #27867 from yaooqinn/SPARK-31111. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 20:13:59 +08:00
Liang-Chi Hsieh	15557a7d05	[SPARK-31071][SQL] Allow annotating non-null fields when encoding Java Beans ### What changes were proposed in this pull request? When encoding Java Beans to Spark DataFrame, respecting `javax.annotation.Nonnull` and producing non-null fields. ### Why are the changes needed? When encoding Java Beans to Spark DataFrame, non-primitive types are encoded as nullable fields. Although It works for most cases, it can be an issue under a few situations, e.g. the one described in the JIRA ticket when saving DataFrame to Avro format with non-null field. We should allow Spark users more flexibility when creating Spark DataFrame from Java Beans. Currently, Spark users cannot create DataFrame with non-nullable fields in the schema from beans with non-nullable properties. Although it is possible to project top-level columns with SQL expressions like `AssertNotNull` to make it non-null, for nested fields it is more tricky to do it similarly. ### Does this PR introduce any user-facing change? Yes. After this change, Spark users can use `javax.annotation.Nonnull` to annotate non-null fields in Java Beans when encoding beans to Spark DataFrame. ### How was this patch tested? Added unit test. Closes #27851 from viirya/SPARK-31071. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 18:27:48 +08:00
Yuanjian Li	3493162c78	[SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime ### What changes were proposed in this pull request? In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian). Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology ). The switching job is completed in SPARK-26651. But after the switching, there are some patterns not compatible between Java 8 and Java 7, Spark needs its own definition on the patterns rather than depends on Java API. In this PR, we achieve this by writing the document and shadow the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030) ### Why are the changes needed? For backward compatibility. ### Does this PR introduce any user-facing change? No. After we define our own datetime parsing and formatting patterns, it's same to old Spark version. ### How was this patch tested? Existing and new added UT. Locally document test: ![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png) Closes #27830 from xuanyuanking/SPARK-31030. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 14:11:13 +08:00
Wenchen Fan	d5f5720efa	[SPARK-31070][SQL] make skew join split skewed partitions more evenly ### What changes were proposed in this pull request? There are two problems when splitting skewed partitions: 1. It's possible that we can't split the skewed partition, in which case we shouldn't create a skew join. 2. When splitting, it's possible that we create a partition for a very small amount of data. This PR fixes them: 1. don't create `PartialReducerPartitionSpec` if we can't split. 2. merge small partitions into the previous partition. ### Why are the changes needed? make skew join split skewed partitions more evenly ### Does this PR introduce any user-facing change? no ### How was this patch tested? updated test Closes #27833 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-10 21:50:44 -07:00
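The two fixes in SPARK-31070 can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `splitSizes`, `targetSize`, and `smallFactor` are hypothetical names, and only a small trailing group is merged here.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the Spark implementation.
object SkewSplitSketch {
  // Greedily packs consecutive map-output sizes into groups of at most
  // targetSize bytes. Returns None when no split is possible (one group only).
  def splitSizes(
      sizes: Seq[Long],
      targetSize: Long,
      smallFactor: Double = 0.2): Option[List[Vector[Long]]] = {
    val groups = ArrayBuffer.empty[Vector[Long]]
    var current = Vector.empty[Long]
    for (size <- sizes) {
      // Close the current group when the next size would overflow the target.
      if (current.nonEmpty && current.sum + size > targetSize) {
        groups += current
        current = Vector.empty[Long]
      }
      current = current :+ size
    }
    if (current.nonEmpty) groups += current

    // Fix 2: merge a very small trailing group into the previous group.
    if (groups.length > 1 && groups.last.sum < targetSize * smallFactor) {
      val tail = groups.remove(groups.length - 1)
      groups(groups.length - 1) = groups.last ++ tail
    }

    // Fix 1: report "cannot split" instead of producing a single-group split.
    if (groups.length > 1) Some(groups.toList) else None
  }
}
```

With a target of 100, the sizes 90, 95, 10 end up as two groups (90) and (95, 10) rather than three, and the sizes 30, 40 are reported as not splittable.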
yi.wu	34be83e08b	[SPARK-31037][SQL][FOLLOW-UP] Replace legacy ReduceNumShufflePartitions with CoalesceShufflePartitions in comment ### What changes were proposed in this pull request? Replace legacy `ReduceNumShufflePartitions` with `CoalesceShufflePartitions` in comment. ### Why are the changes needed? Rule `ReduceNumShufflePartitions` has been renamed to `CoalesceShufflePartitions`, so we should update the related comment as well. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A. Closes #27865 from Ngone51/spark_31037_followup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-10 11:09:36 -07:00
Kent Yao	3bd6ebff81	[SPARK-30189][SQL] Interval from year-month/date-time string should handle whitespaces ### What changes were proposed in this pull request? Currently, we parse intervals from multi-unit strings or from date-time/year-month pattern strings; the former handles all whitespace, while the latter handles none, not even spaces. ### Why are the changes needed? behavior consistency ### Does this PR introduce any user-facing change? yes, interval in date-time/year-month like ``` select interval '\n-\t10\t 12:34:46.789\t' day to second -- !query 126 schema struct<INTERVAL '-10 days -12 hours -34 minutes -46.789 seconds':interval> -- !query 126 output -10 days -12 hours -34 minutes -46.789 seconds ``` is valid now. ### How was this patch tested? add ut. Closes #26815 from yaooqinn/SPARK-30189. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-10 22:08:58 +08:00
Terry Kim	294f6056fa	[SPARK-31078][SQL] Respect aliases in output ordering ### What changes were proposed in this pull request? Currently, in the following scenario, an unnecessary `Sort` node is introduced: ```scala withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") { val df = (0 until 20).toDF("i").as("df") df.repartition(8, df("i")).write.format("parquet") .bucketBy(8, "i").sortBy("i").saveAsTable("t") val t1 = spark.table("t") val t2 = t1.selectExpr("i as ii") t1.join(t2, t1("i") === t2("ii")).explain } ``` ``` == Physical Plan == (3) SortMergeJoin [i#8], [ii#10], Inner :- (1) Project [i#8] : +- (1) Filter isnotnull(i#8) : +- (1) ColumnarToRow : +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8 +- (2) Sort [ii#10 ASC NULLS FIRST], false, 0 <==== UNNECESSARY +- (2) Project [i#8 AS ii#10] +- (2) Filter isnotnull(i#8) +- (2) ColumnarToRow +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8 ``` Notice that `Sort [ii#10 ASC NULLS FIRST], false, 0` is introduced even though the underlying data is already sorted. This is because `outputOrdering` doesn't handle aliases correctly. This PR proposes to fix this issue. ### Why are the changes needed? To better handle aliases in `outputOrdering`. ### Does this PR introduce any user-facing change? 
Yes, now with the fix, the `explain` prints out the following: ``` == Physical Plan == (3) SortMergeJoin [i#8], [ii#10], Inner :- (1) Project [i#8] : +- (1) Filter isnotnull(i#8) : +- (1) ColumnarToRow : +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8 +- (2) Project [i#8 AS ii#10] +- (2) Filter isnotnull(i#8) +- *(2) ColumnarToRow +- FileScan parquet default.t[i#8] Batched: true, DataFilters: [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int>, SelectedBucketsCount: 8 out of 8 ``` ### How was this patch tested? Tests added. Closes #27842 from imback82/alias_aware_sort_order. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-10 20:15:48 +08:00
HyukjinKwon	815c7929c2	[SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source ### What changes were proposed in this pull request? This PR proposes two things: 1. Convert `null` to `string` type during schema inference of `schema_of_json` as JSON datasource does. This is a bug fix as well because `null` string is not the proper DDL formatted string and it is unable for SQL parser to recognise it as a type string. We should match it to JSON datasource and return a string type so `schema_of_json` returns a proper DDL formatted string. 2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference. ### Why are the changes needed? To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option. ### Does this PR introduce any user-facing change? Yes, it does. ```scala import collection.JavaConverters._ import org.apache.spark.sql.functions._ spark.range(1).select(schema_of_json(lit("""{"id": ""}"""))).show() spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false) ``` Before: ``` struct<id:null> struct<drop:struct<drop:null>,id:string> ``` After: ``` struct<id:string> struct<id:string> ``` ### How was this patch tested? Manually tested, and unittests were added. Closes #27854 from HyukjinKwon/SPARK-31065. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-10 00:33:32 -07:00
maryannxue	de6d9e4307	[SPARK-31096][SQL] Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec` ### What changes were proposed in this pull request? This PR changes the type of `CustomShuffleReaderExec`'s `partitionSpecs` from `Array` to `Seq`, since `Array` compares references not values for equality, which could lead to potential plan reuse problem. ### Why are the changes needed? Unlike `Seq`, `Array` compares references not values for equality, which could lead to potential plan reuse problem. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passes existing UTs. Closes #27857 from maryannxue/aqe-customreader-fix. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-10 14:15:44 +08:00
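The equality issue behind SPARK-31096 can be reproduced in plain Scala. In this minimal sketch the two case classes are hypothetical stand-ins for a plan node that carries partition specs; they are not Spark classes.

```scala
// Hypothetical stand-ins for a plan node, for illustration only.
object EqualitySketch {
  final case class WithArray(specs: Array[Int])
  final case class WithSeq(specs: Seq[Int])

  // Case-class equality delegates to the fields: Array uses reference
  // equality, Seq compares element by element.
  val arraysEqual: Boolean = WithArray(Array(1, 2)) == WithArray(Array(1, 2))
  val seqsEqual: Boolean = WithSeq(Seq(1, 2)) == WithSeq(Seq(1, 2))
}
```

Here `arraysEqual` is `false` and `seqsEqual` is `true`, which is why two otherwise identical nodes holding an `Array` would not be recognised as the same plan.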
Yuchen Huo	a22994333a	[SPARK-30902][SQL][FOLLOW-UP] Allow ReplaceTableAsStatement to have none provider ### What changes were proposed in this pull request? This is a follow-up to https://github.com/apache/spark/pull/27650, which allowed a None provider for create table. Here we do the same thing for ReplaceTable. ### Why are the changes needed? Although the ASTBuilder currently doesn't seem to allow `replace` without a `USING` clause, this would allow `DataFrameWriterV2` to use the statements instead of commands directly. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #27838 from yuchenhuo/SPARK-30902. Authored-by: Yuchen Huo <yuchen.huo@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-10 11:37:31 +08:00
yi.wu	ef51ff9dc8	[SPARK-31082][CORE] MapOutputTrackerMaster.getMapLocation should handle last mapIndex correctly ### What changes were proposed in this pull request? In `getMapLocation`, change the condition from `...endMapIndex < statuses.length` to `...endMapIndex <= statuses.length`. ### Why are the changes needed? `endMapIndex` is exclusive, so it must be allowed to equal `statuses.length`. Otherwise, we can't get the location for the last mapIndex. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated an existing test. Closes #27850 from Ngone51/fix_getmaploction. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-09 15:53:34 +08:00
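The off-by-one in SPARK-31082 comes down to how an exclusive upper bound is compared with the array length. The sketch below shows only that comparison; the elided part of the real condition is not reproduced, and the start-index checks and all names are assumptions made for the example.

```scala
// Hypothetical sketch of the bound check, not the Spark condition itself.
object MapRangeSketch {
  // Old behaviour: rejects a range that ends exactly at the array length.
  def oldCheck(start: Int, endExclusive: Int, length: Int): Boolean =
    start >= 0 && start < endExclusive && endExclusive < length

  // New behaviour: an exclusive end may be equal to the array length.
  def newCheck(start: Int, endExclusive: Int, length: Int): Boolean =
    start >= 0 && start < endExclusive && endExclusive <= length
}
```

For three map statuses, the range `[2, 3)` covers the last map index: the old check rejects it, the new check accepts it.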
DB Tsai	7911f95202	[SPARK-31064][SQL] New Parquet Predicate Filter APIs with multi-part Identifier Support ### What changes were proposed in this pull request? Parquet's org.apache.parquet.filter2.predicate.FilterApi uses `dots` as separators to split a column name into the multiple parts of a nested field. The drawback is that this causes issues when a field name itself contains `dots`. The new APIs take an array of strings directly for the parts of a nested field, so there is no ambiguity from using `dots` as separators. ### Why are the changes needed? To support nested predicate pushdown and predicate pushdown for columns containing `dots`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #27824 from dbtsai/SPARK-31064. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-03-06 21:09:24 +00:00
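The ambiguity described in SPARK-31064 can be shown without Parquet at all. This is a minimal sketch; `splitOnDots` is a hypothetical helper and does not correspond to any Parquet or Spark API.

```scala
// Hypothetical helper, for illustration only.
object NestedNameSketch {
  // Dot-splitting cannot tell a nested field `a.b` apart from a single
  // top-level column whose name happens to be "a.b".
  def splitOnDots(name: String): List[String] = name.split('.').toList

  // Explicit multi-part identifiers keep the two cases distinct.
  val nestedField: List[String] = List("a", "b")
  val dottedColumn: List[String] = List("a.b")
}
```

Both interpretations collapse to the same parts after dot-splitting, whereas passing the parts as a sequence keeps `nestedField` and `dottedColumn` different.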
Takeshi Yamamuro	71c73d58f6	[SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID ### What changes were proposed in this pull request? This pr intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur to compute grouping IDs; `e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)` For example, the query below generates wrong grouping IDs in the master; ``` scala> val numCols = 32 // or, 31 scala> val cols = (0 until numCols).map { i => s"c$i" } scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet") scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",") scala> sql(s"insert into test_$numCols values ($insertVals,3)") scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false) scala> sql(s"drop table test_$numCols") // numCols = 32 +-------------+------+ \|grouping_id()\|sum(v)\| +-------------+------+ \|0 \|3 \| \|0 \|3 \| // Wrong Grouping ID +-------------+------+ // numCols = 31 +-------------+------+ \|grouping_id()\|sum(v)\| +-------------+------+ \|0 \|3 \| \|1 \|3 \| +-------------+------+ ``` To fix this issue, this pr change code to use long values for `GROUPING_ID` instead of int values. ### Why are the changes needed? To support more cases in `GROUPING_ID`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #26918 from maropu/FixGroupingIdIssue. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-06 16:57:03 +09:00
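The general mechanism behind SPARK-30279 is that a 32-bit bitmask cannot hold one bit per grouping column once there are 32 or more columns. The sketch below assumes a simple bit layout (the first column maps to the highest bit, and a bit is set when the column is not in the grouping set); this layout and all names are assumptions for illustration, not the code in `basicLogicalOperators.scala`.

```scala
// Hypothetical sketch of an int vs long bitmask, not the Spark code.
object GroupingIdSketch {
  // Int version: shift counts are taken modulo 32, so high bits wrap around.
  def idAsInt(numCols: Int, grouped: Set[Int]): Int =
    (0 until numCols).foldLeft(0) { (acc, i) =>
      if (grouped(i)) acc else acc | (1 << (numCols - 1 - i))
    }

  // Long version: has room for up to 64 grouping columns.
  def idAsLong(numCols: Int, grouped: Set[Int]): Long =
    (0 until numCols).foldLeft(0L) { (acc, i) =>
      if (grouped(i)) acc else acc | (1L << (numCols - 1 - i))
    }
}
```

With 33 columns, dropping the first column and dropping the last column yield the same `Int` ID, because `1 << 32` wraps to `1`; the `Long` IDs stay distinct.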
Gengliang Wang	1426ad8968	[SPARK-23817][FOLLOWUP][TEST] Add OrcV2QuerySuite ### What changes were proposed in this pull request? Add `OrcV2QuerySuite` which explicitly sets the configuration `USE_V1_SOURCE_LIST` as `""` to use ORC V2 implementation. ### Why are the changes needed? As now file source V2 is disabled by default, the test suite `OrcQuerySuite` is testing V1 implementation as well as the `OrcV1QuerySuite`. We should fix it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27816 from gengliangwang/orcQuerySuite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-03-05 21:22:32 -08:00
yi.wu	587266f887	[SPARK-31010][SQL][FOLLOW-UP] Deprecate untyped scala UDF ### What changes were proposed in this pull request? Use the Scala `deprecated` annotation to deprecate the untyped Scala UDF. ### Why are the changes needed? After #27488, it's weird that the untyped Scala UDF fails by default without any deprecation. ### Does this PR introduce any user-facing change? Yes, users will see the warning: ``` <console>:26: warning: method udf in object functions is deprecated (since 3.0.0): Untyped Scala UDF API is deprecated, please use typed Scala UDF API such as 'def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction' instead. val myudf = udf(() => Math.random(), DoubleType) ^ ``` ### How was this patch tested? Tested manually. Closes #27794 from Ngone51/deprecate_untyped_scala_udf. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-06 13:00:04 +08:00
Maxim Gekk	59f1e76b82	[SPARK-31020][SPARK-31023][SPARK-31025][SPARK-31044][SQL] Support foldable args by `from_csv/json` and `schema_of_csv/json` ### What changes were proposed in this pull request? In the PR, I propose: 1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` to checking foldable property of the `schema` expression. 2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` to checking foldable property of the `schema` expression. 3. To change checking of the input parameter in the `SchemaOfCsv` expression, and allow foldable `child` expression. 4. To change checking of the input parameter in the `SchemaOfJson` expression, and allow foldable `child` expression. ### Why are the changes needed? This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals: ```sql spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 ``` and only string literals are acceptable as CSV examples by `schema_of_csv`/`schema_of_json`: ```sql spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1)); Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7; 'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 
as string))), None)] +- OneRowRelation spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '')); Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7; 'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)] +- OneRowRelation ``` ### Does this PR introduce any user-facing change? Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above: ```sql spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); {"id":1,"city":"Moscow"} spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '')); {"id":1,"city":"Moscow"} ``` After change the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example: ```sql spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1)); struct<_c0:double,_c1:int> spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '')); struct<id:bigint,price:double> ``` ### How was this patch tested? Added new test to `CsvFunctionsSuite` and to `JsonFunctionsSuite`. Closes #27804 from MaxGekk/foldable-arg-csv-json-func. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-06 12:29:35 +08:00
maryannxue	d705d36c0c	[SPARK-31045][SQL] Add config for AQE logging level ### What changes were proposed in this pull request? This PR adds an internal config for changing the logging level of adaptive execution query plan evolvement. ### Why are the changes needed? To make AQE debugging easier. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #27798 from maryannxue/aqe-log-level. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-06 11:41:45 +08:00
Maxim Gekk	cf7c397ede	[MINOR][SQL] Remove an ignored test from JsonSuite ### What changes were proposed in this pull request? Remove the ignored and outdated test `Type conflict in primitive field values (Ignored)` from JsonSuite. ### Why are the changes needed? The test has not been maintained for a long time. It can be removed to reduce the size of JsonSuite and improve maintainability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the command `./build/sbt "test:testOnly *JsonV2Suite"` Closes #27795 from MaxGekk/remove-ignored-test-in-JsonSuite. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-06 10:35:44 +09:00
Peter Toth	72b52a3cdf	[SPARK-30563][SQL] Disable using commit coordinator with NoopDataSource ### What changes were proposed in this pull request? This PR disables using commit coordinator with `NoopDataSource`. ### Why are the changes needed? No need for a coordinator in benchmarks. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #27791 from peter-toth/SPARK-30563-disalbe-commit-coordinator. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-06 10:30:59 +09:00
DB Tsai	fe126a6a05	[SPARK-31058][SQL][TEST-HIVE1.2] Consolidate the implementation of `quoteIfNeeded` ### What changes were proposed in this pull request? There are two implementations of quoteIfNeeded. One is in `org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote` and the other is in `OrcFiltersBase.quoteAttributeNameIfNeeded`. This PR consolidates them into one. ### Why are the changes needed? Simplify the codebase. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27814 from dbtsai/SPARK-31058. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-03-06 00:13:57 +00:00
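A helper of the kind consolidated in SPARK-31058 might look like the sketch below. The quoting rule used here (wrap in backticks when the name contains a dot or a backtick, and double any embedded backtick) is an assumption for illustration; the commit message does not spell out the rule.

```scala
// Hypothetical sketch; the quoting rule is an assumption, not taken from the PR.
object QuoteSketch {
  def quoteIfNeeded(part: String): String =
    if (part.contains(".") || part.contains("`")) {
      // Escape embedded backticks by doubling them, then wrap the name.
      "`" + part.replace("`", "``") + "`"
    } else {
      part
    }
}
```

Under this rule a plain name such as `id` is returned unchanged, while a name containing a dot is wrapped so that it is read as a single identifier part.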
Wenchen Fan	ba86524b25	[SPARK-31037][SQL] refine AQE config names ### What changes were proposed in this pull request? When introducing AQE to others, I feel the config names are a bit incoherent and hard to use. This PR refines the config names: 1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere. 2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`. 3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions` 4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace 5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum` 6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin` ### Why are the changes needed? Make the config names easy to understand. ### Does this PR introduce any user-facing change? deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize` ### How was this patch tested? N/A Closes #27793 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-06 00:46:34 +08:00
Wenchen Fan	807ea413b4	[SPARK-31019][SQL] make it clear that people can deduplicate map keys ### What changes were proposed in this pull request? rename the config and make it non-internal. ### Why are the changes needed? Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate it. However, we must provide a way to get users out of this situation, instead of just rejecting to run the query. This exit strategy should always be there, while legacy config indicates that it may be removed someday. ### Does this PR introduce any user-facing change? no, just rename a config which was added in 3.0 ### How was this patch tested? add more tests for the fail behavior. Closes #27772 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-05 20:43:52 +09:00
Kent Yao	f45ae7f2c5	[SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone ### What changes were proposed in this pull request? The `spark.sql.session.timeZone` config can accept any string value including invalid time zone ids, then it will fail other queries that rely on the time zone. We should do the value checking in the set phase and fail fast if the zone value is invalid. ### Why are the changes needed? improve configuration ### Does this PR introduce any user-facing change? yes, will fail fast if the value is a wrong timezone id ### How was this patch tested? add ut Closes #27792 from yaooqinn/SPARK-31038. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-05 19:38:20 +08:00
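The fail-fast idea in SPARK-31038 can be sketched with `java.time.ZoneId`. This is not the validation Spark performs; `isValidZoneId`, `setTimeZone`, and the error text are hypothetical, and Spark may accept zone formats that `ZoneId.of` does not.

```scala
import java.time.ZoneId
import scala.util.Try

// Hypothetical sketch of validating a time zone id when it is set.
object TimeZoneCheckSketch {
  // ZoneId.of throws a DateTimeException for malformed or unknown ids.
  def isValidZoneId(id: String): Boolean = Try(ZoneId.of(id)).isSuccess

  // Rejects the value at set time instead of failing in a later query.
  def setTimeZone(id: String): String = {
    require(isValidZoneId(id), s"Cannot resolve the given timezone id: $id")
    id
  }
}
```

Checking the value when the config is set means an invalid id is reported immediately rather than by an unrelated query that later depends on the session time zone.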
maryannxue	9b602e26d2	[SPARK-31046][SQL] Make more efficient and clean up AQE update UI code ### What changes were proposed in this pull request? This PR avoids sending redundant metrics (those that have been included in previous update) as well as useless metrics (those in future stages) to Spark UI in AQE UI metrics update. ### Why are the changes needed? This change will make UI metrics update more efficient. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual test in Spark UI. Closes #27799 from maryannxue/aqe-ui-cleanup. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-05 18:53:01 +08:00
Terry Kim	66b4fd040e	[SPARK-31024][SQL] Allow specifying session catalog name `spark_catalog` in qualified column names for v1 tables ### What changes were proposed in this pull request? Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables: ``` SELECT spark_catalog.default.t.i FROM spark_catalog.default.t ``` fails with `cannot resolve 'spark_catalog.default.t.i`. This is inconsistent with v2 table behavior where catalog name can be used: ``` SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id ``` This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables. ### Why are the changes needed? Fixing an inconsistent behavior. ### Does this PR introduce any user-facing change? Yes, now the following query works: ``` SELECT spark_catalog.default.t.i FROM spark_catalog.default.t ``` ### How was this patch tested? Added new tests. Closes #27776 from imback82/spark_catalog_col_name_resolution. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-05 18:33:59 +08:00
Yuanjian Li	7db0af5785	[SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter ### What changes were proposed in this pull request? This is a follow-up to #27441. For the cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of silently changing the result. The legacy config will be referenced in the error message. ### Why are the changes needed? Avoid silent result changes for the new behavior in 3.0. ### Does this PR introduce any user-facing change? Yes, an exception is thrown when we detect that the legacy formatter can parse the string and the new formatter returns null. ### How was this patch tested? Extend existing UT. Closes #27537 from xuanyuanking/SPARK-30668-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-05 15:29:39 +08:00
DB Tsai	3c16fae5c1	[SPARK-31027][SQL] Refactor DataSourceStrategy to be more extendable ### What changes were proposed in this pull request? Refactor `DataSourceStrategy.scala` and `DataSourceStrategySuite.scala` so it's more extendable to implement nested predicate pushdown. ### Why are the changes needed? To support nested predicate pushdown. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests and new tests. Closes #27778 from dbtsai/SPARK-31027. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-04 23:41:49 +09:00
Terry Kim	b30278107f	[SPARK-30885][SQL][FOLLOW-UP] Fix issues where some V1 commands allow tables that are not fully qualified ### What changes were proposed in this pull request? There are a few V1 commands such as `REFRESH TABLE` that still allow `spark_catalog.t` because they run the commands with parsed table names without trying to load them in the catalog. This PR addresses this issue. The PR also addresses the issue brought up in https://github.com/apache/spark/pull/27642#discussion_r382402104. ### Why are the changes needed? To fix a bug where `spark_catalog.t` is allowed for some V1 commands. ### Does this PR introduce any user-facing change? Yes, a bug is fixed and `REFRESH TABLE spark_catalog.t` is not allowed. ### How was this patch tested? Added new test. Closes #27718 from imback82/fix_TempViewOrV1Table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-04 18:09:48 +08:00
Wenchen Fan	e4c61e35da	[SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser ### What changes were proposed in this pull request? Before Spark 3.0, the JSON/CSV parser has a special behavior that, when the parser fails to parse a timestamp/date, fallback to another way to parse it, to support some legacy format. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243. This PR adds back this legacy fallback. Since we switch the API to do datetime operations, we can't be exactly the same as before. Here we add back the support of the legacy formats that are common (examples of Spark 2.4): 1. the fields can have one or two letters ``` scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false) +-------------------------------------------+ \|jsontostructs({"time":"1123-2-22 2:22:22"})\| +-------------------------------------------+ \|[1123-02-22 02:22:22] \| +-------------------------------------------+ ``` 2. the separator between data and time can be "T" as well ``` scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false) +---------------------------------------------+ \|jsontostructs({"time":"2000-12-12T12:12:12"})\| +---------------------------------------------+ \|[2000-12-12 12:12:12] \| +---------------------------------------------+ ``` 3. the second fraction can be arbitrary length ``` scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false) +----------------------------------------------------------+ \|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})\| +----------------------------------------------------------+ \|[1123-02-15 02:22:22.123] \| +----------------------------------------------------------+ ``` 4. 
date string can end up with any chars after "T" or space ``` scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false) +----------------------------------------+ \|jsontostructs({"time":"1123-02-22Tabc"})\| +----------------------------------------+ \|[1123-02-22] \| +----------------------------------------+ ``` 5. remove "GMT" from the string before parsing ``` scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false) +--------------------------------------------------+ \|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})\| +--------------------------------------------------+ \|[1123-02-22 02:22:22.123] \| +--------------------------------------------------+ ``` ### Why are the changes needed? It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed. ### Does this PR introduce any user-facing change? yes, to make 3.0 support parsing most of the csv/json values that were supported before. ### How was this patch tested? new tests Closes #27710 from cloud-fan/bug2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-04 18:27:44 +09:00
Shixiong Zhu	ebfff7af6a	[SPARK-30984][SS] Add UI test for Structured Streaming UI ### What changes were proposed in this pull request? - Add a UI test for Structured Streaming UI - Fix the unsafe usages of `SimpleDateFormat` by using a ThreadLocal shared object. - Use `start` to replace `submission` to be consistent with the API `StreamingQuery.start()`. ### Why are the changes needed? Structured Streaming UI is missing UI tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? The new test. Closes #27732 from zsxwing/ss-ui-test. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-04 13:55:34 +08:00
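The `SimpleDateFormat` fix mentioned in SPARK-30984 follows a common pattern: `SimpleDateFormat` is not thread-safe, so each thread gets its own instance through a `ThreadLocal`. The sketch below illustrates the pattern only; the object name, pattern string, and time zone are assumptions, not the values used in the Structured Streaming UI.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

// Hypothetical sketch of a per-thread SimpleDateFormat, not the Spark code.
object ThreadSafeFormatSketch {
  private val format = new ThreadLocal[SimpleDateFormat] {
    // Called once per thread, on that thread's first get().
    override def initialValue(): SimpleDateFormat = {
      val f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US)
      f.setTimeZone(TimeZone.getTimeZone("UTC"))
      f
    }
  }

  def formatMillis(epochMillis: Long): String =
    format.get().format(new Date(epochMillis))
}
```

Each thread reuses its own formatter, so concurrent calls to `formatMillis` never share mutable formatter state.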
yi.wu	380e887631	[SPARK-30999][SQL] Don't cancel a QueryStageExec which failed before call doMaterialize ### What changes were proposed in this pull request? This PR proposes to not cancel a `QueryStageExec` which failed before calling `doMaterialize`. Besides, this PR also includes 2 minor improvements: * fail fast when stage failed before calling `doMaterialize` * format Exception with Cause ### Why are the changes needed? For a stage which failed before materializing the lazy value (e.g. `inputRDD`), calling `cancel` on it could re-trigger the same failure again, e.g. executing child node again(see `AdaptiveQueryExecSuite`.`SPARK-30291: AQE should catch the exceptions when doing materialize` for example). And finally, the same failure will be counted 2 times, one is for materialize error and another is for cancel error. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated test. Closes #27752 from Ngone51/avoid_cancel_finished_stage. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-03 13:40:51 -08:00
Takeshi Yamamuro	4a1d273a4a	[SPARK-30997][SQL] Fix an analysis failure in generators with aggregate functions ### What changes were proposed in this pull request? We have supported generators in SQL aggregate expressions by SPARK-28782. But, the generator(explode) query with aggregate functions in DataFrame failed as follows; ``` // SPARK-28782: Generator support in aggregate expressions scala> spark.range(3).toDF("id").createOrReplaceTempView("t") scala> sql("select explode(array(min(id), max(id))) from t").show() +---+ \|col\| +---+ \| 0\| \| 2\| +---+ // A failure case handled in this pr scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show() org.apache.spark.sql.AnalysisException: The query operator `Generate` contains one or more unsupported expression types Aggregate, Window or Generate. Invalid expressions: [min(`id`), max(`id`)];; Project [col#46L] +- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L] +- Range (0, 3, step=1, splits=Some(4)) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129) ``` The root cause is that `ExtractGenerator` wrongly replaces a project w/ aggregate functions before `GlobalAggregates` replaces it with an aggregate as follows; ``` scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn") scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show() 20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === !'Project [explode(array(min('id), max('id))) AS List()] 'Project [explode(array(min(id#72L), max(id#72L))) AS List()] +- Range (0, 3, step=1, splits=Some(4)) +- Range (0, 3, step=1, splits=Some(4)) 20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator === !'Project [explode(array(min(id#72L), max(id#72L))) AS List()] Project [col#76L] !+- Range (0, 3, step=1, splits=Some(4)) +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L] ! +- Range (0, 3, step=1, splits=Some(4)) 20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1: === Result of Batch Resolution === !'Project [explode(array(min('id), max('id))) AS List()] Project [col#76L] !+- Range (0, 3, step=1, splits=Some(4)) +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L] ! +- Range (0, 3, step=1, splits=Some(4)) // the analysis failed here... ``` To avoid the case in `ExtractGenerator`, this pr addes a condition to ignore generators having aggregate functions. A correct sequence of rules is as follows; ``` 20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === !'Project [explode(array(min('id), max('id))) AS List()] 'Project [explode(array(min(id#27L), max(id#27L))) AS List()] +- Range (0, 3, step=1, splits=Some(4)) +- Range (0, 3, step=1, splits=Some(4)) 20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates === !'Project [explode(array(min(id#27L), max(id#27L))) AS List()] 'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()] +- Range (0, 3, step=1, splits=Some(4)) +- Range (0, 3, step=1, splits=Some(4)) 20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator === !'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()] 'Project [explode(_gen_input_0#31) AS List()] !+- Range (0, 3, step=1, splits=Some(4)) +- Aggregate [array(min(id#27L), max(id#27L)) AS _gen_input_0#31] ! +- Range (0, 3, step=1, splits=Some(4)) ``` ### Why are the changes needed? A bug fix. 
### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #27749 from maropu/ExplodeInAggregate. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-03-03 12:25:12 -08:00
Terry Kim	c263c15408	[SPARK-31015][SQL] Star() expression fails when used with qualified column names for v2 tables ### What changes were proposed in this pull request? For a v2 table created with `CREATE TABLE testcat.ns1.ns2.tbl (id bigint, name string) USING foo`, the following works as expected ``` SELECT testcat.ns1.ns2.tbl.id FROM testcat.ns1.ns2.tbl ``` , but a query with qualified column name with star() ``` SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl [info] org.apache.spark.sql.AnalysisException: cannot resolve 'testcat.ns1.ns2.tbl.' given input columns 'id, name'; ``` fails to resolve. And this PR proposes to fix this issue. ### Why are the changes needed? To fix a bug as described above. ### Does this PR introduce any user-facing change? Yes, now `SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl` works as expected. ### How was this patch tested? Added new test. Closes #27766 from imback82/fix_star_expression. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-04 00:55:26 +08:00
Takeshi Yamamuro	313e62c376	[SPARK-30998][SQL] ClassCastException when a generator having nested inner generators ### What changes were proposed in this pull request? A query below failed in the master; ``` scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show() 20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ... ``` This pr modified the `hasNestedGenerator` code in `ExtractGenerator` for correctly catching nested inner generators. ### Why are the changes needed? A bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #27750 from maropu/HandleNestedGenerators. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-03 19:00:33 +09:00
Kent Yao	1fac06c430	Revert "[SPARK-30808][SQL] Enable Java 8 time API in Thrift server" This reverts commit `afaeb29599`. ### What changes were proposed in this pull request? Based on the result and comment from https://github.com/apache/spark/pull/27552#discussion_r385531744 In the hive module, the server side provides datetime values simply by using `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`. There will be an inconsistency between client and server if we use Java 8 APIs. ### Why are the changes needed? The change is still not clear enough. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Nah Closes #27733 from yaooqinn/SPARK-30808. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-03 14:21:20 +08:00
HyukjinKwon	3956e95f05	[SPARK-25202][SQL][FOLLOW-UP] Keep the old parameter name 'pattern' at split in Scala API ### What changes were proposed in this pull request? To address the concern pointed out in https://github.com/apache/spark/pull/22227. This will make `split` source-compatible by removing minimal cosmetic changes. ### Why are the changes needed? For source compatibility. ### Does this PR introduce any user-facing change? No (it will prevent potential user-facing change from the original PR) ### How was this patch tested? Unittest was changed (in order for us to detect that source compatibility easily). Closes #27756 from HyukjinKwon/SPARK-25202. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-03 10:24:50 +09:00
maryannxue	473a28c1d0	[SPARK-30991] Refactor AQE readers and RDDs ### What changes were proposed in this pull request? This PR combines `CustomShuffledRowRDD` and `LocalShuffledRowRDD` into `ShuffledRowRDD`, and creates `CustomShuffleReaderExec` to unify and replace all existing AQE readers: `CoalescedShuffleReaderExec`, `LocalShuffleReaderExec` and `SkewJoinShuffleReaderExec`. ### Why are the changes needed? To reduce code redundancy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing UTs. Closes #27742 from maryannxue/aqe-readers. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-02 16:04:00 -08:00
Josh Rosen	f0010c81e2	[SPARK-31003][TESTS] Fix incorrect uses of assume() in tests ### What changes were proposed in this pull request? This patch fixes several incorrect uses of `assume()` in our tests. If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable). In contrast, `assert(condition)` will fail the test if the condition doesn't hold. If `assume()` is accidentally substituted for `assert()`then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test. This patch fixes several such cases, replacing certain `assume()` calls with `assert()`. Credit to ahirreddy for spotting this problem. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27754 from JoshRosen/fix-assume-vs-assert. Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-03-02 15:20:45 -08:00
Jungtaek Lim (HeartSaVioR)	f24a46011c	[SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable ### What changes were proposed in this pull request? This patch fixes the bug of UnsafeRow which misses to handle the UDT specifically, in `isFixedLength` and `isMutable`. These methods don't check its SQL type for UDT, always treating UDT as variable-length, and non-mutable. It doesn't bring any issue if UDT is used to represent complicated type, but when UDT is used to represent some type which is matched with fixed length of SQL type, it exposes the chance of correctness issues, as these informations sometimes decide how the value should be handled. We got report from user mailing list which suspected as mapGroupsWithState looks like handling UDT incorrectly, but after some investigation it was from GenerateUnsafeRowJoiner in shuffle phase. `0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)` Here updating position should not happen on fixed-length column, but due to this bug, the value of UDT having fixed-length as sql type would be modified, which actually corrupts the value. ### Why are the changes needed? Misclassifying of the type of length for UDT can corrupt the value when the row is presented to the input of GenerateUnsafeRowJoiner, which brings correctness issue. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT added. Closes #27747 from HeartSaVioR/SPARK-30993. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-02 22:33:11 +08:00
Maxim Gekk	f828453e95	[SPARK-30988][SQL][TESTS] Add more edge-case exercising values to stats tests ### What changes were proposed in this pull request? Added more test cases to `StatisticsCollectionTestBase`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By `StatisticsSuite` and `StatisticsCollectionSuite`. Closes #27741 from MaxGekk/stat-collect-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-02 10:30:00 +09:00
Josh Rosen	f4499f678d	[SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) ### What changes were proposed in this pull request? This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads then createDataset's usage of the encoder may lead to incorrect / corrupt results because the Encoder's internal mutable state will be updated from multiple threads. Here is an example demonstrating the problem: ```scala import org.apache.spark.sql._ val enc = implicitly[Encoder[(Int, Int)]] val datasets = (1 to 100).par.map { _ => val pairs = (1 to 100).map(x => (x, x)) spark.createDataset(pairs)(enc) } datasets.reduce(_ union _).collect().foreach { pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair") } ``` Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled. This bug is similar to SPARK-22355 / #19577, a similar problem in `Dataset.collect()`. The fix implemented here is based on #24735's updated version of the `Datataset.collect()` bugfix: use `.copy()`. For consistency, I used same [code comment](`d841b33ba3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (L3414)`) / explanation as that PR. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested manually using the example listed above. Thanks to smcnamara-stripe for identifying this bug. Closes #26076 from JoshRosen/SPARK-29419. Authored-by: Josh Rosen <rosenville@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-02 10:19:12 +09:00
iRakson	92a5ae2ae4	[SPARK-30234][SQL][FOLLOWUP] Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile` ### What changes were proposed in this pull request? Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile` ### Why are the changes needed? To follow the naming convention ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27725 from iRakson/SPARK-30234_CONFIG. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-01 10:55:41 +09:00
iRakson	a40a2f8338	[SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType' ### What changes were proposed in this pull request? Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`. ### Why are the changes needed? Better readability of configuration. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27719 from iRakson/SPARK-27619_FOLLOWUP. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-28 22:57:50 +08:00
Eric Wu	eba2076ca3	[SPARK-30842][SQL] Adjust abstraction structure for join operators ### What changes were proposed in this pull request? Currently the join operators are not well abstracted, since there is a lot of common logic. A trait can be created for easier pattern matching and other future handiness. This is a follow-up PR based on comment https://github.com/apache/spark/pull/27509#discussion_r379613391 . This PR refines the following aspects: 1. Refined structure of all physical join operators 2. Add missing joinType field for CartesianProductExec operator 3. Refined code related to Explain Formatted The EXPLAIN FORMATTED changes are 1. Converge all join operator `verboseStringWithOperatorId` implementations to `BaseJoinExec`. The join condition is displayed, and join keys are displayed if they are not empty. 2. `#1` will add Join condition to `BroadcastNestedLoopJoinExec`. 3. `#1` will NOT affect `CartesianProductExec`,`SortMergeJoin` and `HashJoin`s, since they already have their own override implementations. 4. Converge all join operator `simpleStringWithNodeId` to `BaseJoinExec`, which will enhance the one line description for `CartesianProductExec` with `JoinType` added. 5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show `BuildSide`, which was only done for `HashJoin`s before. ### Why are the changes needed? Make the code consistent with other operators and for future handiness of join operators. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #27595 from Eric5553/RefineJoin. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-28 18:42:15 +08:00
Wenchen Fan	f21894e5fa	[SPARK-30902][SQL] Default table provider should be decided by catalog implementations ### What changes were proposed in this pull request? When `CREATE TABLE` SQL statement does not specify the provider, leave it to the catalog implementations to decide. ### Why are the changes needed? It's super weird if we set the default provider to parquet when creating a table in a JDBC catalog. ### Does this PR introduce any user-facing change? Yes, v2 catalog will not see a "provider" property in table properties if it's not specified in `CREATE TABLE` SQL statement. V2 catalog is new in 3.0. ### How was this patch tested? new tests Closes #27650 from cloud-fan/create_table. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-28 15:14:23 +09:00
yi.wu	a1d2ce90b0	[SPARK-30972][SQL] PruneHiveTablePartitions should be executed as earlyScanPushDownRules ### What changes were proposed in this pull request? Make rule `PruneHiveTablePartitions` to execute as `earlyScanPushDownRules`. ### Why are the changes needed? Similar to rule `PruneFileSourcePartitions`, `PruneHiveTablePartitions` should also be executed as earlyScanPushDownRules to eliminate the impact on statistic computation later. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27723 from Ngone51/early_hive_prune. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-28 11:50:20 +08:00
Liang-Chi Hsieh	ba032acf95	[SPARK-30955][SQL] Exclude Generate output when aliasing in nested column pruning ### What changes were proposed in this pull request? When aliasing in nested column pruning in Project on top of Generate, we should exclude Generate outputs. ### Why are the changes needed? Right now we would prune nested columns in Project on top of Generate. It is possible that referred nested columns are from Generate's outputs, not from its child. To address that case, we should exclude Generate outputs when aliasing in nested column pruning. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #27702 from viirya/fix-nested-pruning. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-28 12:29:46 +09:00
Eric Wu	bce8d9354c	[SPARK-30765][SQL] Refine base operator abstraction code style ### What changes were proposed in this pull request? When doing base operator abstraction work, we found there are still some code snippets that are inconsistent with the other abstraction code style. This PR addresses the following two code refactor cases. Case 1 The `override` keyword is missing for some fields in derived classes. The compiler will not catch it if we rename some fields in the future. Case 2 Inconsistent abstract class field definition. The updated style will simplify derived class definition, e.g. `EvalPythonExec` `WindowExecBase` ### Why are the changes needed? Improve the code style consistency and code quality ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #27511 from Eric5553/BaseClassAbstraction. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-27 16:40:10 +09:00
Liang-Chi Hsieh	160c144baa	[SPARK-30590][SQL] Untyped select API cannot take typed column expression that needs input type ### What changes were proposed in this pull request? This patch proposes to throw clear analysis exception if untyped `Dataset.select` takes typed column expression that needs input type. ### Why are the changes needed? `Dataset` provides few typed `select` helper functions to select typed column expressions. The maximum number of typed columns supported is 5. If wanting to select more than 5 typed columns, it silently calls untyped `Dataset.select` and can causes weird unresolved error, like: ``` org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];; 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, assertnotnull(cast(value#114 as int)), 
input[0, int, false] AS value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS foo_agg_6#141] +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS e#17, _6#11 AS F#18] +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:430) ``` However, to fully disallow typed columns as input to untyped `select` API will break current usage like `count` that is a `TypedColumn` in `functions`. 
In order to keep compatibility, we should allow current usage of certain `TypedColumn`s as input to untyped `select` API. For the `TypedColumn`s that will cause unresolved exception, we should explicitly let users know that they are incorrectly calling untyped `select` with typed columns which need input type. ### Does this PR introduce any user-facing change? Yes, but this PR only refines the error message. When users call `Dataset.select` API with typed column that needs input type, an analysis exception will be thrown. Previously an unresolved error will be thrown. ### How was this patch tested? Unit tests. Closes #27499 from viirya/SPARK-30590. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-27 14:09:07 +08:00
Wenchen Fan	eced93215f	[SPARK-30918][SQL][FOLLOWUP] Fix typo in OptimizeSkewedJoin ### What changes were proposed in this pull request? This is a follow up of #27669 in order to fix a typo. ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #27714 from cloud-fan/typo. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-26 13:59:43 -08:00
iRakson	c913b9d8b5	[SPARK-27619][SQL] MapType should be prohibited in hash expressions ### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. 
Closes #27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-27 01:48:12 +08:00
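The order-dependence described in SPARK-27619 above can be illustrated without Spark. The following is a minimal sketch, assuming a hypothetical `orderSensitiveHash` helper that stands in for any hash that walks map entries in their stored order; it is not Spark's `hash()` implementation.

```scala
object MapHashOrder {
  // Hypothetical stand-in for a hash that visits entries in stored order.
  // The seed of 42 mirrors the default seed visible in the plan above,
  // but the mixing function here is purely illustrative.
  def orderSensitiveHash(entries: Seq[(Int, Int)], seed: Int = 42): Int =
    entries.foldLeft(seed) { case (h, (k, v)) => 31 * h + (31 * k + v) }

  // The same two entries, inserted in different orders.
  val a: Seq[(Int, Int)] = Seq(1 -> 1, 2 -> 2)
  val b: Seq[(Int, Int)] = Seq(2 -> 2, 1 -> 1)

  // Scala's Map ignores insertion order for both equality and hashCode.
  val scalaMapsEqual: Boolean = a.toMap == b.toMap
  val scalaHashesEqual: Boolean = a.toMap.hashCode() == b.toMap.hashCode()

  // An order-sensitive hash over the same entries disagrees.
  val orderedHashesEqual: Boolean = orderSensitiveHash(a) == orderSensitiveHash(b)
}
```

Two maps that compare equal can therefore hash differently under an order-sensitive scheme, which is the mismatch the commit avoids by rejecting `MapType` inputs.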
Terry Kim	73305475c1	[SPARK-30782][SQL] Column resolution doesn't respect current catalog/namespace for v2 tables ### What changes were proposed in this pull request? This PR proposes to fix an issue where qualified columns are not matched for v2 tables if current catalog/namespace are used. For v1 tables, you can currently perform the following: ```SQL SELECT default.t.id FROM t; ``` For v2 tables, the following fails: ```SQL USE testcat.ns1.ns2; SELECT testcat.ns1.ns2.t.id FROM t; org.apache.spark.sql.AnalysisException: cannot resolve '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7; ``` ### Why are the changes needed? It is a bug since qualified column names cannot match if current catalog/namespace are used. ### Does this PR introduce any user-facing change? Yes, now the following works: ```SQL USE testcat.ns1.ns2; SELECT testcat.ns1.ns2.t.id FROM t; ``` ### How was this patch tested? Added new tests Closes #27532 from imback82/qualifed_col_respect_current. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-27 00:21:38 +08:00
gatorsmile	28b8713036	[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT ### What changes were proposed in this pull request? This patch is to bump the master branch version to 3.1.0-SNAPSHOT. ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? N/A ### How was this patch tested? N/A Closes #27698 from gatorsmile/updateVersion. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-25 19:44:31 -08:00
Jungtaek Lim (HeartSaVioR)	9ea6c0a897	[SPARK-30943][SS] Show "batch ID" in tool tip string for Structured Streaming UI graphs ### What changes were proposed in this pull request? This patch changes the tool tip string in Structured Streaming UI graphs to show batch ID (and timestamp as well) instead of only showing timestamp, which was a key for DStream but no longer a key for Structured Streaming. This patch does some refactoring as there're some spots on confusion between js file for streaming and structured streaming. Note that this patch doesn't actually change the x axis, as once we change it we should decouple the logic for graphs between streaming and structured streaming. It won't change UX meaningfully as in x axis we only show min and max which we still would like to know about "time" as well as batch ID. ### Why are the changes needed? In Structured Streaming, everything is aligned for "batch ID" where the UI is only showing timestamp - end users have to manually find and correlate batch ID and the timestamp which is clearly a huge pain. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested. Screenshots: ![Screen Shot 2020-02-25 at 7 22 38 AM](https://user-images.githubusercontent.com/1317309/75197701-40b2ce80-57a2-11ea-9578-c2eb2d1091de.png) ![Screen Shot 2020-02-25 at 7 22 44 AM](https://user-images.githubusercontent.com/1317309/75197704-427c9200-57a2-11ea-9439-e0a8303d0860.png) ![Screen Shot 2020-02-25 at 7 22 58 AM](https://user-images.githubusercontent.com/1317309/75197706-43152880-57a2-11ea-9617-1276c3ba181e.png) ![Screen Shot 2020-02-25 at 7 23 04 AM](https://user-images.githubusercontent.com/1317309/75197708-43152880-57a2-11ea-9de2-7d37eaf88102.png) ![Screen Shot 2020-02-25 at 7 23 31 AM](https://user-images.githubusercontent.com/1317309/75197710-43adbf00-57a2-11ea-9ae4-4e292de39c36.png) Closes #27687 from HeartSaVioR/SPARK-30943. 
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-02-25 15:29:36 -08:00
Wenchen Fan	8f247e5d36	[SPARK-30918][SQL] improve the splitting of skewed partitions ### What changes were proposed in this pull request? Use the average size of the non-skewed partitions as the target size when splitting skewed partitions, instead of ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD ### Why are the changes needed? The goal of skew join optimization is to make the data distribution more even. So it makes more sense to use the average size of the non-skewed partitions as the target size. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27669 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-25 14:10:29 -08:00
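The target-size idea in SPARK-30918 above can be sketched in a few lines. This is an illustration only: `SkewSplit`, `targetSize`, `numSplits` and the `isSkewed` predicate are hypothetical names, and the actual `OptimizeSkewedJoin` rule may differ in detail.

```scala
object SkewSplit {
  // Average size of the non-skewed partitions, used as the split target.
  // Falls back to the largest partition if every partition is skewed
  // (an assumption made for this sketch).
  def targetSize(sizes: Seq[Long], isSkewed: Long => Boolean): Long = {
    val nonSkewed = sizes.filterNot(isSkewed)
    val avg = if (nonSkewed.isEmpty) sizes.max else nonSkewed.sum / nonSkewed.size
    math.max(1L, avg) // guard against a zero target
  }

  // Number of pieces a partition of the given size is cut into.
  def numSplits(size: Long, target: Long): Int =
    math.max(1, math.ceil(size.toDouble / target).toInt)
}
```

With partition sizes `Seq(100L, 120L, 80L, 1000L)` and a predicate that treats anything above 500 as skewed, the target is the average of the three small partitions, so the large partition is cut into pieces comparable to its neighbours rather than to a fixed threshold.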
Maxim Gekk	ffc0935e64	[SPARK-30869][SQL] Convert dates to/from timestamps in microseconds precision ### What changes were proposed in this pull request? In the PR, I propose to replace: 1. `millisToDays()` by `microsToDays()` which accepts microseconds since the epoch and returns days since the epoch in the specified time zone. The last one is the internal representation of Catalyst's DateType. 2. `daysToMillis()` by `daysToMicros()` which accepts days since the epoch in some time zone and returns the number of microseconds since the epoch. The last one is internal representation of Catalyst's TimestampType. 3. `fromMillis()` by `millisToMicros()` 4. `toMillis()` by `microsToMillis()` ### Why are the changes needed? Spark stores timestamps in microseconds precision, so, there is no actual need to convert dates to milliseconds, and then to microseconds. As examples, look at DateTimeUtils functions `monthsBetween()` and `truncTimestamp()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites UnivocityParserSuite, DateExpressionsSuite, ComputeCurrentTimeSuite, DateTimeUtilsSuite, DateFunctionsSuite, JsonSuite, StreamSuite. Closes #27618 from MaxGekk/replace-millis-by-micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-25 23:05:28 +08:00
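The conversions named in SPARK-30869 above can be approximated with `java.time` alone. The sketch below reuses the names `microsToDays` and `daysToMicros` from the commit message, but the bodies are an assumption written for illustration, not the code in `DateTimeUtils`.

```scala
import java.time.{Instant, LocalDate, ZoneId}
import java.time.temporal.ChronoUnit

object DateMicros {
  // Microseconds since the epoch -> days since the epoch, as seen in `zone`.
  def microsToDays(micros: Long, zone: ZoneId): Int = {
    val instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
    Math.toIntExact(instant.atZone(zone).toLocalDate.toEpochDay)
  }

  // Days since the epoch in `zone` -> microseconds since the epoch,
  // taken at the start of that local day.
  def daysToMicros(days: Int, zone: ZoneId): Long = {
    val startOfDay = LocalDate.ofEpochDay(days.toLong).atStartOfDay(zone).toInstant
    ChronoUnit.MICROS.between(Instant.EPOCH, startOfDay)
  }
}
```

Working directly in microseconds avoids the intermediate millisecond step that the commit removes, and the time zone matters: the same instant can fall on different calendar days in different zones.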
Kent Yao	761209c1f2	[SPARK-30919][SQL] Make interval multiply and divide's overflow behavior consistent with other operations ### What changes were proposed in this pull request? The current behavior of interval multiply and divide follows the ANSI SQL standard when overflow, it is compatible with other operations when `spark.sql.ansi.enabled` is true, but not compatible when `spark.sql.ansi.enabled` is false. When `spark.sql.ansi.enabled` is false, as the factor is a double value, so it should use java's rounding or truncation behavior for casting double to integrals. when divided by zero, it returns `null`. we also follow the natural rules for intervals as defined in the Gregorian calendar, so we do not add the month fraction to days but add days fraction to microseconds. ### Why are the changes needed? Make interval multiply and divide's overflow behavior consistent with other interval operations ### Does this PR introduce any user-facing change? no, these are new features in 3.0 ### How was this patch tested? add uts Closes #27672 from yaooqinn/SPARK-30919. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-25 22:19:24 +08:00
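The difference between the two overflow modes described in SPARK-30919 above can be sketched on a single integer field. The helper names below are hypothetical, and the real interval arithmetic also handles days and microseconds; this only shows the JVM double-to-int behaviour the message refers to.

```scala
object IntervalScale {
  // Non-ANSI style: rely on the JVM's double-to-int conversion, which
  // truncates toward zero and saturates at Int.MinValue / Int.MaxValue.
  def scaleLenient(months: Int, factor: Double): Int =
    (months * factor).toInt

  // ANSI style: raise an error instead of silently saturating.
  def scaleExact(months: Int, factor: Double): Int = {
    val result = months * factor
    if (result.isNaN || result > Int.MaxValue || result < Int.MinValue)
      throw new ArithmeticException("integer overflow")
    result.toInt
  }

  // Non-ANSI division by zero yields no value (modelled here as None,
  // standing in for SQL NULL).
  def divideLenient(months: Int, divisor: Double): Option[Int] =
    if (divisor == 0.0) None else Some((months / divisor).toInt)
}
```

`scaleLenient(Int.MaxValue, 2.0)` saturates at `Int.MaxValue`, while `scaleExact` with the same arguments throws, which mirrors the split between `spark.sql.ansi.enabled` being false and true.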
Yuanjian Li	e45f2c7fc0	[SPARK-28228][SQL][TESTS] Refactoring for nested CTE tests ### What changes were proposed in this pull request? Split the nested CTE cases into a single file `cte-nested.sql`, which will be reused in cte-legacy.sql and cte-nonlegacy.sql. ### Why are the changes needed? Make the cases easy to maintain. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #27667 from xuanyuanking/SPARK-28228-test. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-25 17:37:34 +09:00
Terry Kim	0fd4fa70c8	[SPARK-30885][SQL] V1 table name should be fully qualified if catalog name is provided ### What changes were proposed in this pull request? For the following: ``` CREATE TABLE t USING json AS SELECT 1 AS i SELECT * FROM spark_catalog.t ``` `spark_catalog.t` is resolved to `spark_catalog.default.t` assuming the current namespace is `default`. However, this is not consistent with V2 behavior where the namespace must be specified if the catalog name is provided. This PR proposes to fix this inconsistency. ### Why are the changes needed? To be consistent with V2 table naming scheme in SQL commands. ### Does this PR introduce any user-facing change? Yes, now the user has to specify the namespace if the catalog name is provided. For example, ``` SELECT * FROM spark_catalog.t # Will throw AnalysisException with 'Session catalog cannot have an empty namespace: spark_catalog.t' SELECT * FROM spark_catalog.default.t # OK ``` ### How was this patch tested? Added new tests Closes #27642 from imback82/disallow_spark_catalog_wihtout_db. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-25 13:04:28 +08:00
Shixiong Zhu	3126557b07	[SPARK-30936][CORE] Set FAIL_ON_UNKNOWN_PROPERTIES to false by default to parse Spark events ### What changes were proposed in this pull request? Set `FAIL_ON_UNKNOWN_PROPERTIES` to `false` in `JsonProtocol` to allow ignore unknown fields in a Spark event. After this change, if we add new fields to a Spark event parsed by `ObjectMapper`, the event json string generated by a new Spark version can still be read by an old Spark History Server. Since Spark History Server is an extra service, it usually takes time to upgrade, and it's possible that a Spark application is upgraded before SHS. Forwards-compatibility will allow an old SHS to support new Spark applications (may lose some new features but most of functions should still work). ### Why are the changes needed? `JsonProtocol` is supposed to provide strong backwards-compatibility and forwards-compatibility guarantees: any version of Spark should be able to read JSON output written by any other version, including newer versions. However, the forwards-compatibility guarantee is broken for events parsed by `ObjectMapper`. If a new field is added to an event parsed by `ObjectMapper` (e.g., `6dc5921e66 (diff-dc5c7a41fbb7479cef48b67eb41ad254R33)`), the event json string generated by a new Spark version cannot be parsed by an old version of SHS right now. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The new added tests. Closes #27680 from zsxwing/SPARK-30936. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-25 12:28:31 +08:00
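The forwards-compatibility idea in SPARK-30936 above can be sketched outside of Jackson. The Python example below is only an analogy, not Spark's `JsonProtocol`: the `JobEvent` type, its fields, and the `fail_on_unknown` flag are hypothetical names chosen for illustration. It shows how an older reader can accept an event that carries a field it does not know about.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class JobEvent:
    """Hypothetical event type as known to an 'old' reader."""
    event: str
    job_id: int


def parse_event(text: str, fail_on_unknown: bool = False) -> JobEvent:
    """Parse a JSON event, optionally rejecting fields the reader does not know."""
    raw = json.loads(text)
    known = {f.name for f in fields(JobEvent)}
    unknown = set(raw) - known
    if unknown and fail_on_unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    # Silently drop unknown fields so newer writers stay readable.
    return JobEvent(**{k: v for k, v in raw.items() if k in known})
```

With `fail_on_unknown=False`, an event written by a newer version with an extra field still parses. With `fail_on_unknown=True`, the same input is rejected, which is the behavior the commit moves away from.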
Peter Toth	1a4e2423b2	[SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure ### What changes were proposed in this pull request? This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues: ``` SELECT explodedvalue.* FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value) LATERAL VIEW explode(value) AS explodedvalue ``` This is a regression from Spark 2.4. ### Why are the changes needed? To fix a bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added new UT. Closes #27675 from peter-toth/SPARK-30870. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-24 13:46:21 -08:00
Shixiong Zhu	293e5364e5	[SPARK-30927][SS] StreamingQueryManager should avoid keeping reference to terminated StreamingQuery ### What changes were proposed in this pull request? Right now `StreamingQueryManager` will keep the last terminated query until `resetTerminated` is called. When the last terminated query has lots of states (a large sql plan, cached RDDs, etc.), it will keep a lot of memory unnecessarily. Actually, what `StreamingQueryManager` really needs is just the exception of the last failed query. This PR changes the internal field `lastTerminatedQuery` in `StreamingQueryManager` to remember the last exception rather than the query to save the memory. ### Why are the changes needed? Avoid keeping memory unnecessarily. ### Does this PR introduce any user-facing change? No ### How was this patch tested? This PR doesn't change any public behaviors. The existing tests have covered the touched codes. Closes #27678 from zsxwing/SPARK-30927. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-24 18:48:19 +09:00
beliefer	621e37e2ab	[SPARK-28880][SQL] Support ANSI nested bracketed comments ### What changes were proposed in this pull request? Spark SQL currently supports single-line comments and bracketed comments. This PR adds support for nested bracketed comments. Several mainstream databases support this syntax. PostgreSQL: https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Expressions/Comments.htm?zoom_highlight=comments Note: Because Spark SQL has no existing UT for single-line comments and bracketed comments, I added some UT for them. ### Why are the changes needed? Nested bracketed comments are part of the ANSI standard. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT Closes #27495 from beliefer/nested-brancket-comments. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-02-24 00:28:46 -08:00
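To make the nesting rule from SPARK-28880 above concrete, here is a minimal Python sketch that strips bracketed comments with a depth counter. It is not Spark's lexer: the function name is hypothetical, and it deliberately ignores string literals, single-line comments, and hint syntax, all of which a real SQL lexer would have to handle.

```python
def strip_bracketed_comments(sql: str) -> str:
    """Remove /* ... */ comments, allowing them to nest (illustrative only)."""
    out = []
    depth = 0  # how many bracketed comments are currently open
    i = 0
    while i < len(sql):
        pair = sql[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(sql[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated bracketed comment")
    return "".join(out)
```

Without the counter, the first `*/` would close the outer comment and leave the rest of the comment text in the statement.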
Burak Yavuz	4ff2718d54	[SPARK-30924][SQL][3.0] Add additional checks to Merge Into ### What changes were proposed in this pull request? Merge Into is currently missing additional validation around: 1. The lack of any WHEN statements 2. The first WHEN MATCHED statement needs to have a condition if there are two WHEN MATCHED statements. 3. Single use of UPDATE/DELETE This PR introduces these validations. (1) is required, because otherwise the MERGE statement is useless. (2) is required, because otherwise the second WHEN MATCHED condition becomes dead code (3) is up for debate, but the idea there is that a single expression should be sufficient to specify when you would like to update or delete your records. We restrict it for now to reduce surface area and ambiguity. ### Why are the changes needed? To ease DataSource developers when building implementations for MERGE ### Does this PR introduce any user-facing change? Adds additional validation checks ### How was this patch tested? Unit tests Closes #27677 from brkyvz/mergeChecks. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-24 15:16:37 +08:00
jiake	f4696ba252	[SPARK-30922][SQL] remove the max splits config in skewed join ### What changes were proposed in this pull request? When skewed join optimization splits a partition into many skewed readers, the plan may become very large and cannot be shown in the UI quickly. The config `spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits` was introduced to resolve this UI display issue. After [PR#27493](https://github.com/apache/spark/pull/27493) combined the skewed readers into one, this config is no longer needed. ### Why are the changes needed? Remove the unnecessary config ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing test Closes #27673 from JkSelf/removeMaxSplitNum. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-24 14:29:25 +08:00
Maxim Gekk	c41ef39819	[SPARK-30925][SQL] Prevent overflow/round errors in conversions of milliseconds to/from microseconds ### What changes were proposed in this pull request? - Use `Math.multiplyExact()` in `DateTimeUtils.fromMillis()` to prevent silent overflow when converting milliseconds to microseconds. - Use `DateTimeUtils.fromMillis()` in all places where milliseconds are converted to microseconds - Use `DateTimeUtils.toMillis()` in all places where microseconds are converted to milliseconds ### Why are the changes needed? 1. To prevent silent arithmetic overflow while multiplying by 1000 in `fromMillis()`. Instead, `new ArithmeticException("long overflow")` will be thrown and handled accordingly. 2. To correctly round microseconds in conversion to milliseconds. For example, `1965-01-01 10:11:12.123456` is represented as `-157700927876544` in micro precision. In milliseconds precision the above needs to be represented as `-157700927877` or `1965-01-01 10:11:12.123`. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? By `TimestampFormatterSuite`, `CastSuite`, `DateExpressionsSuite`, `IntervalExpressionsSuite`, `ExpressionParserSuite`, `DateTimeUtilsSuite`, `IntervalUtilsSuite` Closes #27676 from MaxGekk/millis-2-micros-overflow. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-24 14:06:25 +08:00
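The two points in SPARK-30925 above can be checked with a small arithmetic sketch. The Python below is an illustration, not the `DateTimeUtils` code: the function names are hypothetical, and because Python integers do not overflow, the 64-bit range check is written out by hand to imitate what `Math.multiplyExact()` does for a Java `long`.

```python
LONG_MIN, LONG_MAX = -(2 ** 63), 2 ** 63 - 1
MICROS_PER_MILLIS = 1000


def millis_to_micros(millis: int) -> int:
    """Multiply by 1000, failing loudly if the result leaves the 64-bit range."""
    micros = millis * MICROS_PER_MILLIS
    if not LONG_MIN <= micros <= LONG_MAX:
        raise OverflowError("long overflow")
    return micros


def micros_to_millis(micros: int) -> int:
    """Round toward negative infinity so pre-epoch timestamps keep the right millisecond."""
    return micros // MICROS_PER_MILLIS
```

For the timestamp in the commit message, `micros_to_millis(-157700927876544)` yields `-157700927877`, whereas truncating toward zero would give `-157700927876`, which is one millisecond late.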
yi.wu	9c2eadc726	[SPARK-30844][SQL] Static partition should also follow StoreAssignmentPolicy when insert into table ### What changes were proposed in this pull request? Make static partitions also follow `StoreAssignmentPolicy` when inserting into a table: if `StoreAssignmentPolicy=LEGACY`, use `Cast`; if `StoreAssignmentPolicy=ANSI | STRICT`, use `AnsiCast`. E.g., for the table `t` created by: ``` create table t(a int, b string) using parquet partitioned by (a) ``` and insert values with `StoreAssignmentPolicy=ANSI` using: ``` insert into t partition(a='ansi') values('ansi') ``` Before this PR: ``` +----+----+ | b| a| +----+----+ |ansi|null| +----+----+ ``` After this PR, the insert fails with: ``` java.lang.NumberFormatException: invalid input syntax for type numeric: ansi ``` (It would be better if we could use `TableOutputResolver.checkField` to fully follow `StoreAssignmentPolicy`. But since we lose the data type of the static partition's value in the first place, it's hard to use `TableOutputResolver.checkField`.) ### Why are the changes needed? I think we should follow `StoreAssignmentPolicy` when inserting into a table for any column, including static partitions. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added new test. Closes #27597 from Ngone51/fix-static-partition. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-02-23 17:46:19 +09:00
yi.wu	25f5bfaa6e	[SPARK-30903][SQL] Fail fast on duplicate columns when analyze columns ### What changes were proposed in this pull request? Add a new `CommandCheck` rule and fail fast when it detects duplicate columns in `AnalyzeColumnCommand`. ### Why are the changes needed? To avoid duplicate statistics computation for the same column in `AnalyzeColumnCommand`. ### Does this PR introduce any user-facing change? Yes. Users now get an exception when the input contains duplicate columns. ### How was this patch tested? Added new test. Closes #27651 from Ngone51/fail_on_dup_cols. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-02-23 09:52:54 +09:00
beliefer	59d6d5cbb0	[SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder ### What changes were proposed in this pull request? Spark `ConfigEntry` and `ConfigBuilder` are missing the Spark version in which each configuration was released. This is inconvenient for Spark users when they visit the Spark configuration page. http://spark.apache.org/docs/latest/configuration.html The new Spark SQL config docs look like: ![SQL config screenshot](https://user-images.githubusercontent.com/8486025/74604522-cb882f00-50f9-11ea-8683-57a90f9e3347.png) ``` > SET -v spark.sql.adaptive.enabled false When true, enable adaptive query execution. spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin 0.2 The relation with a non-empty partition ratio lower than this config will not be considered as the build side of a broadcast-hash join in adaptive execution regardless of its size.This configuration only has an effect when 'spark.sql.adaptive.enabled' is enabled. spark.sql.adaptive.optimizeSkewedJoin.enabled true When true and adaptive execution is enabled, a skewed join is automatically handled at runtime. spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor 10 A partition is considered as a skewed partition if its size is larger than this factor multiple the median partition size and also larger than spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionMaxSplits 5 Configures the maximum number of task to handle a skewed partition in adaptive skewedjoin. spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold 64MB Configures the minimum size in bytes for a partition that is considered as a skewed partition in adaptive skewed join. spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled true Whether to fetch the continuous shuffle blocks in batch. 
Instead of fetching blocks one by one, fetching continuous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note, multiple continuous blocks exist in single fetch request only happen when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled, this feature also depends on a relocatable serializer, the concatenation support codec in use and the new version shuffle fetch protocol. spark.sql.adaptive.shuffle.localShuffleReader.enabled true When true and 'spark.sql.adaptive.enabled' is enabled, this enables the optimization of converting the shuffle reader to local shuffle reader for the shuffle exchange of the broadcast hash join in probe side. spark.sql.adaptive.shuffle.maxNumPostShufflePartitions <undefined> The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled. ``` Note: Because there are so many configuration items that are exposed and require a lot of finishing, I will add the version numbers of these configuration items in another PR. ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Exists UT Closes #27592 from beliefer/add-version-to-config. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-22 09:46:42 +09:00
Eric Wu	1f0300fb16	[SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style ### What changes were proposed in this pull request? The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in https://github.com/apache/spark/pull/27368#discussion_r376694496 https://github.com/apache/spark/pull/27368#discussion_r376927143 Observations/Ideas: 1. Using comma as the separator is not clear, especially since commas are used inside the expressions too. 2. Show the column counts first? For example, `Results [4]: …` 3. Currently the attribute names are automatically generated; this needs to be refined. 4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between `EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`. 5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123 This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability. ### Why are the changes needed? The readability of `EXPLAIN FORMATTED` needs to be improved, which will help users better understand the query plan. ### Does this PR introduce any user-facing change? Yes, `EXPLAIN FORMATTED` output style changed. ### How was this patch tested? Updated expected results of test cases in explain.sql Closes #27509 from Eric5553/ExplainFormattedRefine. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 23:36:14 +08:00
maryannxue	6058ce97b9	[SPARK-30906][SQL] Turning off AQE in CacheManager is not thread-safe ### What changes were proposed in this pull request? This PR aims to fix the thread-safety issue in turning off AQE for CacheManager by cloning the current session and changing the AQE conf on the cloned session. This PR also adds a utility function for cloning the session with AQE disabled conf value, which can be shared by another caller. ### Why are the changes needed? To fix the potential thread-unsafe problem. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested CachedTableSuite with AQE settings enabled. Closes #27659 from maryannxue/spark-30906. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 22:49:20 +08:00
Yuanjian Li	a5efbb284e	[SPARK-30809][SQL] Review and fix issues in SQL API docs ### What changes were proposed in this pull request? - Add missing `since` annotation. - Don't show classes under `org.apache.spark.sql.dynamicpruning` package in API docs. - Fix the scope of `xxxExactNumeric` to remove it from the API docs. ### Why are the changes needed? Avoid leaking APIs unintentionally in Spark 3.0.0. ### Does this PR introduce any user-facing change? No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0. ### How was this patch tested? Manually generated the API docs and verified the above issues have been fixed. Closes #27560 from xuanyuanking/SPARK-30809. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 17:03:22 +08:00
yi.wu	82ce4753aa	[SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default ### What changes were proposed in this pull request? This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`). Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`. ### Why are the changes needed? According to #23498, since Spark 3.0, the untyped UDF will return the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4 when `x` is null. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default. As a result, this might change data silently and cause correctness issues if users still expect `null` in some cases. Thus, we should encourage users to use typed UDFs to avoid this problem. ### Does this PR introduce any user-facing change? Yes. Users will now hit an exception when using an untyped UDF. ### How was this patch tested? Added test and updated some tests. Closes #27488 from Ngone51/spark_26580_followup. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 14:46:54 +08:00
wuyi	5eb004f4bb	Revert "[SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue" ### What changes were proposed in this pull request? This reverts commit `bef5d9d6c3`. ### Why are the changes needed? Revert it according to https://github.com/apache/spark/pull/24902#issuecomment-584511167. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27540 from Ngone51/revert_spark_28093. Lead-authored-by: wuyi <yi.wu@databricks.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 12:55:32 +08:00
Maxim Gekk	a551715fd2	[SPARK-29930][SPARK-30416][SQL][FOLLOWUP] Move deprecated/removed config checks from RuntimeConfig to SQLConf ### What changes were proposed in this pull request? - Output warnings for deprecated SQL configs in `SQLConf. setConfWithCheck()` and in `SQLConf. unsetConf()` - Throw an exception for removed SQL configs in `SQLConf. setConfWithCheck()` when they set to non-default values - Remove checking of deprecated and removed SQL configs from RuntimeConfig ### Why are the changes needed? Currently, warnings/exceptions are printed only when a SQL config is set dynamically, for instance via `spark.conf.set()`. After the changes, removed/deprecated SQL configs will be checked when they set statically. For example: ``` $ bin/spark-shell --conf spark.sql.fromJsonForceNullableSchema=false scala> spark.emptyDataFrame java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder': ... Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value. ``` ``` $ bin/spark-shell --conf spark.sql.hive.verifyPartitionPath=false scala> spark.emptyDataFrame 20/02/20 02:10:26 WARN SQLConf: The SQL config 'spark.sql.hive.verifyPartitionPath' has been deprecated in Spark v3.0 and may be removed in the future. This config is replaced by 'spark.files.ignoreMissingFiles'. ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? By `SQLConfSuite` Closes #27645 from MaxGekk/remove-sql-configs-followup-2. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 00:00:48 +08:00
Wenchen Fan	704d249a56	[SPARK-26071][FOLLOWUP] Improve migration guide of disallowing map type map key ### What changes were proposed in this pull request? mention the workaround if users do want to use map type as key, and add a test to demonstrate it. ### Why are the changes needed? it's better to provide an alternative when we ban something. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #27621 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-20 22:10:04 +08:00
herman	c92d437c46	[SPARK-30811][SQL] CTE should not cause stack overflow when it refers to non-existent table with same name ### Why are the changes needed? This ports the tests introduced in `7285eea683` to master to avoid future regressions. ### Background A query with Common Table Expressions can cause a stack overflow when it contains a CTE that refers to a non-existent table with the same name. The name of the table needs to have a database qualifier. This is caused by a couple of things: - CTESubstitution runs analysis on the CTE, but this does not throw an exception because the table has a database qualifier. We don't fail because we re-attempt to resolve the relation in a later rule; - The CTESubstitution replace logic does not check whether the table it is replacing has a database; it shouldn't replace the relation if it does. So now we will happily replace nonexist.t with t; Note that this is not an issue for master or the spark-3.0 branch. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added regression test to `AnalysisErrorSuite` and `DataFrameSuite`. Closes #27635 from hvanhovell/SPARK-30811-master. Authored-by: herman <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-19 10:17:46 -08:00
LantaoJin	c0715221b2	[SPARK-30785][SQL] Create table like should keep tracksPartitionsInCatalog same with source table ### What changes were proposed in this pull request? Table generated by `CREATE TABLE LIKE` a partitioned table is a partitioned table. But when run `ALTER TABLE ADD PARTITION`, it will throw `AnalysisException: ALTER TABLE ADD PARTITION is not allowed`. That's because the default value of `tracksPartitionsInCatalog` from `CREATE TABLE LIKE` always is false. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add a unit test. Closes #27538 from LantaoJin/SPARK-30785. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-19 15:05:34 +08:00
beliefer	0894dbab2c	[MINOR][SQL] Improve readability for window execution ### What changes were proposed in this pull request? I read the comments of `WindowExec` and found that some comments cause confusion and others need improvement. ### Why are the changes needed? This PR enhances readability and makes developers' work easier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? No need Closes #27431 from beliefer/improve-window-readability. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-19 14:26:27 +08:00
Wenchen Fan	1b67d546bd	revert SPARK-29663 and SPARK-29688 ### What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/pull/26325 and https://github.com/apache/spark/pull/26347 ### Why are the changes needed? When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double. However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption. ### Does this PR introduce any user-facing change? yes, remove 2 features added in 3.0 ### How was this patch tested? N/A Closes #27619 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-02-18 21:19:57 +01:00
yi.wu	643a480b11	[SPARK-30863][SQL] Distinguish Cast and AnsiCast in toString ### What changes were proposed in this pull request? Prefix by `ansi_` in `toString` if it's a `AnsiCast` or ansi enabled `Cast`. E.g. run `spark.sql("select cast('51' as int)").queryExecution.analyzed` under ansi mode. Before this PR: ``` Project [cast(51 as int) AS CAST(51 AS INT)#0] +- OneRowRelation ``` After this PR: ``` Project [ansi_cast(51 as int) AS CAST(51 AS INT)#0] +- OneRowRelation ``` ### Why are the changes needed? This is useful while comparing `LogicalPlan`s literally. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27608 from Ngone51/ansi_cast_tostring. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 16:10:43 +08:00
Terry Kim	5866bc77d7	[SPARK-30814][SQL] ALTER TABLE ... ADD COLUMN position should be able to reference columns being added ### What changes were proposed in this pull request? In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema: ``` root: - a: string - b: long ``` , the following should work: ``` ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x) ``` Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue. ### Why are the changes needed? To fix a bug described above. ### Does this PR introduce any user-facing change? Yes, now ``` ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x) ``` works as expected. ### How was this patch tested? Added new tests Closes #27584 from imback82/alter_table_pos_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 13:01:45 +08:00
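One way to read the intended semantics of SPARK-30814 above is that the additions are applied in order, so a later column can be positioned relative to an earlier one from the same statement. The Python sketch below models only that ordering on a flat list of column names. The function `add_columns` and its `(name, after)` tuples are hypothetical, and the resulting order is an assumption based on the commit's example rather than on Spark's resolver.

```python
from typing import List, Optional, Tuple


def add_columns(schema: List[str],
                additions: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Apply (name, after) additions in order; after=None appends at the end."""
    result = list(schema)
    for name, after in additions:
        if name in result:
            raise ValueError(f"column already exists: {name}")
        if after is None:
            result.append(name)
        elif after in result:
            # 'after' may be a column added earlier in this same statement.
            result.insert(result.index(after) + 1, name)
        else:
            raise ValueError(f"cannot resolve AFTER {after}")
    return result
```

For the table in the commit message, `add_columns(["a", "b"], [("x", "a"), ("y", "x")])` gives `["a", "x", "y", "b"]`. Resolving every position against the original schema only would fail on `AFTER x`, which is the reported bug.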
Liang Zhang	d8c0599e54	[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'semanticHash' methods in Dataset ### What changes were proposed in this pull request? This PR added two DeveloperApis to the Dataset[T] class. Both methods are just exposing lower-level methods to the Dataset[T] class. ### Why are the changes needed? They are useful for checking whether two dataframes are the same when implementing dataframe caching in python, and also get a unique ID. It's easier to use if we wrap the lower-level APIs. ### Does this PR introduce any user-facing change? ``` scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2") df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2") df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2") df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2") df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int] scala> df1.semanticHash res0: Int = 594427822 scala> df2.semanticHash res1: Int = 594427822 scala> df1.sameSemantics(df2) res2: Boolean = true scala> df1.sameSemantics(df3) res3: Boolean = false scala> df3.semanticHash res4: Int = -1592702048 scala> df4.semanticHash res5: Int = -1592702048 scala> df4.sameSemantics(df3) res6: Boolean = true ``` ### How was this patch tested? Unit test in scala and doctest in python. Note: comments are copied from the corresponding lower-level APIs. Note: There are some issues to be fixed that would improve the hash collision rate: https://github.com/apache/spark/pull/27565#discussion_r379881028 Closes #27565 from liangz1/df-same-result. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: WeichenXu <weichen.xu@databricks.com>	2020-02-18 09:22:26 +08:00
Ajith	657d151395	[SPARK-29174][SQL] Support LOCAL in INSERT OVERWRITE DIRECTORY to data source ### What changes were proposed in this pull request? `INSERT OVERWRITE LOCAL DIRECTORY` is now supported by ensuring that the provided path always uses the `file://` scheme and by removing the check that throws an exception when `LOCAL` is used in an insert overwrite to a directory. ### Why are the changes needed? Without this change, ``` insert overwrite local directory <location> using ``` throws the exception ``` Error: org.apache.spark.sql.catalyst.parser.ParseException: LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 0) ``` This check was introduced in https://github.com/apache/spark/pull/18975, but the restriction is not needed, so it is dropped. This keeps the behaviour consistent for local and remote file systems in `INSERT OVERWRITE DIRECTORY`. ### Does this PR introduce any user-facing change? Yes, after this change `INSERT OVERWRITE LOCAL DIRECTORY` no longer throws an exception. ### How was this patch tested? Added UT Closes #27039 from ajithme/insertoverwrite2. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-18 09:42:31 +09:00
Ajith	2854091d12	[SPARK-22590][SQL] Copy sparkContext.localproperties to child thread in BroadcastExchangeExec.executionContext ### What changes were proposed in this pull request? In `org.apache.spark.sql.execution.exchange.BroadcastExchangeExec#relationFuture` make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the broadcast execution thread in `org.apache.spark.sql.execution.exchange.BroadcastExchangeExec#executionContext` ### Why are the changes needed? When executing `BroadcastExchangeExec`, the relationFuture is evaluated via a separate thread. The threads inherit the `localProperties` from `sparkContext` as they are the child threads. These threads are created in the executionContext (thread pools). Each Thread pool has a default `keepAliveSeconds` of 60 seconds for idle threads. Scenarios where the thread pool has threads which are idle and reused for a subsequent new query, the thread local properties will not be inherited from spark context (thread properties are inherited only on thread creation) hence end up having old or no properties set. This will cause taskset properties to be missing when properties are transferred by child thread via `sparkContext.runJob/submitJob` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT Closes #27266 from ajithme/broadcastlocalprop. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 02:26:52 +08:00
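The stale-property problem described in the commit above can be reproduced outside Spark. The sketch below is an assumption-laden stand-in: `LocalPropsSketch` and its `localProps` are hypothetical and only mimic the inheritable thread-local behaviour of `SparkContext#localProperties`. It contrasts relying on inheritance (which only happens at thread creation) with capturing the caller's properties at submission time and installing them on the pooled thread.

```scala
import java.util.concurrent.{Callable, Executors}

object LocalPropsSketch {
  // Stand-in for local properties: a child thread receives the parent's
  // value only when the child thread is created.
  val localProps = new InheritableThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // A single pooled thread that is reused across submissions.
  private val pool = Executors.newFixedThreadPool(1)

  // Reads the property on the pooled thread, relying on inheritance only.
  def readInherited(key: String): Option[String] =
    pool.submit(new Callable[Option[String]] {
      def call(): Option[String] = localProps.get().get(key)
    }).get()

  // Captures the caller's properties when the task is submitted and
  // installs them on the pooled thread for the duration of the task.
  def readWithCopy(key: String): Option[String] = {
    val captured = localProps.get()
    pool.submit(new Callable[Option[String]] {
      def call(): Option[String] = {
        val previous = localProps.get()
        localProps.set(captured)
        try localProps.get().get(key) finally localProps.set(previous)
      }
    }).get()
  }

  def shutdown(): Unit = pool.shutdown()
}
```

If the caller changes its properties after the pooled thread has been created, `readInherited` keeps returning the value seen at thread creation, while `readWithCopy` sees the current value.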
Maxim Gekk	afaeb29599	[SPARK-30808][SQL] Enable Java 8 time API in Thrift server ### What changes were proposed in this pull request? - Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it at the end of the call. - Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting. ### Why are the changes needed? Because the textual representation of timestamps/dates before the year 1582 is incorrect: ```shell $ export TZ="America/Los_Angeles" $ ./bin/spark-sql -S ``` ```sql spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); 1001-01-01 00:07:02 ``` It must be 1001-01-01 00:00:00. ### Does this PR introduce any user-facing change? Yes. After the changes: ```shell $ export TZ="America/Los_Angeles" $ ./bin/spark-sql -S ``` ```sql spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); 1001-01-01 00:00:00 ``` ### How was this patch tested? By running hive-thriftserver tests. In particular: ``` ./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite" ``` Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 02:15:44 +08:00
Yuanjian Li	5ffc5ff55e	[SPARK-11150][SQL][FOLLOWUP] Move sql/dynamicpruning to sql/execution/dynamicpruning ### What changes were proposed in this pull request? Follow-up work for #25600. In this PR, we move `sql/dynamicpruning` to `sql/execution/dynamicpruning`. ### Why are the changes needed? Fix the unexpected public APIs in 3.0.0 #27560. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #27581 from xuanyuanking/SPARK-11150-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 01:44:14 +08:00
wangguangxin.cn	0ae3ff60c4	[SPARK-30806][SQL] Evaluate once per group in UnboundedWindowFunctionFrame ### What changes were proposed in this pull request? We only need to do aggregate evaluation once per group in `UnboundedWindowFunctionFrame`. ### Why are the changes needed? Currently, `UnboundedWindowFunctionFrame.write` re-evaluates the processor for each row in a group, which is unnecessary. It hurts performance when the evaluation is time-consuming (for example, Percentile's eval needs to sort its buffer and do some calculation). In our production, a SQL query with a percentile window operation takes more than 10 hours in SparkSQL but 10 minutes in Hive. In fact, `UnboundedWindowFunctionFrame` can be treated as a `SlidingWindowFunctionFrame` with `lbound = UnboundedPreceding` and `ubound = UnboundedFollowing`, as its comments say. In that case, `SlidingWindowFunctionFrame` also does the evaluation only once per group. The performance issue can be reproduced by running the following script in a local spark-shell ``` spark.range(100100).map(i => (i, "India")).toDF("uv", "country").createOrReplaceTempView("test") sql("select uv, country, percentile(uv, 0.95) over (partition by country) as ptc95 from test").collect.foreach(println) ``` Before this patch, the SQL takes 128048 ms. With this patch, it takes 3485 ms. If we increase the data size to 10001000, for example, Spark cannot even produce a result without this patch (I waited for several hours). ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Existing UT Closes #27558 from WangGuangxin/windows. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: herman <herman@databricks.com>	2020-02-17 18:15:54 +01:00
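The cost difference described in the commit above can be sketched without Spark. The code below is an illustration only: `UnboundedFrameSketch` is a hypothetical name, and its `percentile` is a simplified nearest-rank stand-in for an expensive aggregate, not Spark's `Percentile` implementation. For an unbounded frame every row of a group sees the same result, so one evaluation per group is enough.

```scala
object UnboundedFrameSketch {
  // Counts how many times the (possibly expensive) aggregate is evaluated.
  var evaluations = 0

  // Simplified stand-in for an expensive aggregate: sorts the whole buffer.
  def percentile(values: Seq[Long], p: Double): Long = {
    evaluations += 1
    val sorted = values.sorted
    sorted(math.min(sorted.size - 1, (p * sorted.size).toInt))
  }

  // Re-evaluates the aggregate for every row of the group.
  def perRow(group: Seq[Long], p: Double): Seq[(Long, Long)] =
    group.map(v => (v, percentile(group, p)))

  // Evaluates once per group and reuses the result for every row.
  def oncePerGroup(group: Seq[Long], p: Double): Seq[(Long, Long)] = {
    val result = percentile(group, p)
    group.map(v => (v, result))
  }
}
```

Both functions return the same rows; the per-row variant performs one sort per input row, which is where the quadratic-or-worse behaviour on large groups comes from.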
Yuanjian Li	e4a541b278	[SPARK-30829][SQL] Define LegacyBehaviorPolicy enumeration as the common value for result change configs ### What changes were proposed in this pull request? Define a new enumeration `LegacyBehaviorPolicy` in SQLConf, it will be used as the common value for result change configs. ### Why are the changes needed? During API auditing for the 3.0 release, we found several new approaches that will change the results silently. For these features, we need a common three-value config. ### Does this PR introduce any user-facing change? Yes, original config `spark.sql.legacy.ctePrecedence.enabled` change to `spark.sql.legacy.ctePrecedencePolicy`. ### How was this patch tested? Existing UT. Closes #27579 from xuanyuanking/SPARK-30829. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 00:52:05 +08:00
Arwin Tio	25e9156bc0	[SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary ### What changes were proposed in this pull request? See JIRA: https://issues.apache.org/jira/browse/SPARK-29089 Mailing List: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html When using DataFrameReader#csv to read many files on S3, globbing and fs.exists in DataSource#checkAndGlobPathIfNecessary become a bottleneck. From the mailing list discussions, an improvement that can be made is to parallelize the blocking FS calls: > - have SparkHadoopUtils differentiate between files returned by globStatus(), and which therefore exist, and those which it didn't glob for -it will only need to check those. > - add parallel execution to the glob and existence checks ### Why are the changes needed? Verifying/globbing files happens on the driver, and if these operations take a long time (for example against S3), then the entire cluster has to wait, potentially sitting idle. This change aims to make this process faster. ### Does this PR introduce any user-facing change? No ### How was this patch tested? I added a test suite `DataSourceSuite` - open to suggestions for better naming. See [here](https://github.com/apache/spark/pull/25899#issuecomment-534380034) and [here](https://github.com/apache/spark/pull/25899#issuecomment-534069194) for some measurements Closes #25899 from cozos/master. Lead-authored-by: Arwin Tio <Arwin.tio@adroll.com> Co-authored-by: Arwin Tio <arwin.tio@hotmail.com> Co-authored-by: Arwin Tio <arwin.tio@adroll.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-02-17 09:30:35 -06:00
Maxim Gekk	06217cfded	[SPARK-30793][SQL] Fix truncations of timestamps before the epoch to minutes and seconds ### What changes were proposed in this pull request? In the PR, I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `SECOND` and `MINUTE` levels. ### Why are the changes needed? This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `SECOND` and `MINUTE` levels. For example, timestamps after the epoch are truncated by cutting off the rest part of the timestamp: ```sql spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123'); 2020-02-11 00:01:02 ``` but seconds in the truncated timestamp before the epoch are increased by 1: ```sql spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'); 1960-02-11 00:01:03 ``` ### Does this PR introduce any user-facing change? Yes. After the changes, the example above outputs correct result: ```sql spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'); 1960-02-11 00:01:02 ``` ### How was this patch tested? Added new tests to `DateFunctionsSuite`. Closes #27543 from MaxGekk/fix-second-minute-truc. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 22:51:56 +08:00
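The difference between `%` and `Math.floorMod` that the commit above relies on can be seen on plain microsecond counts. This is a minimal sketch, assuming truncation is computed as "value minus remainder"; `TruncSketch` and its method names are hypothetical and are not the body of `DateTimeUtils.truncTimestamp`.

```scala
object TruncSketch {
  val MicrosPerSecond = 1000000L

  // With `%` the remainder takes the sign of the dividend, so a timestamp
  // before the epoch (negative micros) is rounded toward the epoch, i.e. up.
  def truncWithRem(micros: Long, unit: Long): Long =
    micros - micros % unit

  // With floorMod the remainder is never negative for a positive unit,
  // so the result is always rounded down, before and after the epoch.
  def truncWithFloorMod(micros: Long, unit: Long): Long =
    micros - Math.floorMod(micros, unit)
}
```

For a value 1.5 seconds before the epoch (`-1500000` microseconds), the `%` variant yields `-1000000` (one second too late), while the `floorMod` variant yields `-2000000`. For non-negative inputs the two agree.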
Yuanjian Li	ab186e3659	[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, adding a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, a new config is added and the default behavior for duplicated map keys is changed to throw a RuntimeException. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 22:06:58 +08:00
Maxim Gekk	9107f77f15	[SPARK-30843][SQL] Fix getting of time components before 1582 year ### What changes were proposed in this pull request? 1. Rewrite DateTimeUtils methods `getHours()`, `getMinutes()`, `getSeconds()`, `getSecondsWithFraction()`, `getMilliseconds()` and `getMicroseconds()` using Java 8 time APIs. This will automatically switch the `Hour`, `Minute`, `Second` and `DatePart` expressions on Proleptic Gregorian calendar. 2. Remove unused methods and constant of DateTimeUtils - `to2001`, `YearZero `, `toYearZero` and `absoluteMicroSecond()`. 3. Remove unused value `timeZone` from `TimeZoneAwareExpression` since all expressions have been migrated to Java 8 time API, and legacy instance of `TimeZone` is not needed any more. 4. Change signatures of modified DateTimeUtils methods, and pass `ZoneId` instead of `TimeZone`. This will allow to avoid unnecessary conversions `TimeZone` -> `String` -> `ZoneId`. 5. Modify tests in `DateTimeUtilsSuite` and in `DateExpressionsSuite` to pass `ZoneId` instead of `TimeZone`. Correct the tests, to pass tested zone id instead of None. ### Why are the changes needed? The changes fix the issue of wrong results returned by the `hour()`, `minute()`, `second()`, `date_part('millisecond', ...)` and `date_part('microsecond', ....)`, see example in [SPARK-30843](https://issues.apache.org/jira/browse/SPARK-30843). ### Does this PR introduce any user-facing change? Yes. After the changes, the results of examples from SPARK-30843: ```sql spark-sql> select hour(timestamp '0010-01-01 00:00:00'); 0 spark-sql> select minute(timestamp '0010-01-01 00:00:00'); 0 spark-sql> select second(timestamp '0010-01-01 00:00:00'); 0 spark-sql> select date_part('milliseconds', timestamp '0010-01-01 00:00:00'); 0.000 spark-sql> select date_part('microseconds', timestamp '0010-01-01 00:00:00'); 0 ``` ### How was this patch tested? - By existing test suites `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`. 
- Add new tests to `DateExpressionsSuite` and `DateTimeUtilsSuite` for 10 year, like: ```scala input = date(10, 1, 1, 0, 0, 0, 0, zonePST) assert(getHours(input, zonePST) === 0) ``` - Re-run `DateTimeBenchmark` using Amazon EC2. \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) \| \| Java \| OpenJDK8/11 \| Closes #27596 from MaxGekk/localtimestamp-greg-cal. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-1-30.us-west-2.compute.internal> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 13:59:21 +08:00
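Extracting time components through the Java 8 time API, as the commit above describes, can be sketched as follows. This is an illustration under stated assumptions: `TimeFieldsSketch` and its helpers are hypothetical, they only mirror the idea of taking a `ZoneId` and microseconds since the epoch, and they are not the actual `DateTimeUtils` code.

```scala
import java.time.{Instant, ZoneId}

object TimeFieldsSketch {
  private val MicrosPerSecond = 1000000L

  // Converts microseconds since the epoch to an Instant; floorDiv/floorMod
  // keep the nanosecond part non-negative for timestamps before the epoch.
  def microsToInstant(micros: Long): Instant = {
    val secs = Math.floorDiv(micros, MicrosPerSecond)
    val mos = Math.floorMod(micros, MicrosPerSecond)
    Instant.ofEpochSecond(secs, mos * 1000L)
  }

  // Converts an Instant back to microseconds since the epoch.
  def instantToMicros(instant: Instant): Long =
    Math.addExact(
      Math.multiplyExact(instant.getEpochSecond, MicrosPerSecond),
      instant.getNano / 1000L)

  // java.time works on the Proleptic Gregorian calendar, so the fields are
  // read from the zoned date-time directly, with no calendar switch.
  def getHours(micros: Long, zoneId: ZoneId): Int =
    microsToInstant(micros).atZone(zoneId).getHour

  def getMinutes(micros: Long, zoneId: ZoneId): Int =
    microsToInstant(micros).atZone(zoneId).getMinute

  def getSeconds(micros: Long, zoneId: ZoneId): Int =
    microsToInstant(micros).atZone(zoneId).getSecond
}
```

Passing a `ZoneId` directly, rather than a legacy `TimeZone`, matches the signature change the commit mentions and avoids converting between the two time-zone types.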
Wenchen Fan	ab07c6300c	[SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view ### What changes were proposed in this pull request? No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption. However, `ResolveSessionCatalog` needs to fallback to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we add a hack in `CatalogAndIdentifier`, which does not expand the given identifier with current namespace if the catalog is session catalog. This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to table "default.t" if temp view doesn't exist. However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view. This PR proposes to fix this issue by 1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views. 2. update `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views. ### Why are the changes needed? To avoid releasing a behavior that we should not support. Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937 ### Does this PR introduce any user-facing change? yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix. ### How was this patch tested? new tests Closes #27550 from cloud-fan/ns. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 12:07:46 +08:00
Maxim Gekk	8b73b92aad	[SPARK-30826][SQL] Respect reference case in `StringStartsWith` pushed down to parquet ### What changes were proposed in this pull request? In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet. ### Why are the changes needed? This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table: ```sql CREATE TABLE t1 (col STRING) USING parquet OPTIONS (path '$path') ``` created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result: ```scala spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show +---+ \|col\| +---+ +---+ ``` ### Does this PR introduce any user-facing change? Yes. After the changes the result is correct for the example above: ```scala spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show +---+ \|col\| +---+ \| 42\| +---+ ``` ### How was this patch tested? Added a test to `ParquetFilterSuite` Closes #27574 from MaxGekk/parquet-StringStartsWith-case-sens. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-15 19:49:58 +08:00
DB Tsai	d0f9614760	[SPARK-30289][SQL] Partitioned by Nested Column for `InMemoryTable` ### What changes were proposed in this pull request? 1. `InMemoryTable` was flattening the nested columns, and the flattened columns were then used to look up the indices, which is not correct. This PR implements partitioned by nested column for `InMemoryTable`. ### Why are the changes needed? This PR implements partitioned by nested column for `InMemoryTable`, so we can test this feature in DSv2 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests and new tests. Closes #26929 from dbtsai/addTests. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-02-14 21:46:01 +00:00
Maxim Gekk	7137a6d065	[SPARK-30766][SQL] Fix the timestamp truncation to the `HOUR` and `DAY` levels ### What changes were proposed in this pull request? In the PR, I propose to use Java 8 time API in timestamp truncations to the levels of `HOUR` and `DAY`. The problem is in the usage of `timeZone.getOffset(millis)` in days/hours truncations where the combined calendar (Julian + Gregorian) is used underneath. ### Why are the changes needed? The change fixes wrong truncations. For example, the following truncation to hours should print `0010-01-01 01:00:00` but it outputs a wrong timestamp: ```scala Seq("0010-01-01 01:02:03.123456").toDF() .select($"value".cast("timestamp").as("ts")) .select(date_trunc("HOUR", $"ts").cast("string")) .show(false) +------------------------------------+ \|CAST(date_trunc(HOUR, ts) AS STRING)\| +------------------------------------+ \|0010-01-01 01:30:17 \| +------------------------------------+ ``` ### Does this PR introduce any user-facing change? Yes. After the changes, the result of the example above is: ```scala +------------------------------------+ \|CAST(date_trunc(HOUR, ts) AS STRING)\| +------------------------------------+ \|0010-01-01 01:00:00 \| +------------------------------------+ ``` ### How was this patch tested? - Added new test to `DateFunctionsSuite` - By `DateExpressionsSuite` and `DateTimeUtilsSuite` Closes #27512 from MaxGekk/fix-trunc-old-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-14 22:16:57 +08:00
HyukjinKwon	2a270a731a	[SPARK-30810][SQL] Parses and convert a CSV Dataset having different column from 'value' in csv(dataset) API ### What changes were proposed in this pull request? This PR fixes `DataFrameReader.csv(dataset: Dataset[String])` API to take a `Dataset[String]` originated from a column name different from `value`. This is a long-standing bug started from the very first place. `CSVUtils.filterCommentAndEmpty` assumed the `Dataset[String]` to be originated with `value` column. This PR changes to use the first column name in the schema. ### Why are the changes needed? For `DataFrameReader.csv(dataset: Dataset[String])` to support any `Dataset[String]` as the signature indicates. ### Does this PR introduce any user-facing change? Yes, ```scala val ds = spark.range(2).selectExpr("concat('a,b,', id) AS text").as[String] spark.read.option("header", true).option("inferSchema", true).csv(ds).show() ``` Before: ``` org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input columns: [text];; 'Filter (length(trim('value, None)) > 0) +- Project [concat(a,b,, cast(id#0L as string)) AS text#2] +- Range (0, 2, step=1, splits=Some(2)) ``` After: ``` +---+---+---+ \| a\| b\| 0\| +---+---+---+ \| a\| b\| 1\| +---+---+---+ ``` ### How was this patch tested? Unittest was added. Closes #27561 from HyukjinKwon/SPARK-30810. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-14 18:20:18 +08:00
maryannxue	0aed77a015	[SPARK-30801][SQL] Subqueries should not be AQE-ed if main query is not ### What changes were proposed in this pull request? This PR makes sure AQE is either enabled or disabled for the entire query, including the main query and all subqueries. Currently there are unsupported queries by AQE, e.g., queries that contain DPP filters. We need to make sure that if the main query is unsupported, none of the sub-queries should apply AQE, otherwise it can lead to performance regressions due to missed opportunity of sub-query reuse. ### Why are the changes needed? To get rid of potential perf regressions when AQE is turned on. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated DynamicPartitionPruningSuite: 1. Removed the existing workaround `withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")` 2. Added `DynamicPartitionPruningSuiteAEOn` and `DynamicPartitionPruningSuiteAEOff` to enable testing this suite with AQE on and off options 3. Added a check in `checkPartitionPruningPredicate` to verify that the subqueries are always in sync with the main query in terms of whether AQE is applied. Closes #27554 from maryannxue/spark-30801. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-14 11:20:55 +08:00
Ali Afroozeh	e2d3983de7	[SPARK-30798][SQL] Scope Session.active in QueryExecution ### What changes were proposed in this pull request? This PR scopes `SparkSession.active` to prevent problems with processing queries with possibly different spark sessions (and different configs). A new method, `withActive` is introduced on `SparkSession` that restores the previous spark session after the block of code is executed. ### Why are the changes needed? `SparkSession.active` is a thread local variable that points to the current thread's spark session. It is important to note that the `SQLConf.get` method depends on `SparkSession.active`. In the current implementation it is possible that `SparkSession.active` points to a different session which causes various problems. Most of these problems arise because part of the query processing is done using the configurations of a different session. For example, when creating a data frame using a new session, i.e., `session.sql("...")`, part of the data frame is constructed using the currently active spark session, which can be a different session from the one used later for processing the query. ### Does this PR introduce any user-facing change? The `withActive` method is introduced on `SparkSession`. ### How was this patch tested? Unit tests (to be added) Closes #27387 from dbaliafroozeh/UseWithActiveSessionInQueryExecution. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-02-13 23:58:55 +01:00
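The scoping pattern described in the commit above (set the active session for a block, then restore the previous one) can be sketched with a plain thread-local. `SessionSketch` is a hypothetical stand-in for illustration; it is not `SparkSession`, and only the set-and-restore behaviour of `withActive` is taken from the commit message.

```scala
final class SessionSketch(val name: String) {
  // Makes this session the active one while `block` runs, then restores
  // whichever session was active before, even if `block` throws.
  def withActive[T](block: => T): T = {
    val previous = SessionSketch.active
    SessionSketch.setActive(Some(this))
    try block finally SessionSketch.setActive(previous)
  }
}

object SessionSketch {
  // Thread-local pointer to the current thread's active session.
  private val activeSession = new ThreadLocal[Option[SessionSketch]] {
    override def initialValue(): Option[SessionSketch] = None
  }

  def active: Option[SessionSketch] = activeSession.get()

  def setActive(session: Option[SessionSketch]): Unit = activeSession.set(session)
}
```

Code that reads configuration through the active session inside `withActive` sees the session it was called on, and the caller's previously active session is back in place afterwards.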
Wenchen Fan	a4ceea6868	[SPARK-30751][SQL] Combine the skewed readers into one in AQE skew join optimizations ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26434 This PR uses one special shuffle reader for skew join, so that we only have one join after optimization. In order to do that, this PR 1. adds a very general `CustomShuffledRowRDD` which supports all kinds of partition arrangements. 2. moves the logic of coalescing shuffle partitions to a util function, and calls it during skew join optimization, to fully decouple it from the `ReduceNumShufflePartitions` rule. It's too complicated to make skew join interact with `ReduceNumShufflePartitions`, as you need to consider the size of split partitions, which already don't respect the target size. 
### Why are the changes needed? The current skew join optimization has a serious performance issue: the size of the query plan depends on the number and size of skewed partitions. ### Does this PR introduce any user-facing change? no ### How was this patch tested? 
existing tests test UI manually: ![image](https://user-images.githubusercontent.com/3182036/74357390-cfb30480-4dfa-11ea-83f6-825d1b9379ca.png) explain output ``` AdaptiveSparkPlan(isFinalPlan=true) +- OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$403a2ed5, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap1f +- (5) SortMergeJoin(skew=true) [key1#2L], [key2#6L], Inner :- (3) Sort [key1#2L ASC NULLS FIRST], false, 0 : +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB) : +- ShuffleQueryStage 0 : +- Exchange hashpartitioning(key1#2L, 200), true, [id=#53] : +- (1) Project [(id#0L % 2) AS key1#2L] : +- (1) Filter isnotnull((id#0L % 2)) : +- (1) Range (0, 100000, step=1, splits=6) +- (4) Sort [key2#6L ASC NULLS FIRST], false, 0 +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB) +- ShuffleQueryStage 1 +- Exchange hashpartitioning(key2#6L, 200), true, [id=#64] +- (2) Project [((id#4L % 2) + 1) AS key2#6L] +- (2) Filter isnotnull(((id#4L % 2) + 1)) +- *(2) Range (0, 100000, step=1, splits=6) ``` Closes #27493 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-02-13 20:09:24 +01:00
beliefer	04604b9899	[SPARK-30758][SQL][TESTS] Improve bracketed comments tests ### What changes were proposed in this pull request? Although Spark SQL supports bracketed comments, `SQLQueryTestSuite` does not handle them well, so the generated golden files cannot display bracketed comments properly. This PR improves the handling of bracketed comments and adds three test cases in `PlanParserSuite`. Spark SQL does not support nested bracketed comments; https://github.com/apache/spark/pull/27495 is intended to add that support. ### Why are the changes needed? Golden files can't display bracketed comments well. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT. Closes #27481 from beliefer/ansi-brancket-comments. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 22:06:24 +08:00
Terry Kim	a6b4b914f2	[SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax ### What changes were proposed in this pull request? This PR proposes to support Hive-style `ALTER TABLE ... REPLACE COLUMNS ...` as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns The user now can do the following: ```SQL CREATE TABLE t (col1 int, col2 int) USING Foo; ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3'); ``` , which drops the existing columns `col1` and `col2`, and add new columns `col2` and `col3`. ### Why are the changes needed? This is a new DDL statement. Spark currently supports the Hive-style `ALTER TABLE ... CHANGE COLUMN ...`, so this new addition can be useful. ### Does this PR introduce any user-facing change? Yes, adding a new DDL statement. ### How was this patch tested? More tests to be added. Closes #27482 from imback82/replace_cols. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 20:13:36 +08:00
maryannxue	453d5261b2	[SPARK-30528][SQL] Turn off DPP subquery duplication by default ### What changes were proposed in this pull request? This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression. When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation being on the build side, otherwise it will either opt out or plan the filter as an un-reusable subquery duplication based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, thus adding an un-reusable subquery duplication DPP filter can sometimes cause perf regression. This PR turns off the subquery duplication DPP filter by: 1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default. 2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible. ### Why are the changes needed? This is to fix a potential performance regression caused by DPP. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Updated DynamicPartitionPruningSuite to test the new configuration. Closes #27551 from maryannxue/spark-30528. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 19:32:38 +08:00
iRakson	926e3a1efe	[SPARK-30790] The dataType of map() should be map<null,null> ### What changes were proposed in this pull request? `spark.sql("select map()")` returns {}. After these changes it will return map<null,null> ### Why are the changes needed? After changes introduced due to #27521, it is important to maintain consistency while using map(). ### Does this PR introduce any user-facing change? Yes. Now map() will give map<null,null> instead of {}. ### How was this patch tested? UT added. Migration guide updated as well Closes #27542 from iRakson/SPARK-30790. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 12:23:40 +08:00
Thomas Graves	496f6ac860	[SPARK-29148][CORE] Add stage level scheduling dynamic allocation and scheduler backend changes ### What changes were proposed in this pull request? This is another PR for stage level scheduling. In particular this adds changes to the dynamic allocation manager and the scheduler backend to be able to track what executors are needed per ResourceProfile. Note the api is still private to Spark until the entire feature gets in, so this functionality will be there but only usable by tests for profiles other than the DefaultProfile. The main changes here are simply tracking things on a ResourceProfile basis as well as sending the executor requests to the scheduler backend for all ResourceProfiles. I introduce a ResourceProfileManager in this PR that will track all the actual ResourceProfile objects so that we can keep them all in a single place and just pass around and use the resource profile id in data structures. The resource profile id can be used with the ResourceProfileManager to get the actual ResourceProfile contents. There are various places in the code that use executor "slots" for things. The ResourceProfile adds functionality to keep that calculation in it. This logic is more complex than it should be due to standalone mode and mesos coarse grained not setting the executor cores config. They default to all cores on the worker, so calculating slots is harder there. This PR keeps the functionality to make the cores the limiting resource because the scheduler still uses that for "slots" for a few things. This PR does also add the resource profile id to the Stage and stage info classes to be able to test things more easily. That full set of changes will come with the scheduler PR that will be after this one. 
The PR stops at the scheduler backend pieces for the cluster manager and the real YARN support hasn't been added in this PR, that again will be in a separate PR, so this has a few of the API changes up to the cluster manager and then just uses the default profile requests to continue. The code for the entire feature is here for reference: https://github.com/apache/spark/pull/27053/files although it needs to be upmerged again as well. ### Why are the changes needed? Needed for stage level scheduling feature. ### Does this PR introduce any user-facing change? No user facing api changes added here. ### How was this patch tested? Lots of unit tests and manually testing. I tested on yarn, k8s, standalone, local modes. Ran both failure and success cases. Closes #27313 from tgravescs/SPARK-29148. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-02-12 16:45:42 -06:00
Liang-Chi Hsieh	5b76367a9d	[SPARK-30797][SQL] Set tradition user/group/other permission to ACL entries when setting up ACLs in truncate table ### What changes were proposed in this pull request? This is a follow-up to the PR #26956. In #26956, the patch proposed to preserve path permission when truncating table. When setting up original ACLs, we need to set user/group/other permission as ACL entries too, otherwise if the path doesn't have default user/group/other ACL entries, the ACL API will fail with the error `Invalid ACL: the user, group and other entries are required.`. In short this change makes sure: 1. Permissions for user/group/other are always kept into ACLs to work with ACL API. 2. Other custom ACLs are still kept after TRUNCATE TABLE (#26956 did this). ### Why are the changes needed? Without this fix, `TRUNCATE TABLE` will get an error when setting up ACLs if there are no default user/group/other ACL entries. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Update unit test. Manual test on dev Spark cluster. Set ACLs for a table path without default user/group/other ACL entries: ``` hdfs dfs -setfacl --set 'user:liangchi:rwx,user::rwx,group::r--,other::r--' /user/hive/warehouse/test.db/test_truncate_table hdfs dfs -getfacl /user/hive/warehouse/test.db/test_truncate_table # file: /user/hive/warehouse/test.db/test_truncate_table # owner: liangchi # group: supergroup user::rwx user:liangchi:rwx group::r-- mask::rwx other::r-- ``` Then run `sql("truncate table test.test_truncate_table")`; it works, truncating the table normally and preserving the ACLs. Closes #27548 from viirya/fix-truncate-table-permission. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-12 14:27:18 -08:00
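The rule stated in the commit above (the user/group/other permissions must always be present as ACL entries, alongside any custom entries) can be sketched as follows. This is a minimal illustration and not the Spark or Hadoop implementation: `base_acl_entries` and `acl_spec_for_restore` are hypothetical helper names, and the entry strings simply follow the `hdfs dfs -setfacl` notation shown in the commit message.

```python
from typing import List


def base_acl_entries(mode: int) -> List[str]:
    """Turn POSIX permission bits (e.g. 0o744) into the three mandatory ACL entries."""

    def rwx(bits: int) -> str:
        # Map the three low bits to the familiar "rwx" notation.
        return "".join(
            ch if bits & mask else "-" for ch, mask in (("r", 4), ("w", 2), ("x", 1))
        )

    return [
        f"user::{rwx((mode >> 6) & 7)}",
        f"group::{rwx((mode >> 3) & 7)}",
        f"other::{rwx(mode & 7)}",
    ]


def acl_spec_for_restore(mode: int, custom_entries: List[str]) -> List[str]:
    """Hypothetical helper: mandatory entries first, then the custom ACLs to preserve."""
    return base_acl_entries(mode) + list(custom_entries)
```

For the path in the manual test above, `acl_spec_for_restore(0o744, ["user:liangchi:rwx"])` yields the `user::rwx`, `group::r--` and `other::r--` entries followed by the custom entry, which is the shape the ACL API is described as requiring.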
Maxim Gekk	aa0d13683c	[SPARK-30760][SQL] Port `millisToDays` and `daysToMillis` on Java 8 time API ### What changes were proposed in this pull request? In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using Java 8 time API. I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private method, and is not used anymore in Spark SQL. ### Why are the changes needed? The new implementation is based on the Proleptic Gregorian calendar, which is already used by other date-time functions. These changes make `millisToDays` and `daysToMillis` consistent with the rest of the Spark SQL API related to date & time operations. ### Does this PR introduce any user-facing change? Yes, this might affect behavior for old dates before the year 1582. ### How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`. Closes #27494 from MaxGekk/millis-2-days-java8-api. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 02:31:48 +08:00
Eric Wu	5919bd3b8d	[SPARK-30651][SQL] Add detailed information for Aggregate operators in EXPLAIN FORMATTED ### What changes were proposed in this pull request? Currently `EXPLAIN FORMATTED` only report input attributes of HashAggregate/ObjectHashAggregate/SortAggregate, while `EXPLAIN EXTENDED` provides more information of Keys, Functions, etc. This PR enhanced `EXPLAIN FORMATTED` to sync with original explain behavior. ### Why are the changes needed? The newly added `EXPLAIN FORMATTED` got less information comparing to the original `EXPLAIN EXTENDED` ### Does this PR introduce any user-facing change? Yes, taking HashAggregate explain result as example. SQL ``` EXPLAIN FORMATTED SELECT COUNT(val) + SUM(key) as TOTAL, COUNT(key) FILTER (WHERE val > 1) FROM explain_temp1; ``` EXPLAIN EXTENDED ``` == Physical Plan == (2) HashAggregate(keys=[], functions=[count(val#6), sum(cast(key#5 as bigint)), count(key#5)], output=[TOTAL#62L, count(key) FILTER (WHERE (val > 1))#71L]) +- Exchange SinglePartition, true, [id=#89] +- HashAggregate(keys=[], functions=[partial_count(val#6), partial_sum(cast(key#5 as bigint)), partial_count(key#5) FILTER (WHERE (val#6 > 1))], output=[count#75L, sum#76L, count#77L]) +- (1) ColumnarToRow +- FileScan parquet default.explain_temp1[key#5,val#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/XXX/spark-dev/spark/spark-warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,val:int> ``` EXPLAIN FORMATTED - BEFORE ``` == Physical Plan == * HashAggregate (5) +- Exchange (4) +- HashAggregate (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) ... ... (5) HashAggregate [codegen id : 2] Input: [count#91L, sum#92L, count#93L] ... ... ``` EXPLAIN FORMATTED - AFTER ``` == Physical Plan == * HashAggregate (5) +- Exchange (4) +- HashAggregate (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) ... ... 
(5) HashAggregate [codegen id : 2] Input: [count#91L, sum#92L, count#93L] Keys: [] Functions: [count(val#6), sum(cast(key#5 as bigint)), count(key#5)] Results: [(count(val#6)#84L + sum(cast(key#5 as bigint))#85L) AS TOTAL#78L, count(key#5)#86L AS count(key) FILTER (WHERE (val > 1))#87L] Output: [TOTAL#78L, count(key) FILTER (WHERE (val > 1))#87L] ... ... ``` ### How was this patch tested? Three tests added in explain.sql for HashAggregate/ObjectHashAggregate/SortAggregate. Closes #27368 from Eric5553/ExplainFormattedAgg. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 02:00:23 +08:00
Maxim Gekk	c1986204e5	[SPARK-30788][SQL] Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters ### What changes were proposed in this pull request? In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`: - `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see https://github.com/apache/spark/pull/26507 & https://github.com/apache/spark/pull/26582 - `LegacySimpleTimestampFormatter` uses `SimpleDateFormat`, and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input. ### Why are the changes needed? Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings: - `DateTimeFormat` in CSV/JSON datasource - `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing. - `SimpleDateFormat` in strict mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions. The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`. ### Does this PR introduce any user-facing change? This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4. ### How was this patch tested? - Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`. - Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`. Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback. 
Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-12 20:12:38 +08:00
beliefer	f5026b1ba7	[SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract ### What changes were proposed in this pull request? The current implementation of `regexp_extract` throws an unhandled exception, shown below: `SELECT regexp_extract('1a 2b 14m', 'd+')` ``` java.lang.IndexOutOfBoundsException: No group 1 [info] at java.util.regex.Matcher.group(Matcher.java:538) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) ``` I think we should handle this exception properly. ### Why are the changes needed? Fix a bug `java.lang.IndexOutOfBoundsException No group 1 ` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? New UT Closes #27508 from beliefer/fix-regexp_extract-bug. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-12 14:49:22 +08:00
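The failure in the commit above comes from asking a matcher for capture group 1 when the pattern defines no capture groups at all. One possible way to handle it is to validate the requested group index against the pattern before extracting. The sketch below uses Python's `re` module rather than `java.util.regex`; `extract_group` is a hypothetical helper, its error message is invented for illustration, and it is not the fix that landed in Spark.

```python
import re


def extract_group(text: str, pattern: str, group: int = 1) -> str:
    """Hypothetical helper: return the requested capture group of the first match."""
    compiled = re.compile(pattern)
    # compiled.groups is the number of capturing groups the pattern defines.
    if group < 0 or group > compiled.groups:
        raise ValueError(
            f"pattern defines {compiled.groups} group(s), "
            f"but group index {group} was requested"
        )
    match = compiled.search(text)
    if match is None:
        return ""  # assumption: no match yields an empty string
    value = match.group(group)
    # An optional group that did not participate in the match is None.
    return value if value is not None else ""
```

With this check, `extract_group("1a 2b 14m", r"\d+")` fails with a descriptive `ValueError` instead of an index error from the regex engine, while `extract_group("1a 2b 14m", r"(\d+)")` returns the first run of digits.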
herman	b25359cca3	[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions ### What changes were proposed in this pull request? This is a small follow-up for https://github.com/apache/spark/pull/27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions. ### Why are the changes needed? It is a bit unexpected that the RDD contains partitions if there is no work to do. It also can save a bit of work when this is used in a more complex plan. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to `SparkPlanSuite`. Closes #27530 from hvanhovell/SPARK-30780. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 10:48:29 +09:00
Maxim Gekk	45db48e2d2	Revert "[SPARK-30625][SQL] Support `escape` as third parameter of the `like` function ### What changes were proposed in this pull request? In the PR, I propose to revert the commit `8aebc80e0e`. ### Why are the changes needed? See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suites. Closes #27531 from MaxGekk/revert-like-3-args. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-11 10:15:34 -08:00
fuwhu	f1d0dce484	[MINOR][DOC] Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions ### What changes were proposed in this pull request? Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions. ### Why are the changes needed? To describe these two classes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? no Closes #27535 from fuwhu/SPARK-15616-FOLLOW-UP. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-11 22:16:44 +08:00
HyukjinKwon	0045be766b	[SPARK-29462][SQL] The data type of "array()" should be array<null> ### What changes were proposed in this pull request? This brings https://github.com/apache/spark/pull/26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-11 17:22:08 +09:00
Yuanjian Li	a6b91d2bf7	[SPARK-30556][SQL][FOLLOWUP] Reset the status changed in SQLExecution.withThreadLocalCaptured ### What changes were proposed in this pull request? Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured. ### Why are the changes needed? For code safety. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #27516 from xuanyuanking/SPARK-30556-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: herman <herman@databricks.com>	2020-02-10 22:16:25 +01:00
Liang-Chi Hsieh	acfdb46a60	[SPARK-27946][SQL][FOLLOW-UP] Change doc and error message for SHOW CREATE TABLE ### What changes were proposed in this pull request? This is a follow-up for #24938 to tweak error message and migration doc. ### Why are the changes needed? Making user know workaround if SHOW CREATE TABLE doesn't work for some Hive tables. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #27505 from viirya/SPARK-27946-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2020-02-10 10:45:00 -08:00
jiake	5a240603fd	[SPARK-30719][SQL] Add unit test to verify the log warning print when intentionally skip AQE ### What changes were proposed in this pull request? This is a follow-up to [#27452](https://github.com/apache/spark/pull/27452). Add a unit test to verify whether the log warning is printed when AQE is intentionally skipped. ### Why are the changes needed? Add unit test ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a unit test. Closes #27515 from JkSelf/aqeLoggingWarningTest. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-10 21:48:00 +08:00
Kent Yao	58b9ca1e6f	[SPARK-30592][SQL][FOLLOWUP] Add some round-trip test cases ### What changes were proposed in this pull request? Add round-trip tests for CSV and JSON functions as https://github.com/apache/spark/pull/27317#discussion_r376745135 asked. ### Why are the changes needed? improve test coverage ### Does this PR introduce any user-facing change? no ### How was this patch tested? add uts Closes #27510 from yaooqinn/SPARK-30592-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-10 16:23:44 +09:00
Liang-Chi Hsieh	9f8172e96a	Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project This reverts commit `a0e63b61e7`. ### What changes were proposed in this pull request? This reverts the patch at #26978 based on gatorsmile's suggestion. ### Why are the changes needed? Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27504 from viirya/revert-SPARK-29721. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-09 19:45:16 -08:00
Gengliang Wang	b877aac146	[SPARK-30684 ][WEBUI][FollowUp] A new approach for SPARK-30684 ### What changes were proposed in this pull request? Simplify the changes for adding metrics description for WholeStageCodegen in https://github.com/apache/spark/pull/27405 ### Why are the changes needed? In https://github.com/apache/spark/pull/27405, the UI changes can be made without using the function `adjustPositionOfOperationName` to adjust the position of operation name and mark as an operation-name class. I suggest we make simpler changes so that it would be easier for future development. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test with the queries provided in https://github.com/apache/spark/pull/27405 ``` sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output") sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output") ``` ![image](https://user-images.githubusercontent.com/1097932/74073629-e3f09f00-49bf-11ea-90dc-1edb5ca29e5e.png) Closes #27490 from gengliangwang/wholeCodegenUI. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-02-09 14:18:51 -08:00
Nicholas Chammas	339c0f9a62	[SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options ### What changes were proposed in this pull request? This PR adds a doc builder for Spark SQL's configuration options. Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)): ![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png) Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql): ![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png) ### Why are the changes needed? There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql). ### Does this PR introduce any user-facing change? No, apart from new documentation. ### How was this patch tested? I tested this manually by building the docs and reviewing them in my browser. Closes #27459 from nchammas/SPARK-30510-spark-sql-options. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-09 19:20:47 +09:00
Yuanjian Li	3db3e39f11	[SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause ### What changes were proposed in this pull request? This is a follow-up for #25029. In this PR we throw an AnalysisException when a name conflict is detected in a nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the expected behavior. ### Why are the changes needed? The original change might be risky for end users because it changes behavior silently. ### Does this PR introduce any user-facing change? Yes, this changes the config `spark.sql.legacy.ctePrecedence.enabled` to be optional. ### How was this patch tested? New UT. Closes #27454 from xuanyuanking/SPARK-28228-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-08 14:10:28 -08:00
Terry Kim	a7451f44d2	[SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time ### What changes were proposed in this pull request? The current ALTER COLUMN syntax allows to change multiple properties at a time: ``` ALTER TABLE table=multipartIdentifier (ALTER \| CHANGE) COLUMN? column=multipartIdentifier (TYPE dataType)? (COMMENT comment=STRING)? colPosition? ``` The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan) This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time. Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior. ### Why are the changes needed? To follow SQL standard (and other recent SQL systems) behavior. ### Does this PR introduce any user-facing change? Yes, now the user can update the column properties only one at a time. For example, ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment' ``` should be broken into ``` ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment' ``` ### How was this patch tested? Updated existing tests. Closes #27444 from imback82/alter_column_one_at_a_time. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:47:44 +08:00
Maxim Gekk	a3e77773cf	[SPARK-30752][SQL] Fix `to_utc_timestamp` on daylight saving day ### What changes were proposed in this pull request? - Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API - Change types of `convertTz` parameters from `TimeZone` to `ZoneId`. This allows to avoid unnecessary conversions `TimeZone` -> `ZoneId` and performance regressions as a consequence. ### Why are the changes needed? - Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example: ```scala scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show +-------------------+ \| local UTC\| +-------------------+ \|2019-11-03 03:00:00\| +-------------------+ ``` but the result must be 2019-11-03 04:00:00: <img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png"> - Simplifies the code, and make it more maintainable - Switches `convertTz` on Proleptic Gregorian calendar used by Java 8 time classes by default. That makes the function consistent to other date-time functions. ### Does this PR introduce any user-facing change? Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`. ### How was this patch tested? - By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. - Added `convert time zones on a daylight saving day` to DateFunctionsSuite Closes #27474 from MaxGekk/port-convertTz-on-Java8-api. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-08 02:32:07 +08:00
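The expected value in the commit above can be checked independently of Spark. The sketch below is only an illustration of the conversion (interpret a local timestamp in a named zone, then express it in UTC); it uses Python's `zoneinfo` instead of the Java 8 time API the commit refers to, `to_utc` is a hypothetical helper, and it assumes time zone data is available on the machine.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_ts: str, zone: str) -> datetime:
    """Interpret an ISO local timestamp in the given zone and convert it to UTC."""
    local = datetime.fromisoformat(local_ts).replace(tzinfo=ZoneInfo(zone))
    return local.astimezone(timezone.utc)
```

`Asia/Hong_Kong` is at UTC+8 on that date, so `to_utc("2019-11-03T12:00:00", "Asia/Hong_Kong")` gives 04:00 UTC, matching the corrected result `2019-11-03 04:00:00` quoted in the commit message.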
Wenchen Fan	5a4c70b4e2	[SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26656. We don't support window aggregate function with filter predicate yet and we should fail explicitly. Observable metrics has the same issue. This PR fixes it as well. ### Why are the changes needed? If we simply ignore filter predicate when we don't support it, the result is wrong. ### Does this PR introduce any user-facing change? yea, fix the query result. ### How was this patch tested? new tests Closes #27476 from cloud-fan/filter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-06 13:33:39 -08:00
Wenchen Fan	8ce58627eb	[SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply ### What changes were proposed in this pull request? Update `InsertAdaptiveSparkPlan` to not log a warning if AQE is skipped intentionally. This PR also adds a config to not skip AQE. ### Why are the changes needed? It's not a warning at all if we intentionally skip AQE. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Run `AdaptiveQueryExecSuite` locally and verify that there are no warning logs. Closes #27452 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-02-06 09:16:14 -08:00
yi.wu	368ee62a5d	[SPARK-27297][DOC][FOLLOW-UP] Improve documentation for various Scala functions ### What changes were proposed in this pull request? Add examples and parameter description for these Scala functions: * transform * exists * forall * aggregate * zip_with * transform_keys * transform_values * map_filter * map_zip_with ### Why are the changes needed? Better documentation for UX. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27449 from Ngone51/doc-funcs. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:34:29 +08:00
yi.wu	3f5b23340e	[SPARK-30744][SQL] Optimize AnalyzePartitionCommand by calculating location sizes in parallel ### What changes were proposed in this pull request? Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel. ### Why are the changes needed? For better performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #27471 from Ngone51/dev_calculate_in_parallel. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 20:20:44 +08:00
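The idea in the commit above, computing the size of each partition location concurrently rather than one after another, can be sketched generically. This is not `CommandUtils.calculateTotalLocationSize`; it is a local-filesystem illustration with hypothetical helper names, and the thread pool size is an arbitrary assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def location_size(path: str) -> int:
    """Sum the sizes of all files under one location (walks subdirectories)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def location_sizes_in_parallel(paths: Iterable[str], max_workers: int = 8) -> Dict[str, int]:
    """Hypothetical helper: compute each location's size on a thread pool."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its size.
        return dict(zip(path_list, pool.map(location_size, path_list)))
```

Threads are a reasonable fit here because the work is dominated by filesystem calls rather than computation; the per-location results stay separate so that each partition's statistics can be recorded individually.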
beliefer	c8ef1dee90	[SPARK-29108][SQL][TESTS][FOLLOWUP] Comment out no use test case and add 'insert into' statement of window.sql (Part 2) ### What changes were proposed in this pull request? While running the `window_part2.sql` tests, I found that they lack the insert SQL, so the output is empty. I checked PostgreSQL and the reference https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql Although `window_part1.sql` and `window_part3.sql` contain the insert SQL, I think we should also add it to `window_part2.sql`, because only one case references the table `empsalary` and it throws `AnalysisException`. ``` -- !query select last(salary) over(order by salary range between 1000 preceding and 1000 following), lag(salary) over(order by salary range between 1000 preceding and 1000 following), salary from empsalary -- !query schema struct<> -- !query output org.apache.spark.sql.AnalysisException Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1); ``` So we should do the following: 1. Comment out that one case and create a new ticket. 2. Add `INSERT INTO empsalary`. Note: `window_part4.sql` does not use the table `empsalary`. ### Why are the changes needed? Supplementary test data. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New test case Closes #27439 from beliefer/add-insert-to-window. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-06 15:24:26 +09:00
Terry Kim	c27a616450	[SPARK-30612][SQL] Resolve qualified column name with v2 tables ### What changes were proposed in this pull request? This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables. This PR would allow qualified column names in query as following: ```SQL SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl SELECT tbl.foo FROM testcat.ns1.ns2.tbl ``` ### Why are the changes needed? This is a bug because you cannot qualify column names in queries. ### Does this PR introduce any user-facing change? Yes, now users can qualify column names for v2 tables. ### How was this patch tested? Added new tests. Closes #27391 from imback82/qualified_col. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-06 13:54:17 +08:00
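The four queries in the commit above qualify `foo` with progressively shorter tails of the table identifier `testcat.ns1.ns2.tbl`. A rule consistent with those examples is that the qualifier must be a suffix of the table's name parts. The sketch below encodes only that reading; `qualifier_matches` is a hypothetical helper, it assumes the last dotted part is the column name (so nested struct fields are out of scope), and the actual resolution rules in Spark may differ.

```python
def qualifier_matches(column_ref: str, table_ident: str) -> bool:
    """Return True if the column reference's qualifier is a suffix of the table identifier."""
    table_parts = table_ident.split(".")
    # Everything before the last dot is the qualifier; the last part is the column.
    *qualifier, _column = column_ref.split(".")
    if not qualifier:
        return True  # an unqualified column needs no qualifier check
    if len(qualifier) > len(table_parts):
        return False
    return table_parts[-len(qualifier):] == qualifier
```

Under this rule `tbl.foo`, `ns2.tbl.foo`, `ns1.ns2.tbl.foo` and `testcat.ns1.ns2.tbl.foo` all match `testcat.ns1.ns2.tbl`, while a reference such as `other.foo` does not.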
Wenchen Fan	3b26f807a0	[SPARK-30721][SQL][TESTS] Fix DataFrameAggregateSuite when enabling AQE ### What changes were proposed in this pull request? update `DataFrameAggregateSuite` to make it pass with AQE ### Why are the changes needed? We don't need to turn off AQE in `DataFrameAggregateSuite` ### Does this PR introduce any user-facing change? no ### How was this patch tested? run `DataFrameAggregateSuite` locally with AQE on. Closes #27451 from cloud-fan/aqe-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 12:36:51 -08:00
Yuanjian Li	4938905a1c	[SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message ### What changes were proposed in this pull request? Follow up work for SPARK-29864, reference the config `spark.sql.legacy.fromDayTimeString.enabled` in error message. ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27464 from xuanyuanking/SPARK-29864-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-05 11:19:42 -08:00
turbofei	6d507b4a31	[SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer ### What changes were proposed in this pull request? When spark.sql.ansi.enabled is true, for the statement: ``` select cast(cast(2147483648 as Float) as Integer) //result is 2147483647 ``` Its result is 2147483647 and does not throw `ArithmeticException`. The root cause is that, the below code does not work for some corner cases. `94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)` For example: ![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png) In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly. ### Why are the changes needed? Result corrupt. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added Unit test. Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 21:24:02 +08:00
Maxim Gekk	459e757ed4	[SPARK-30668][SQL] Support `SimpleDateFormat` patterns in parsing timestamps/dates strings ### What changes were proposed in this pull request? In the PR, I propose to partially revert the commit `51a6ba0181`, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`. ### Why are the changes needed? To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added new test to `DateFunctionsSuite` - Restored additional test cases in `JsonInferSchemaSuite`. Closes #27441 from MaxGekk/support-simpledateformat. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 18:48:45 +08:00
Liang-Chi Hsieh	7631275f97	[SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser ### What changes were proposed in this pull request? This is a follow-up for #22787. In #22787 we disallowed empty strings for json parser except for string and binary types. This follow-up adds a legacy config for restoring previous behavior of allowing empty string. ### Why are the changes needed? Adding a legacy config to make migration easy for Spark users. ### Does this PR introduce any user-facing change? Yes. If set this legacy config to true, the users can restore previous behavior prior to Spark 3.0.0. ### How was this patch tested? Unit test. Closes #27456 from viirya/SPARK-25040-followup. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-04 17:22:23 -08:00
maryannxue	6097b343ba	[SPARK-30717][SQL] AQE subquery map should cache `SubqueryExec` instead of `ExecSubqueryExpression` ### What changes were proposed in this pull request? This PR is to fix a potential bug in AQE where an `ExecSubqueryExpression` could be mistakenly replaced with another `ExecSubqueryExpression` with the same `ListQuery` but a different `child` expression. This is because a ListQuery's id can only identify the ListQuery itself, not the parent expression `InSubquery`, but right now the `subqueryMap` in `InsertAdaptiveSparkPlan` uses the `ListQuery`'s id as key and the corresponding `InSubqueryExec` for the `ListQuery`'s parent expression as value. So the fix uses the corresponding `SubqueryExec` for the `ListQuery` itself as the map's value. ### Why are the changes needed? This logical bug could potentially cause a wrong query plan, which could throw an exception related to unresolved columns. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing UTs. Closes #27446 from maryannxue/spark-30717. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-04 12:31:44 +08:00
Yuanjian Li	a4912cee61	[SPARK-29543][SS][FOLLOWUP] Move `spark.sql.streaming.ui.*` configs to StaticSQLConf ### What changes were proposed in this pull request? Put the configs below needed by Structured Streaming UI into StaticSQLConf: - spark.sql.streaming.ui.enabled - spark.sql.streaming.ui.retainedProgressUpdates - spark.sql.streaming.ui.retainedQueries ### Why are the changes needed? Make all SS UI configs consistent with other similar configs in usage and naming. ### Does this PR introduce any user-facing change? Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`. ### How was this patch tested? Existing UT. Closes #27425 from xuanyuanking/SPARK-29543-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-02-02 23:37:13 -08:00
Burak Yavuz	2eccfd8a73	[SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView ### What changes were proposed in this pull request? Adds NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method for SessionCatalog. ### Why are the changes needed? This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to DataSourceV2SessionCatalogSuite Closes #27423 from brkyvz/isViewF. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-03 14:08:59 +08:00
Liang-Chi Hsieh	8eecc20b11	[SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" ## What changes were proposed in this pull request? This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table. The original `SHOW CREATE TABLE` now always shows Spark DDL. If given a Hive table, it tries to generate Spark DDL. For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception on unsupported serde configuration. Arguably, some Hive fileformat + row serde combinations might be mapped to a Spark data source, e.g., CSV. This is not included in this PR. To be conservative, it may not be supported. Hive serde properties are not saved to Spark DDL for now, because it may not be useful to keep Hive serde properties in a Spark table. ## How was this patch tested? Added test. Closes #24938 from viirya/SPARK-27946. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 19:55:25 -08:00
yi.wu	82b4f753a0	[SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource ### What changes were proposed in this pull request? This PR adds the `SparkSession.executeCommand` API for external data sources to execute an arbitrary command like ``` val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions") ``` Note that the command doesn't execute in Spark, but inside an external execution engine depending on the data source. It is eagerly executed when `executeCommand` is called, and the returned `DataFrame` contains the output of the command (if any). ### Why are the changes needed? This can be useful when a user wants to execute some commands outside of Spark. For example, executing a custom DDL/DML command for JDBC, creating an index for ElasticSearch, creating cores for Solr and so on (as HyukjinKwon suggested). Previously, users needed to use an option to achieve the goal, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it is more convenient for users to achieve the same goal. ### Does this PR introduce any user-facing change? Yes, new API from `SparkSession` and a new interface `ExternalCommandRunnableProvider`. ### How was this patch tested? Added a new test suite. Closes #27199 from Ngone51/dev-executeCommand. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 15:05:26 -08:00
Maxim Gekk	2d4b5eaee4	[SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double ### What changes were proposed in this pull request? - Replace `new Integer(0)` by a serializable instance in RDD.scala - Use `.valueOf()` instead of constructors of `java.lang.Integer` and `java.lang.Double` because the constructors have been deprecated, see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html ### Why are the changes needed? This fixes the following warnings: 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite Closes #27399 from MaxGekk/eliminate-warning-part4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-31 15:03:16 -06:00
Kousuke Saruta	18bc4e55ef	[SPARK-30684][WEBUI] Show the description of metrics for WholeStageCodegen in DAG viz ### What changes were proposed in this pull request? Added description for metrics shown in the WholeStageCodegen-node in DAG viz. This is before the change is applied. ![before-changed](https://user-images.githubusercontent.com/4736016/73469870-5cf16480-43ca-11ea-9a13-714083508a3b.png) And the following is after the change. ![after-fixing-layout](https://user-images.githubusercontent.com/4736016/73469364-983f6380-43c9-11ea-8b7e-ddab030d0270.png) For this change, I also modified the layout of DAG viz, because I noticed it's not enough to just add the description. The following is without changing the layout. ![layout-is-broken](https://user-images.githubusercontent.com/4736016/73470178-cffadb00-43ca-11ea-86d7-aed109b105e6.png) ### Why are the changes needed? Users can't understand what those metrics mean. ### Does this PR introduce any user-facing change? Yes. The layout is changed slightly. ### How was this patch tested? I confirmed the result of DAG viz with the following 3 operations. `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show` `sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output")` `sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output")` Closes #27405 from sarutak/sql-dag-metrics. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 11:58:52 -08:00
Wenchen Fan	33546d637d	Revert "[SPARK-30036][SQL] Fix: REPARTITION hint does not work with order by" This reverts commit `a2de20c0e6`.	2020-02-01 03:02:52 +08:00
Jungtaek Lim (HeartSaVioR)	5e0faf9a3d	[SPARK-29779][SPARK-30479][CORE][SQL][FOLLOWUP] Reflect review comments on post-hoc review ### What changes were proposed in this pull request? This PR reflects review comments on post-hoc review among PRs for SPARK-29779 (#27085), SPARK-30479 (#27164). The list of review comments this PR addresses are below: * https://github.com/apache/spark/pull/27085#discussion_r373304218 * https://github.com/apache/spark/pull/27164#discussion_r373300793 * https://github.com/apache/spark/pull/27164#discussion_r373301193 * https://github.com/apache/spark/pull/27164#discussion_r373301351 I also applied review comments to the CORE module (BasicEventFilterBuilder.scala) as well, as the review comments for SQL/core module (SQLEventFilterBuilder.scala) can be applied there as well. ### Why are the changes needed? There're post-hoc reviews on PRs for such issues, like links in above section. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27414 from HeartSaVioR/SPARK-28869-SPARK-29779-SPARK-30479-FOLLOWUP-posthoc-reviews. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-31 10:17:07 -08:00
Tathagata Das	481e5211d2	[SPARK-30657][SPARK-30658][SS] Fixed two bugs in streaming limits This PR solves two bugs related to streaming limits Bug 1 (SPARK-30658): Limit before a streaming aggregate (i.e. `df.limit(5).groupBy().count()`) in complete mode was not being planned as a stateful streaming limit. The planner rule planned a logical limit with a stateful streaming limit plan only if the query is in append mode. As a result, instead of allowing max 5 rows across batches, the planned streaming query was allowing 5 rows in every batch thus producing incorrect results. Solution: Change the planner rule to plan the logical limit with a streaming limit plan even when the query is in complete mode if the logical limit has no stateful operator before it. Bug 2 (SPARK-30657): `LocalLimitExec` does not consume the iterator of the child plan. So if there is a limit after a stateful operator like streaming dedup in append mode (e.g. `df.dropDuplicates().limit(5)`), the state changes of streaming duplicate may not be committed (most stateful ops commit state changes only after the generated iterator is fully consumed). Solution: Change the planner rule to always use a new `StreamingLocalLimitExec` which always fully consumes the iterator. This is the safest thing to do. However, this will introduce a performance regression as consuming the iterator is extra work. To minimize this performance impact, add an additional post-planner optimization rule to replace `StreamingLocalLimitExec` with `LocalLimitExec` when there is no stateful operator before the limit that could be affected by it. No Updated incorrect unit tests and added new ones Closes #27373 from tdas/SPARK-30657. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-01-31 09:27:34 -08:00
yi.wu	5ccbb38a71	[SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand All credit to Ngone51, Closes #27293. ### What changes were proposed in this pull request? This PR improves `AlterTableAddPartitionCommand` by: 1. adds an internal config for partitions batch size to avoid hard code 2. reuse `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing to improve code reuse ### Why are the changes needed? Improve code quality. ### Does this PR introduce any user-facing change? Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change. ### How was this patch tested? Pass Jenkins. Closes #27413 from xuanyuanking/SPARK-29938. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-01 01:03:00 +08:00
Burak Yavuz	290a528bff	[SPARK-30615][SQL] Introduce Analyzer rule for V2 AlterTable column change resolution ### What changes were proposed in this pull request? Adds an Analyzer rule to normalize the column names used in V2 AlterTable table changes. We need to handle all ColumnChange operations. We add an extra match statement to future-proof against new changes that may be added. This saves downstream consumers (e.g. catalogs) from having to deal with case sensitivity or check that columns exist, etc. We also fix the behavior for ALTER TABLE CHANGE COLUMN (Hive style syntax) for adding comments to complex data types. Currently, the data type needs to be provided as part of the Hive style syntax. This assumes that the data type has changed when it may not have and the user only wants to add a comment, which fails in CheckAnalysis. ### Why are the changes needed? Currently we do not handle case sensitivity correctly for ALTER TABLE ALTER COLUMN operations. ### Does this PR introduce any user-facing change? No, fixes a bug. ### How was this patch tested? Introduced v2CommandsCaseSensitivitySuite and added a test around HiveStyle Change columns to PlanResolutionSuite Closes #27350 from brkyvz/normalizeAlter. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 16:41:10 +08:00
herman	a5c7090ffa	[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation ### What changes were proposed in this pull request? This PR makes `SparkSession.emptyDataFrame` use an empty local relation instead of an empty RDD. This allows the optimizer to recognize this as an empty relation, and creates the opportunity to do some more aggressive optimizations. ### Why are the changes needed? It allows us to optimize empty dataframes better. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a test case to `DataFrameSuite`. Closes #27400 from hvanhovell/SPARK-30671. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-31 16:14:07 +09:00
Burak Yavuz	1cd19ad92d	[SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming ### What changes were proposed in this pull request? We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch or impose its own limit if the Trigger is Trigger.Once() for example. We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream. ### Why are the changes needed? Sources currently have no information around execution semantics such as whether the stream is being executed in Trigger.Once() mode. This interface will pass this information into the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch without this interface. ### Does this PR introduce any user-facing change? DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once(). ### How was this patch tested? Existing tests, as this API is mostly internal Closes #27380 from brkyvz/rateLimit. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-01-30 22:02:48 -08:00
sandeep katta	5f3ec6250f	[SPARK-30362][CORE] Update InputMetrics in DataSourceRDD ### What changes were proposed in this pull request? In the case of DS v2, InputMetrics are not updated. Before the fix: ![inputMetrics](https://user-images.githubusercontent.com/35216143/71501010-c216df00-288d-11ea-8522-fdd50b13eae1.png) After the fix, we can see that `Input Size / Records` is updated in the UI: ![image](https://user-images.githubusercontent.com/35216143/71501000-b88d7700-288d-11ea-92fe-a727b2b79908.png) ### Why are the changes needed? InputMetrics such as `bytesRead` and `recordsRead` should be updated. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT and also verified manually Closes #27021 from sandeep-katta/dsv2inputmetrics. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 14:01:32 +08:00
Wenchen Fan	9f42be25eb	[SPARK-29665][SQL] refine the TableProvider interface ### What changes were proposed in this pull request? Instead of having several overloads of `getTable` method in `TableProvider`, it's better to have 2 methods explicitly: `inferSchema` and `inferPartitioning`. With a single `getTable` method that takes everything: schema, partitioning and properties. This PR also adds a `supportsExternalMetadata` method in `TableProvider`, to indicate if the source support external table metadata. If this flag is false: 1. spark.read.schema... is disallowed and fails 2. when we support creating v2 tables in session catalog, spark only keeps table properties in the catalog. ### Why are the changes needed? API improvement. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26868 from cloud-fan/provider2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-31 13:37:43 +08:00
Jungtaek Lim (HeartSaVioR)	cbb714f67e	[SPARK-29438][SS] Use partition ID of StateStoreAwareZipPartitionsRDD for determining partition ID of state store in stream-stream join ### What changes were proposed in this pull request? Credit to uncleGen for discovering the problem and providing a simple reproducer as UT. The new UT in this patch is borrowed from #26156 and I'm retaining a commit from #26156 (except the part unnecessary for this patch) to properly give credit. This patch fixes the issue that the partition ID could be mis-assigned when the query contains UNION and a stream-stream join is placed on the right side. We assume the range of partition IDs to be `(0 ~ number of shuffle partitions - 1)` for stateful operators, but when we use a stream-stream join on the right side of UNION, the range of partition IDs of the tasks becomes `(number of partitions in left side, number of partitions in left side + number of shuffle partitions - 1)`, where `number of partitions in left side` can change in some cases (the new UT points out one of the cases). The root cause of the bug is that the stream-stream join picks the partition ID from TaskContext, which wouldn't be the same as the partition ID from the source if union is being used. Fortunately, we can pick the right partition ID from the source in StateStoreAwareZipPartitionsRDD - this patch leverages that partition ID. ### Why are the changes needed? This patch fixes the broken assumption about the partition range on stateful operators, as well as the issue reported in JIRA issue SPARK-29438. ### Does this PR introduce any user-facing change? Yes, if their query is using UNION and a stream-stream join is placed on the right side. They may encounter a problem reading state from the checkpoint and may need to discard the checkpoint to continue. ### How was this patch tested? Added UT which fails on current master branch, and passes with this patch. Closes #26162 from HeartSaVioR/SPARK-29438. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2020-01-30 20:21:43 -08:00
Wenchen Fan	e5f572af06	[SPARK-30680][SQL] ResolvedNamespace does not require a namespace catalog ### What changes were proposed in this pull request? Update `ResolvedNamespace` to accept catalog as `CatalogPlugin` not `SupportsNamespaces`. This is extracted from https://github.com/apache/spark/pull/27345 ### Why are the changes needed? Not all commands that need to resolve namespaces require a namespace catalog. For example, `SHOW TABLE` is implemented by `TableCatalog.listTables`, and has nothing to do with `SupportsNamespaces`. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #27403 from cloud-fan/ns. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-30 10:34:59 -08:00
zero323	b1f81f0072	[MINOR][SQL][DOCS] Fix typos in scaladoc strings of higher order functions ### What changes were proposed in this pull request? Fix following typos: - tranformation -> transformation - the boolean -> the Boolean - signle -> single ### Why are the changes needed? ### Does this PR introduce any user-facing change? No ### How was this patch tested? Scala linter. Closes #27382 from zero323/functions-typos. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-29 18:42:18 -06:00
uncleGen	7173786153	[SPARK-29543][SS][UI] Structured Streaming Web UI ### What changes were proposed in this pull request? This PR adds two pages to Web UI for Structured Streaming: - "/streamingquery": Streaming Query Page, providing some aggregate information for running/completed streaming queries. - "/streamingquery/statistics": Streaming Query Statistics Page, providing detailed information for streaming query, including `Input Rate`, `Process Rate`, `Input Rows`, `Batch Duration` and `Operation Duration` ![Screen Shot 2020-01-29 at 1 38 00 PM](https://user-images.githubusercontent.com/1000778/73399837-cd01cc80-429c-11ea-9d4b-1d200a41b8d5.png) ![Screen Shot 2020-01-29 at 1 39 16 PM](https://user-images.githubusercontent.com/1000778/73399838-cd01cc80-429c-11ea-8185-4e56db6866bd.png) ### Why are the changes needed? It helps users to better monitor Structured Streaming query. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - new added and existing UTs - manual test Closes #26201 from uncleGen/SPARK-29543. Lead-authored-by: uncleGen <hustyugm@gmail.com> Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Genmao Yu <hustyugm@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2020-01-29 13:43:51 -08:00
Takeshi Yamamuro	ec1fb6b4e1	[SPARK-30234][SQL][FOLLOWUP] Add `.enabled` in the suffix of the ADD FILE legacy option ### What changes were proposed in this pull request? This pr intends to rename `spark.sql.legacy.addDirectory.recursive` into `spark.sql.legacy.addDirectory.recursive.enabled`. ### Why are the changes needed? For consistent option names. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27372 from maropu/SPARK-30234-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-29 12:23:59 +09:00
Maxim Gekk	8aebc80e0e	[SPARK-30625][SQL] Support `escape` as third parameter of the `like` function ### What changes were proposed in this pull request? In the PR, I propose to transform the `Like` expression to `TernaryExpression`, and add third parameter `escape`. So, the `like` function will have feature parity with `LIKE ... ESCAPE` syntax supported by `187f3c1773`. ### Why are the changes needed? The `like` functions can be called with 2 or 3 parameters, and functionally equivalent to `LIKE` and `LIKE ... ESCAPE` SQL expressions. ### Does this PR introduce any user-facing change? Yes, before `like` fails with the exception: ```sql spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7 ``` After: ```sql spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_'); true ``` ### How was this patch tested? - Add new example for the `like` function which is checked by `SQLQuerySuite` - Run `RegexpExpressionsSuite` and `ExpressionParserSuite`. Closes #27355 from MaxGekk/like-3-args. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-27 11:19:32 -08:00
Jungtaek Lim (HeartSaVioR)	0436b3d3f8	[SPARK-30653][INFRA][SQL] EOL character enforcement for java/scala/xml/py/R files ### What changes were proposed in this pull request? This patch converts CR/LF into LF in 3 source files; most files use only LF. This patch also adds rules to enforce EOL as LF for all java, scala, xml, py, R files. ### Why are the changes needed? The majority of source code files use LF and only three files use CR/LF. While an IDE may hide the difference, it can still produce unnecessary diffs if a file is modified with an editor that doesn't handle line endings automatically. ### Does this PR introduce any user-facing change? No ### How was this patch tested? ``` grep -IUrl --color "^M" . | grep "\.java\|\.scala\|\.xml\|\.py\|\.R" | grep -v "/target/" | grep -v "/build/" | grep -v "/dist/" | grep -v "dependency-reduced-pom.xml" | grep -v ".pyc" ``` (Please note you'll need to type CTRL+V -> CTRL+M in bash shell to get `^M` because it represents a carriage return (CR), not a combination of `^` and `M`.) Before the patch, the result is: ``` ./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala ./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala ``` and after the patch, the result is empty. Git also shows a WARNING message if the EOL of any source file of the given types is modified to CR/LF, like below: ``` warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala. The file will have its original line endings in your working directory. ``` Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-27 10:20:51 -08:00
Yuchen Huo	d0800fc8e2	[SPARK-30314] Add identifier and catalog information to DataSourceV2Relation ### What changes were proposed in this pull request? Add identifier and catalog information in DataSourceV2Relation so it would be possible to do richer checks in the checkAnalysis step. ### Why are the changes needed? In data source v2, table implementations are all customized so we may not be able to get the resolved identifier from the tables themselves. Therefore we encode the table and catalog information in DSV2Relation so no external changes are needed to make sure this information is available. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests in the following suites: CatalogManagerSuite.scala CatalogV2UtilSuite.scala SupportsCatalogOptionsSuite.scala PlanResolutionSuite.scala Closes #26957 from yuchenhuo/SPARK-30314. Authored-by: Yuchen Huo <yuchen.huo@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-01-26 12:59:24 -08:00
Xiao Li	48f647882a	[SPARK-30644][SQL][TEST] Remove query index from the golden files of SQLQueryTestSuite ### What changes were proposed in this pull request? This PR removes the query index from the golden files of SQLQueryTestSuite. ### Why are the changes needed? Because the SQLQueryTestSuite's golden files have a query index for each query, removing any query statement [except the last one] generates many unneeded differences. This makes code review harder, and the number of changed lines is misleading. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27361 from gatorsmile/removeIndexNum. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 23:17:36 -08:00
Xiao Li	d69ed9afdf	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `1d20d13149`. Closes #27351 from gatorsmile/revertSPARK25496. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 21:34:12 -08:00
Liang-Chi Hsieh	a0e63b61e7	[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project ### What changes were proposed in this pull request? This patch proposes to prune unnecessary nested fields from a Generate that has no Project on top of it. ### Why are the changes needed? In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still be read in Generate if there is no Project on top of it. We should prune them too. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #26978 from viirya/SPARK-29721. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-24 22:17:28 -08:00
Gengliang Wang	ed44926117	[SPARK-30627][SQL] Disable all the V2 file sources by default ### What changes were proposed in this pull request? Disable all the V2 file sources in Spark 3.0 by default. ### Why are the changes needed? There are still some missing parts in the file source V2 framework: 1. It doesn't support reporting file scan metrics such as "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. This requires another patch in the data source V2 framework. Tracked by [SPARK-30362](https://issues.apache.org/jira/browse/SPARK-30362) 2. It doesn't support partition pruning with subqueries (including dynamic partition pruning) for now. Tracked by [SPARK-30628](https://issues.apache.org/jira/browse/SPARK-30628) As we are going to code freeze on Jan 31st, this PR proposes to disable all the V2 file sources in Spark 3.0 by default. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #27348 from gengliangwang/disableFileSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-23 21:42:43 -08:00
Xiao Li	ddf83159a8	[SPARK-28962][SQL][FOLLOW-UP] Add the parameter description for the Scala function API filter ### What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/25666 that adds the description and example for the Scala function API `filter`. ### Why are the changes needed? It is hard to tell which parameter is the index column. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27336 from gatorsmile/spark28962. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-23 16:23:16 -08:00
Terry Kim	4847f7380d	[SPARK-30298][SQL] Respect aliases in output partitioning of projects and aggregates ### What changes were proposed in this pull request? Currently, in the following scenario, bucket join is not utilized: ```scala val df = (0 until 20).map(i => (i, i)).toDF("i", "j").as("df") df.write.format("parquet").bucketBy(8, "i").saveAsTable("t") sql("CREATE VIEW v AS SELECT * FROM t") sql("SELECT * FROM t a JOIN v b ON a.i = b.i").explain ``` ``` == Physical Plan == (4) SortMergeJoin [i#13], [i#15], Inner :- (1) Sort [i#13 ASC NULLS FIRST], false, 0 : +- (1) Project [i#13, j#14] : +- (1) Filter isnotnull(i#13) : +- (1) ColumnarToRow : +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8 +- (3) Sort [i#15 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(i#15, 8), true, [id=#64] <----- Exchange node introduced +- (2) Project [i#13 AS i#15, j#14 AS j#16] +- (2) Filter isnotnull(i#13) +- (2) ColumnarToRow +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8 ``` Notice that `Exchange` is present. This is because `Project` introduces aliases, and `outputPartitioning` and `requiredChildDistribution` do not take aliases into account when `EnsureRequirements` considers a bucket join. This PR makes them respect aliases so that bucket join can be used in this scenario. ### Why are the changes needed? This allows bucket join to be utilized in the above example. ### Does this PR introduce any user-facing change? 
Yes, now with the fix, the `explain` output is as follows: ``` == Physical Plan == (3) SortMergeJoin [i#13], [i#15], Inner :- (1) Sort [i#13 ASC NULLS FIRST], false, 0 : +- (1) Project [i#13, j#14] : +- (1) Filter isnotnull(i#13) : +- (1) ColumnarToRow : +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:.., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8 +- (2) Sort [i#15 ASC NULLS FIRST], false, 0 +- (2) Project [i#13 AS i#15, j#14 AS j#16] +- (2) Filter isnotnull(i#13) +- (2) ColumnarToRow +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:.., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8 ``` Note that the `Exchange` is no longer present. ### How was this patch tested? Closes #26943 from imback82/bucket_alias. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-01-24 08:23:07 +09:00
Kent Yao	3228d723a4	[SPARK-30603][SQL] Move RESERVED_PROPERTIES from SupportsNamespaces and TableCatalog to CatalogV2Util ### What changes were proposed in this pull request? In this PR, I propose to move the `RESERVED_PROPERTIES`s from `SupportsNamespaces` and `TableCatalog` to `CatalogV2Util`, which keeps `RESERVED_PROPERTIES` safe for internal usage only. ### Why are the changes needed? The `RESERVED_PROPERTIES` should not be changed by subclasses. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27318 from yaooqinn/SPARK-30603. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-23 13:13:25 -08:00

6863 commits