Commit graph

27132 commits

Author SHA1 Message Date
Dongjoon Hyun e7995c2ddc
[SPARK-31633][BUILD] Upgrade SLF4J from 1.7.16 to 1.7.30
### What changes were proposed in this pull request?

This PR aims to upgrade SLF4J from 1.7.16 to 1.7.30.

### Why are the changes needed?

SLF4J 1.7.23+ is required for `slf4j-log4j12` with the MDC feature to run under Java 9. This upgrade also brings in the latest bug fixes.
- http://www.slf4j.org/news.html

> When running under Java 9, log4j version 1.2.x is unable to correctly parse the "java.version" system property. Assuming an incorrect Java version, it proceeded to disable its MDC functionality. The slf4j-log4j12 module shipping in this release fixes the issue by tweaking MDC internals by reflection, allowing log4j to run under Java 9. See also SLF4J-393.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #28446 from dongjoon-hyun/SPARK-31633.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 08:14:12 -07:00
Burak Yavuz 02a319d7e1 [SPARK-31624] Fix SHOW TBLPROPERTIES for V2 tables that leverage the session catalog
### What changes were proposed in this pull request?

SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties; if they cannot be set or controlled by the user, they shouldn't be displayed.

### Why are the changes needed?

Without the fix, SHOW TBLPROPERTIES shows incorrect table properties, i.e. only what exists in the Hive MetaStore, for V2 tables that may have table properties outside of the MetaStore.

### Does this PR introduce _any_ user-facing change?

Fixes a bug

### How was this patch tested?

Regression test

Closes #28434 from brkyvz/ddlCommands.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-04 12:22:29 +00:00
Kazuaki Ishizaki 35fcc8d5c5 [MINOR][DOCS] Fix typo in documents
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in `project/MimaExcludes.scala`

### Why are the changes needed?
Better readability of documents

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed

Closes #28447 from kiszk/typo_20200504.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 16:53:50 +09:00
Wenchen Fan f72220b8ab [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?

Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.

### Why are the changes needed?

The Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades the performance a lot, as it breaks vectorization, even if the datetime values don't need to be rebased (which is very likely, as dates before 1582 are rare).
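As a rough illustration of the approach (a minimal, self-contained sketch with hypothetical names, not the actual reader code), the rebase decision can be hoisted out of the per-value loop so the common no-rebase path stays a tight, vectorization-friendly copy:

```scala
// Sketch only: hoist the rebase decision out of the inner loop so the common
// (no-rebase) path is a plain copy the JIT can vectorize. Names are hypothetical.
object VectorizedRebaseSketch {
  def readDates(src: Array[Int], dst: Array[Int],
                needRebase: Boolean, rebaseDays: Int => Int): Unit = {
    var i = 0
    if (!needRebase) {
      // Hot path: most files only contain dates after 1582, so no branch per value.
      while (i < src.length) { dst(i) = src(i); i += 1 }
    } else {
      // Rare path: only taken when the file may contain pre-Gregorian dates.
      while (i < src.length) { dst(i) = rebaseDays(src(i)); i += 1 }
    }
  }
}
```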

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
[info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
[info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
[info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
[info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X
```

After this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
[info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
[info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
[info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
[info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X
```

The date type is 30% faster if the values don't need to be rebased, and 20% faster if they do.
The timestamp type is 60% faster if the values don't need to be rebased, with no difference if they do.

Closes #28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 15:30:10 +09:00
Yuming Wang 7ef0b69a92 [SPARK-31626][SQL] Port HIVE-10415: hive.start.cleanup.scratchdir configuration is not taking effect
### What changes were proposed in this pull request?
This PR ports [HIVE-10415](https://issues.apache.org/jira/browse/HIVE-10415): the `hive.start.cleanup.scratchdir` configuration is not taking effect.

### Why are the changes needed?

I encountered this issue:
![image](https://user-images.githubusercontent.com/5399861/80869375-aeafd080-8cd2-11ea-8573-93ec4b422be1.png)
I'd like to make `hive.start.cleanup.scratchdir` effective to reduce this issue.
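A hedged usage sketch (the property name comes from HIVE-10415; forwarding it through Spark's `spark.hadoop.*` passthrough is an assumption for illustration, not part of this patch):

```scala
import org.apache.spark.sql.SparkSession

// Assumption for illustration: pass the Hive setting through spark.hadoop.* so it
// reaches the Hive configuration; with this port, the scratch dir cleanup at
// startup then actually takes effect.
val spark = SparkSession.builder()
  .appName("scratchdir-cleanup-demo")
  .config("spark.hadoop.hive.start.cleanup.scratchdir", "true")
  .enableHiveSupport()
  .getOrCreate()
```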

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test

Closes #28436 from wangyum/SPARK-31626.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 14:59:33 +09:00
Tianshi Zhu a222644e1d [SPARK-31267][SQL] Flaky test: WholeStageCodegenSparkSubmitSuite.Generated code on driver should not embed platform-specific constant
### What changes were proposed in this pull request?

Allow customized timeouts for `runSparkSubmit`, which will make flaky tests more likely to pass by using a larger timeout value.

I was able to reproduce the test failure on my laptop, which took 1.5 - 2 minutes to finish the test. After increasing the timeout, the test now can pass locally.

### Why are the changes needed?

This allows slow tests to use a larger timeout, so they are more likely to succeed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The test was able to pass on my local env after the change.

Closes #28438 from tianshizz/SPARK-31267.

Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 14:50:38 +09:00
Max Gekk 2fb85f6b68 [SPARK-31527][SQL][TESTS][FOLLOWUP] Fix the number of rows in DateTimeBenchmark
### What changes were proposed in this pull request?
- Changed the number of rows in benchmark cases from 3 to the actual number `N`.
- Regenerated benchmark results in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
The changes are needed to have:
- Correct benchmark results
- Base line for other perf improvements that can be checked in the same environment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the benchmark and checking its output.

Closes #28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-04 09:39:50 +09:00
Michael Chirico f53d8c63e8 [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical
### What changes were proposed in this pull request?

Internal usages like `{stop,warning,message}({paste,paste0,sprintf}` and `{stop,warning,message}(some_literal_string_as_variable` have been removed and replaced as appropriate.

### Why are the changes needed?

CRAN policy recommends against using such constructions to build error messages, in particular because it makes the process of creating portable error messages for the package more onerous.

### Does this PR introduce any user-facing change?

There may be some small grammatical changes visible in error messaging.

### How was this patch tested?

Not done

Closes #28365 from MichaelChirico/r-stop-paste.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-03 12:40:20 +09:00
Max Gekk 13dddee9a8 [MINOR][SQL][TESTS] Disable UI in SQL benchmarks by default
### What changes were proposed in this pull request?
Set `spark.ui.enabled` to `false` in `SqlBasedBenchmark.getSparkSession`. This disables UI in all SQL benchmarks by default.
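A minimal sketch of what the shared benchmark session looks like after the change (the builder details here are assumptions; only the `spark.ui.enabled` setting is the actual change):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of SqlBasedBenchmark.getSparkSession with the UI disabled so the UI's
// overhead does not distort the Relative/Stdev columns in benchmark output.
def getSparkSession: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("sql-benchmark")
  .config("spark.ui.enabled", "false")
  .getOrCreate()
```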

### Why are the changes needed?
UI overhead lowers numbers in the `Relative` column and impacts `Stdev` in benchmark results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Checked by running `DateTimeRebaseBenchmark`.

Closes #28432 from MaxGekk/ui-off-in-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-02 17:54:36 +09:00
Huaxin Gao 75da05038b [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
### What changes were proposed in this pull request?
Remove two leading spaces from sql tables.

### Why are the changes needed?

Follow the format of other references such as https://docs.snowflake.com/en/sql-reference/constructs/join.html, https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_10002.htm, https://www.postgresql.org/docs/10/sql-select.html.

### Does this PR introduce any user-facing change?

before
```
SELECT * FROM  test;
  +-+
  ...
  +-+
```
after
```
SELECT * FROM  test;
+-+
...
+-+
```

### How was this patch tested?
Manually build and check

Closes #28348 from huaxingao/sql-format.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-01 10:11:43 -07:00
Qianyang Yu 348fd53214 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue
### What changes were proposed in this pull request?

Add FValue example for ml.stat.FValueTest in python/java/scala

### Why are the changes needed?

Improve ML example

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

manually run the example

Closes #28400 from kevinyu98/spark-26111-fvalue-examples.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-01 09:16:08 -05:00
Pablo Langa 4fecc20f6e [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements
### What changes were proposed in this pull request?

The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType, this is not the case.

Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```

We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and equality doesn't compare their contents, e.g. `Array(1, 2, 3) == Array(1, 2, 3)` => `false`. As a result, duplicates are not removed from the HashSet.

The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then converting them back to the original type.
This transformation is only applied when the element type is BinaryType.
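As a standalone sketch of that idea (not the actual CollectSet code), assuming the buffer holds `Array[Byte]` values:

```scala
// Arrays compare by reference, so convert each Array[Byte] to an immutable Seq
// (value-based equality), deduplicate via a Set, and convert back.
def dedupBinary(buffer: Iterable[Array[Byte]]): Seq[Array[Byte]] =
  buffer.map(_.toSeq).toSet.map((s: Seq[Byte]) => s.toArray).toSeq

// dedupBinary(Seq("cat".getBytes, "cat".getBytes, "dog".getBytes)).size == 2
```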

### Why are the changes needed?
Fixes the bug explained above.

### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.

### How was this patch tested?
Unit testing

Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-01 22:09:04 +09:00
Xingbo Jiang b7cde42b04 [SPARK-31619][CORE] Rename config "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"
### What changes were proposed in this pull request?
The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect if "spark.dynamicAllocation.shuffleTracking.enabled" is true, so we should re-namespace that configuration so that it's nested under the "shuffleTracking" one.

### How was this patch tested?
Covered by current existing test cases.

Closes #28426 from jiangxb1987/confName.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 11:46:17 +09:00
Yuanjian Li aec8b69435 [SPARK-28424][TESTS][FOLLOW-UP] Add test cases for all interval units
### What changes were proposed in this pull request?
Add test cases covering all interval units: MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, YEAR.

### Why are the changes needed?
For test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test only.

Closes #28418 from xuanyuanking/SPARK-28424.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 10:32:37 +09:00
Weichen Xu ee1de66fe4 [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group
### What changes were proposed in this pull request?
I add a new API to the PySpark RDD class:

`def collectWithJobGroup(self, groupId, description, interruptOnCancel=False)`

This API does the same thing as `rdd.collect`, but it can specify the job group to use for the collect.
The purpose of adding this API is that, if we use:

```
sc.setJobGroup("group-id...")
rdd.collect()
```
the `setJobGroup` API in PySpark won't work correctly. This is related to a bug discussed in
https://issues.apache.org/jira/browse/SPARK-31549

Note:

This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step to migrate to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0.

- `PYSPARK_PIN_THREAD` is unstable at this moment and affects whole PySpark applications.
- It is impossible to make it a runtime configuration as it has to be set before the JVM is launched.
- There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle this after Spark 3.0 for stability reasons.

Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I will target deprecating this API in Spark 3.1.

### Why are the changes needed?
Fix bug.

### Does this PR introduce any user-facing change?
A developer API in PySpark: `pyspark.RDD.collectWithJobGroup`

### How was this patch tested?
Unit test.

Closes #28395 from WeichenXu123/collect_with_job_group.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 10:08:16 +09:00
Michael Chirico c00fe5ef3e
[MINOR][R] small tidying of sh scripts for R
### What changes were proposed in this pull request?

Some tidying of `sh` scripts in `R/`

### Why are the changes needed?

Not strictly needed, but the `'devtools' %in% installed.packages()` line in particular is "improper" / probably slow

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not

Closes #28419 from MichaelChirico/r-scripts-cleanup.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-30 16:58:05 -07:00
Huaxin Gao 2410a45703 [SPARK-31612][SQL][DOCS] SQL Reference clean up
### What changes were proposed in this pull request?
SQL Reference cleanup

### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce _any_ user-facing change?
updated sql-ref-syntax-qry.html

before
<img width="1100" alt="Screen Shot 2020-04-29 at 11 08 25 PM" src="https://user-images.githubusercontent.com/13592258/80677799-70b27280-8a6e-11ea-8e3f-a768f29d0377.png">

after
<img width="1100" alt="Screen Shot 2020-04-29 at 11 05 55 PM" src="https://user-images.githubusercontent.com/13592258/80677803-74de9000-8a6e-11ea-880c-aa05c53254a6.png">

### How was this patch tested?
Manually build and check

Closes #28417 from huaxingao/cleanup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-01 06:30:35 +09:00
Xiao Li b5ecc41c73 [SPARK-28806][DOCS][FOLLOW-UP] Remove unneeded HTML from the MD file
### What changes were proposed in this pull request?
This PR is to clean up the markdown file in SHOW COLUMNS page.

- remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
- use ```sql fences for highlighting the SQL syntax.

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
**Before**

![Screen Shot 2020-04-29 at 5 20 11 PM](https://user-images.githubusercontent.com/11567269/80661963-fa4d4a80-8a44-11ea-9dea-c43cda6de010.png)

**After**

![Screen Shot 2020-04-29 at 6 03 50 PM](https://user-images.githubusercontent.com/11567269/80661940-f15c7900-8a44-11ea-9943-a83e8d8618fb.png)

Closes #28414 from gatorsmile/cleanupShowColumns.

Lead-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-30 09:34:56 -07:00
Max Gekk c09cfb9808 [SPARK-31557][SQL] Fix timestamps rebasing in legacy parsers
### What changes were proposed in this pull request?
In the PR, I propose to fix two legacy timestamp formatters, `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter`, to perform micros rebasing when parsing/formatting from/to strings.
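A self-contained sketch of where the rebasing goes (this is not the formatters' actual code; the rebase functions are passed in as stand-ins for Spark's `RebaseDateTime` utilities):

```scala
import java.text.SimpleDateFormat
import java.util.Date

// The legacy SimpleDateFormat path works in the hybrid Julian+Gregorian calendar,
// so micros are rebased on the way in (parse) and on the way out (format).
class LegacyFormatterSketch(
    pattern: String,
    julianToGregorianMicros: Long => Long,
    gregorianToJulianMicros: Long => Long) {

  private val sdf = new SimpleDateFormat(pattern)
  private val MicrosPerMilli = 1000L

  def parse(s: String): Long = {
    val julianMicros = sdf.parse(s).getTime * MicrosPerMilli
    julianToGregorianMicros(julianMicros) // rebasing step added by this fix
  }

  def format(gregorianMicros: Long): String = {
    val julianMicros = gregorianToJulianMicros(gregorianMicros) // rebasing step added by this fix
    sdf.format(new Date(julianMicros / MicrosPerMilli))
  }
}
```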

### Why are the changes needed?
Legacy timestamp formatters operate on the hybrid calendar (Julian + Gregorian), so the input micros should be rebased to have the same date-time fields as in the Proleptic Gregorian calendar used by Spark SQL; see SPARK-26651.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `TimestampFormatterSuite`

Closes #28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:45:32 +00:00
Wenchen Fan 636119c54b [SPARK-31607][SQL] Improve the perf of CTESubstitution
### What changes were proposed in this pull request?

In `CTESubstitution`, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.

### Why are the changes needed?

Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

local perf test
```
scala> :pa
// Entering paste mode (ctrl-D to finish)

def test(i: Int): Unit = 1.to(i).foreach { _ =>
  spark.sql("""
    with
    t1 as (select 1),
    t2 as (select 1),
    t3 as (select 1),
    t4 as (select 1),
    t5 as (select 1),
    t6 as (select 1),
    t7 as (select 1),
    t8 as (select 1),
    t9 as (select 1)
    select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}

// Exiting paste mode, now interpreting.

test: (i: Int)Unit

scala> test(10000)

scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)
```

The result before this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            3328796344 / 3924576425                         10000 / 20000
```
The result after this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            1503085936 / 2091992092                         10000 / 20000
```
About 2 times faster.

Closes #28407 from cloud-fan/cte.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:11:16 +00:00
Yuanjian Li 7195a18bf2 [SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests
### What changes were proposed in this pull request?

- Rephrase the API doc for `Column.as`
- Simplify the UTs

### Why are the changes needed?
Address comments in https://github.com/apache/spark/pull/28326

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT added.

Closes #28390 from xuanyuanking/SPARK-27340-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 06:24:00 +00:00
gatorsmile f56c6630fb [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table
### What changes were proposed in this pull request?

This PR is to clean up the markdown file in datetime-pattern page.

- Replace HTML table by MD table

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
**Before**
![Screen Shot 2020-04-29 at 7 59 10 PM](https://user-images.githubusercontent.com/11567269/80668093-c9294600-8a55-11ea-9dca-d558203298f8.png)

**After**

![Screen Shot 2020-04-29 at 8 13 38 PM](https://user-images.githubusercontent.com/11567269/80668146-f1b14000-8a55-11ea-8d47-8dc8a0378271.png)

Closes #28415 from gatorsmile/cleanupUDFPage.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 05:47:42 +00:00
beliefer 1d1bb79bc6 [SPARK-31372][SQL][TEST] Display expression schema for double check
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers always forget to use this improvement.
We need to add stronger guarantees so that the aliases output by built-in functions are correct.
This PR extracts the SQL from the examples of expressions and outputs the SQL and its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.

### Why are the changes needed?
Ensure that the output alias is correct

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28194 from beliefer/check-expression-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:58:04 +00:00
Dongjoon Hyun 85dad37f69 [SPARK-31601][K8S] Fix spark.kubernetes.executor.podNamePrefix to work
### What changes were proposed in this pull request?

This PR aims to fix `spark.kubernetes.executor.podNamePrefix` to work.

### Why are the changes needed?

Currently, the configuration is broken like the following.
```
bin/spark-submit \
--master k8s://$K8S_MASTER \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
-c spark.kubernetes.container.image=spark:pr \
-c spark.kubernetes.driver.pod.name=mypod \
-c spark.kubernetes.executor.podNamePrefix=mypod \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar
```

**BEFORE SPARK-31601**
```
pod/mypod                              1/1     Running     0          9s
pod/spark-pi-7469dd71c499fafb-exec-1   1/1     Running     0          4s
pod/spark-pi-7469dd71c499fafb-exec-2   1/1     Running     0          4s
```

**AFTER SPARK-31601**
```
pod/mypod                              1/1     Running     0          8s
pod/mypod-exec-1                       1/1     Running     0          3s
pod/mypod-exec-2                       1/1     Running     0          3s
```

### Does this PR introduce any user-facing change?

Yes. This is a bug fix. The conf will work as described in the documentation.

### How was this patch tested?

Pass the Jenkins and run the above command manually.

Closes #28401 from dongjoon-hyun/SPARK-31601.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
2020-04-30 09:15:12 +05:30
Kent Yao 9241f8282f [SPARK-31586][SQL][FOLLOWUP] Restore SQL string for datetime - interval operations
### What changes were proposed in this pull request?

Because of ebc8fa50d0 and beec8d535f, the SQL output strings for date/timestamp - interval operations have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`

This PR restores the previous behavior by adding a `RuntimeReplaceable` implementation for both operations, to get their pretty SQL strings back.
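A conceptual, self-contained sketch of the idea (this deliberately does not use catalyst's real API; names are hypothetical): a replaceable wrapper keeps the user-facing `-` in its SQL string while evaluation is delegated to the equivalent TimeAdd form.

```scala
// Minimal stand-ins for expressions; only meant to show the "pretty SQL vs. actual
// evaluation" split that a RuntimeReplaceable-style wrapper provides.
trait ExprSketch { def sql: String; def eval(): Long }

case class TimeAddSketch(tsMicros: Long, intervalMicros: Long) extends ExprSketch {
  def sql: String = s"$tsMicros + ($intervalMicros)"
  def eval(): Long = tsMicros + intervalMicros
}

case class DatetimeSubSketch(tsMicros: Long, intervalMicros: Long) extends ExprSketch {
  private val replacement = TimeAddSketch(tsMicros, -intervalMicros)
  def sql: String = s"$tsMicros - $intervalMicros" // restored user-facing string
  def eval(): Long = replacement.eval()            // work is still done by TimeAdd(l, -r)
}
```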

### Why are the changes needed?

restore the SQL string for datetime operations

### Does this PR introduce any user-facing change?

No, we are only restoring the previous behavior here.
### How was this patch tested?

added unit tests

Closes #28402 from yaooqinn/SPARK-31586-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:31:29 +00:00
Max Gekk 91648654da [SPARK-31553][SQL][TESTS][FOLLOWUP] Tests for collection elem types of isInCollection
### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that pass the `In.checkInputDataTypes()` check.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.

### Why are the changes needed?
To prevent regressions like introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`

Closes #28405 from MaxGekk/test-isInCollection.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:20:10 +00:00
HyukjinKwon f0c79ad88a [MINOR][INFRA] Add a guide to clarify release/unreleased Spark versions of user-facing change in the Github PR template
### What changes were proposed in this pull request?

This PR proposes to add a guide to clarify the Spark version when describing "Does this PR introduce any user-facing change?".

### Why are the changes needed?

It seems confusing what to write when the user-facing changes happen within unreleased branches.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in Github and it renders fine as intended.

Closes #28403 from HyukjinKwon/minor-more-guide.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-30 09:22:07 +09:00
DB Tsai ecfee82fda [SPARK-31582][YARN] Being able to not populate Hadoop classpath
### What changes were proposed in this pull request?
We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath`, to not populate the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.

### Why are the changes needed?
Spark Yarn client populates extra Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster.

However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, it can cause jar conflicts because the Spark distribution can contain a different version of the Hadoop jars.

One case we have is when a user uses an Apache Spark distribution with its own embedded Hadoop and submits a job to a Cloudera or Hortonworks Yarn cluster; because of two different, incompatible sets of Hadoop jars in the classpath, it runs into errors.

Not populating the Hadoop classpath from the clusters addresses this issue.
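For example, a user with a `with-hadoop` distribution might opt out like this (a sketch; the same setting can equally be passed with `--conf` at submit time):

```scala
import org.apache.spark.SparkConf

// Skip the cluster's yarn.application.classpath / mapreduce.application.classpath
// entries and rely on the Hadoop jars bundled in the Spark distribution.
val conf = new SparkConf()
  .setAppName("yarn-classpath-example")
  .set("spark.yarn.populateHadoopClasspath", "false")
```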

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
A UT is added, but it is very hard to add a new integration test since this requires using different, incompatible versions of Hadoop.

We also manually tested this PR, and we are able to submit a Spark job using Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating CDH classpath.

Closes #28376 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-04-29 21:10:40 +00:00
Michael Chirico 226301a6bc [SPARK-29339][R][FOLLOW-UP] Remove requireNamespace1 workaround for arrow
### What changes were proposed in this pull request?

`requireNamespace1` was used to get `SparkR` on CRAN while Suggesting `arrow`, at a time when `arrow` was not yet available on CRAN.

### Why are the changes needed?

Now `arrow` is on CRAN, we can properly use `requireNamespace` without triggering CRAN failures.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

AppVeyor will test, and the CRAN check runs in the Jenkins build.

Closes #28387 from MichaelChirico/r-require-arrow.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-29 18:42:32 +09:00
Max Gekk 73eac7565d [SPARK-31557][SQL][TESTS][FOLLOWUP] Check rebasing in all legacy formatters
### What changes were proposed in this pull request?
- Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345
- Check dates rebasing in legacy parsers for only one direction, either days -> string or string -> days.

### Why are the changes needed?
Round-trip tests can hide issues in dates rebasing. For example, if we remove rebasing from the legacy parsers (from `parse()` and `format()`), the tests will still pass.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `DateFormatterSuite`.

Closes #28398 from MaxGekk/test-rebasing-in-legacy-date-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:19:34 +00:00
Takeshi Yamamuro 97f2c03d3b
[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema
### What changes were proposed in this pull request?

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

### Why are the changes needed?

To make output schema deterministic.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-29 00:14:50 -07:00
Terry Kim 36803031e8 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views
### What changes were proposed in this pull request?

This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`.

### Why are the changes needed?

It's a bug.

### Does this PR introduce any user-facing change?

Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key                              |value        |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2            |
|view.query.out.col.0             |c1           |
|view.query.out.numCols           |1            |
|p2                               |v2           |
|view.catalogAndNamespace.part.0  |spark_catalog|
|p1                               |v1           |
|view.catalogAndNamespace.part.1  |default      |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

### How was this patch tested?

Added tests.

Closes #28375 from imback82/show_tblproperties_followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:06:45 +00:00
Kent Yao ea525fe8c0 [SPARK-31597][SQL] extracting day from intervals should be interval.days + days in interval.microsecond
### What changes were proposed in this pull request?

Following a suggestion from cloud-fan in https://github.com/apache/spark/pull/28222#issuecomment-620586933

I checked with both Presto and PostgreSQL; one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. They both add the excess days in the interval's time part to the total days when extracting the day from interval values.

```sql

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)

Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)

Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

```

```sql

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13

```

```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```

In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.

### Why are the changes needed?

This satisfies both the ANSI standard and common use cases in modern SQL platforms.

### Does this PR introduce any user-facing change?

No, this behavior is new in 3.0.
### How was this patch tested?

add more uts

Closes #28396 from yaooqinn/SPARK-31597.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:56:33 +00:00
angerszhu 6bc8d84130 [SPARK-29492][SQL] Reset HiveSession's SessionState conf's ClassLoader when sync mode
### What changes were proposed in this pull request?
When running SQL in the Spark Thrift Server, each session's Thrift Server methods are called in one thread, but when running a query statement, we have two modes:
 1. sync
 2. async
 5a482e7209/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (L205-L238)

In sync mode, we just submit the query in the current session's corresponding thread and wait for Spark to run the query and return the result; the query method always waits for the query to return.
In async mode, in SparkExecuteStatementOperation, we submit the query to a backend thread pool and update the operation state. After the query is submitted to the backend thread pool, the ExecuteStatement method returns an OperationHandle to the client side, and the client side continuously requests the operation status. After the backend thread runs the SQL and returns, it updates the corresponding operation status; when the client sees that the operation status is a final status, it gets the error or starts fetching the result of this operation.

When we use pyhive to connect to the SparkThriftServer, it runs statements in sync mode.
When we query data of a Hive table, it checks the SerDe class in HiveTableScanExec#addColumnMetadataToConf

5a482e7209/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala (L123)

```
  public Class<? extends Deserializer> getDeserializerClass() {
    try {
      return Class.forName(this.getSerdeClassName(), true, Utilities.getSessionSpecifiedClassLoader());
    } catch (ClassNotFoundException var2) {
      throw new RuntimeException(var2);
    }
  }

 public static ClassLoader getSessionSpecifiedClassLoader() {
    SessionState state = SessionState.get();
    if (state != null && state.getConf() != null) {
      ClassLoader sessionCL = state.getConf().getClassLoader();
      if (sessionCL != null) {
        if (LOG.isTraceEnabled()) {
          LOG.trace("Use session specified class loader");
        }

        return sessionCL;
      } else {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Session specified class loader not found, use thread based class loader");
        }

        return JavaUtils.getClassLoader();
      }
    } else {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Hive Conf not found or Session not initiated, use thread based class loader instead");
      }

      return JavaUtils.getClassLoader();
    }
  }
```
Since we run statements in sync mode, it uses the HiveSession's SessionState and its conf's classLoader, and then the error happens.
```
Current operation state RUNNING_STATE,
java.lang.RuntimeException: java.lang.ClassNotFoundException:
xxx.xxx.xxxJsonSerDe
  at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader$lzycompute(HiveTableScanExec.scala:110)
  at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader(HiveTableScanExec.scala:105)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
	  at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
```
We should reset it when we start running SQL in sync mode.
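A hedged sketch of what such a reset could look like (where exactly this hook lives in SparkExecuteStatementOperation is an assumption here, not part of the description above):

```scala
import org.apache.hadoop.hive.ql.session.SessionState

// Before running a statement in sync mode, point the Hive SessionState's conf at a
// classloader that can see the session's added jars, so SerDe lookups via
// Utilities.getSessionSpecifiedClassLoader succeed.
def resetSessionConfClassLoader(loaderWithAddedJars: ClassLoader): Unit = {
  val state = SessionState.get()
  if (state != null && state.getConf != null) {
    state.getConf.setClassLoader(loaderWithAddedJars)
  }
}
```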
### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
UT

Closes #26141 from AngersZhuuuu/add_jar_in_sync_mode.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:48:46 +00:00
Max Gekk 86761861c2 [SPARK-31563][SQL][FOLLOWUP] Create literals directly from Catalyst's internal value in InSet.sql
### What changes were proposed in this pull request?
In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor.
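A rough sketch of the simplified `sql` logic (written here as a standalone function; `childSql`, `dataType`, and `hset` stand in for InSet's members):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

// Render each internal value through the default Literal(value, dataType) constructor,
// avoiding the conversion back to external Scala values.
def inSetSql(childSql: String, dataType: DataType, hset: Set[Any]): String = {
  val listSQL = hset.toSeq.map(v => Literal(v, dataType).sql).mkString(", ")
  s"($childSql IN ($listSQL))"
}
```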

### Why are the changes needed?
This simplifies code and avoids unnecessary conversions to external types.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`.

Closes #28399 from MaxGekk/fix-InSet-sql-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:44:22 +00:00
Kent Yao 295d866969 [SPARK-31596][SQL][DOCS] Generate SQL Configurations from hive module to configuration doc
### What changes were proposed in this pull request?

This PR adds the `-Phive` profile to the pre-build phase to build the hive module into the dev classpath.
Then it uses reflection on the HiveUtils object to dump all configurations in the class.

### Why are the changes needed?

supply SQL configurations from hive module to doc

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Passing Jenkins and verified locally:

![image](https://user-images.githubusercontent.com/8326978/80492333-6fae1200-8996-11ea-99fd-595ee18c67e5.png)

Closes #28394 from yaooqinn/SPARK-31596.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-29 15:34:45 +09:00
Dongjoon Hyun 62be65efe4 [SPARK-31567][R][TESTS] Update AppVeyor Rtools to 4.0.0
### What changes were proposed in this pull request?

This aims to upgrade Rtools first to prepare R 4.0.0 in AppVeyor for Apache Spark 3.1.0.

### Why are the changes needed?

R 4.0.0 was released on April 24th, 2020. It officially uses Rtools 4.0.0.
- https://cran.r-project.org/doc/manuals/r-release/NEWS.html
- https://stat.ethz.ch/pipermail/r-announce/2020/000653.html

### Does this PR introduce any user-facing change?

No. (This PR aims to test Rtools 4.0.0 in AppVeyor environment.)

### How was this patch tested?

See the AppVeyor result.

Closes #28358 from dongjoon-hyun/SPARK-31567.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-29 13:10:43 +09:00
Baohe Zhang 3808014a2f [SPARK-31584][WEBUI] Fix NullPointerException when parsing event log with InMemoryStore
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/27716 introduced a parent index for InMemoryStore. When the method `deleteParentIndex(Object key)` in InMemoryStore.java is called and the key is not contained in `NaturalKeys v`, a java.lang.NullPointerException is thrown. This patch fixes the issue by updating the if condition.

### Why are the changes needed?
Fixed a minor bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a unit test for deleteParentIndex.

Closes #28378 from baohe-zhang/SPARK-31584.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-04-28 17:27:13 -07:00
Huaxin Gao d34cb59fb3 [SPARK-31556][SQL][DOCS] Document LIKE clause in SQL Reference
### What changes were proposed in this pull request?
Document LIKE clause in SQL Reference

### Why are the changes needed?
To make SQL Reference complete

### Does this PR introduce any user-facing change?
Yes

<img width="1050" alt="Screen Shot 2020-04-25 at 5 49 57 PM" src="https://user-images.githubusercontent.com/13592258/80294346-5babab80-871d-11ea-8ac9-51bbab0aca88.png">

<img width="1050" alt="Screen Shot 2020-04-25 at 5 50 24 PM" src="https://user-images.githubusercontent.com/13592258/80294347-5ea69c00-871d-11ea-8c51-7a90ee20f7da.png">

<img width="1050" alt="Screen Shot 2020-04-25 at 5 50 42 PM" src="https://user-images.githubusercontent.com/13592258/80294351-61a18c80-871d-11ea-9e75-e3345d2f52f5.png">

### How was this patch tested?
Manually build and check

Closes #28332 from huaxingao/where_clause.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-29 09:17:23 +09:00
Huaxin Gao dcc09022f1 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
### What changes were proposed in this pull request?
Add a paragraph for scalar function in sql getting started

### Why are the changes needed?
To make 3.0 doc complete.

### Does this PR introduce any user-facing change?
before:
<img width="870" alt="Screen Shot 2020-04-21 at 10 11 12 PM" src="https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png">

after:
<img width="865" alt="Screen Shot 2020-04-22 at 11 49 59 PM" src="https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png">

<img width="1033" alt="Screen Shot 2020-04-23 at 6 22 53 PM" src="https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png">

### How was this patch tested?

Closes #28290 from huaxingao/scalar.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-28 11:17:45 -05:00
Kent Yao 54996be4d2 [SPARK-31527][SQL][TESTS][FOLLOWUP] Add a benchmark test for datetime add/subtract interval operations
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/28310, the operation of date +/- interval(m, d, 0) has been improved a lot.

According to the benchmark results, about 75% of the time cost is reduced because there is no casting of date to timestamp back and forth.

In this PR, we add a benchmark for these operations, with timestamp +/- interval operations as accessories.

### Why are the changes needed?

Performance test coverage, since these operations are missing in the DateTimeBenchmark.

### Does this PR introduce any user-facing change?

No, just test

### How was this patch tested?

regenerated benchmark results

Closes #28369 from yaooqinn/SPARK-31527-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 15:39:28 +00:00
Max Gekk b7cabc80e6 [SPARK-31553][SQL] Revert "[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection"
### What changes were proposed in this pull request?
This reverts commit 5631a96367.

Closes #28328

### Why are the changes needed?
The PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 14:10:50 +00:00
Kent Yao beec8d535f [SPARK-31586][SQL] Replace expression TimeSub(l, r) with TimeAdd(l -r)
### What changes were proposed in this pull request?

The implementation of TimeSub for subtracting an interval from a timestamp is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent.

Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239

Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer needed, so they are removed in this PR since they are touched here anyway.
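A conceptual, self-contained sketch of the equivalence being used (not catalyst code; the interval type here is a simplified stand-in):

```scala
// TimeSub(l, r) == TimeAdd(l, -r): negate the interval instead of keeping a
// separate subtraction expression.
final case class IntervalSketch(months: Int, days: Int, microseconds: Long) {
  def unary_- : IntervalSketch = IntervalSketch(-months, -days, -microseconds)
}

def timeAdd(tsMicros: Long, i: IntervalSketch): Long =
  tsMicros + i.microseconds // month/day carrying omitted in this sketch

def timeSub(tsMicros: Long, i: IntervalSketch): Long =
  timeAdd(tsMicros, -i)
```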

### Why are the changes needed?

remove redundant and useless code for easy maintenance

### Does this PR introduce any user-facing change?

Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)`.
### How was this patch tested?

modified existing unit tests.

Closes #28381 from yaooqinn/SPARK-31586.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 14:01:07 +00:00
Michael Chirico c011502ee3 [SPARK-31573][R] Apply fixed=TRUE as appropriate to regex usage in R
### What changes were proposed in this pull request?

For regex functions in base R (`gsub`, `grep`, `grepl`, `strsplit`, `gregexpr`), supplying the `fixed=TRUE` option will be more performant.

### Why are the changes needed?

This is a minor fix for performance

### Does this PR introduce any user-facing change?

No (although some internal code was applying fixed-as-regex in some cases that could technically have been over-broad and caught unintended patterns)

### How was this patch tested?

Not

Closes #28367 from MichaelChirico/r-regex-fixed.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 17:24:21 +09:00
Yuanjian Li 6ed2dfbba1 [SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result
### What changes were proposed in this pull request?
Add a new logical node, AggregateWithHaving; the parser creates this plan for HAVING, and the analyzer resolves it to Filter(..., Aggregate(...)).
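A hedged sketch of the shape of that node (field names are assumptions drawn from the description above, not necessarily the merged code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, UnaryNode}

// The parser wraps the HAVING predicate around its Aggregate child; the analyzer
// later rewrites it into Filter(havingCondition, Aggregate(...)) once the aggregate
// functions and grouping columns are resolved.
case class AggregateWithHaving(
    havingCondition: Expression,
    child: Aggregate) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
}
```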

### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```

### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.

### How was this patch tested?
New UT added.

Closes #28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 08:11:41 +00:00
jiake 079b3623c8 [SPARK-31524][SQL] Add metric to the split task number for skew optimization
### What changes were proposed in this pull request?
This is a followup of [#28022](https://github.com/apache/spark/pull/28022), to add the metric info of split task number for skewed optimization.
With this PR, we can see the number of splits for the skewed partitions as follows:
![image](https://user-images.githubusercontent.com/11972570/80294583-ff886c00-879c-11ea-813c-2db302f99f04.png)

### Why are the changes needed?
make UI more friendly

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing ut

Closes #28109 from JkSelf/addSplitNumer.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 07:21:00 +00:00
Dongjoon Hyun 2d3e9601b5 [SPARK-31589][INFRA] Use r-lib/actions/setup-r in GitHub Action
### What changes were proposed in this pull request?

This PR aims to use `r-lib/actions/setup-r` because it's more stable and maintained by 3rd party.

### Why are the changes needed?

This will recover the current outage. In addition, this will be more robust in the future.
As of now, this is tested via https://github.com/dongjoon-hyun/spark/pull/17 .

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the GitHub Actions, especially `Linter R` and `Generate Documents`.

Closes #28382 from dongjoon-hyun/SPARK-31589.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 13:22:43 +09:00
Michael Chirico 410fa91321 [SPARK-31578][R] Vectorize schema validation for arrow in types.R
### What changes were proposed in this pull request?

Repeated `sapply` avoided in internal `checkSchemaInArrow`

### Why are the changes needed?

The current implementation is doubly inefficient:

 1. Repeatedly doing the same (95%) `sapply` loop
 2. Doing scalar `==` on a vector (`==` should be done over the whole vector for efficiency)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By my trusty friend the CI bots

Closes #28372 from MichaelChirico/vectorize-types.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 11:03:51 +09:00
Michael Chirico a68d98cf4f [SPARK-31568][R] Replaces paste(sep="") to paste0
### What changes were proposed in this pull request?

All instances of `paste(..., sep = "")` in the code are replaced with `paste0`, which is more performant

### Why are the changes needed?

Performance & consistency (`paste0` is already used extensively in the R package)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

None

Closes #28374 from MichaelChirico/r-paste0.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-28 10:58:48 +09:00
Dongjoon Hyun 79eaaaf6da
[SPARK-31580][BUILD] Upgrade Apache ORC to 1.5.10
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.5.10.

### Why are the changes needed?

Apache ORC 1.5.10 is a maintenance release with the following patches.

- [ORC-621](https://issues.apache.org/jira/browse/ORC-621) Need reader fix for ORC-569
- [ORC-616](https://issues.apache.org/jira/browse/ORC-616) In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte
- [ORC-613](https://issues.apache.org/jira/browse/ORC-613) OrcMapredRecordReader mis-reuse struct object when actual children schema differs
- [ORC-610](https://issues.apache.org/jira/browse/ORC-610) Updated Copyright year in the NOTICE file

The following is release note.
- https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12346912

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing ORC tests and a newly added test case.

- The first commit is already tested in `hive-2.3` profile with both native ORC implementation and Hive 2.3 ORC implementation. (https://github.com/apache/spark/pull/28373#issuecomment-620265114)
- The latest run is about to make the test case disable in `hive-1.2` profile which doesn't use Apache ORC.
  - `hive-1.2`: https://github.com/apache/spark/pull/28373#issuecomment-620325906

Closes #28373 from dongjoon-hyun/SPARK-ORC-1.5.10.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 18:56:30 -07:00