Commit graph

25615 commits

Author SHA1 Message Date
shivusondur eee45f83c6 [SPARK-28809][DOC][SQL] Document SHOW TABLE in SQL Reference
### What changes were proposed in this pull request?
Added documentation for the SHOW TABLE EXTENDED SQL command to the SQL Reference.

### Why are the changes needed?
For user reference.

### Does this PR introduce any user-facing change?
Yes, it adds reference documentation for the SHOW TABLE EXTENDED SQL command.

### How was this patch tested?
Verified via the attached screenshots.
<details>
<summary>Attached screenshots</summary>

![image](https://user-images.githubusercontent.com/7912929/68142029-b4f80680-ff54-11e9-99a0-f39f2dac09e4.png)
![image](https://user-images.githubusercontent.com/7912929/64019738-95f08900-cb4d-11e9-9769-ee2be926fdc1.png)
![image](https://user-images.githubusercontent.com/7912929/64019775-ab65b300-cb4d-11e9-9e7e-140616af7790.png)
![image](https://user-images.githubusercontent.com/7912929/67963910-65000380-fc25-11e9-9cd0-8ee43bf206b1.png)
</details>

Closes #25632 from shivusondur/jiraSHOWTABLE.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-04 11:58:41 -06:00
shivusondur f29a979e42 [SPARK-28798][DOC][SQL] Document DROP TABLE/VIEW statement in SQL Reference
### What changes were proposed in this pull request?
Added documentation for the DROP TABLE and DROP VIEW SQL commands.

### Why are the changes needed?
To document DROP TABLE and DROP VIEW for spark-sql users.

### Does this PR introduce any user-facing change?
Yes, it adds reference documentation for DROP TABLE and DROP VIEW.

### How was this patch tested?
<details>
<summary>Attached screenshots</summary>

DROP TABLE

![image](https://user-images.githubusercontent.com/7912929/67884038-2443b400-fb6b-11e9-9773-b21dae398789.png)
![image](https://user-images.githubusercontent.com/7912929/67797387-aa96c200-faa7-11e9-90d4-fa8b7c6a4ec7.png)

DROP VIEW
![image](https://user-images.githubusercontent.com/7912929/67797463-c306dc80-faa7-11e9-96ec-e2f2e89d0db8.png)
![image](https://user-images.githubusercontent.com/7912929/67797648-1ed16580-faa8-11e9-9d32-19106326e3d9.png)

</details>

Closes #25533 from shivusondur/jiraUSEDB.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-04 11:52:19 -06:00
angerszhu e524a3a223 [SPARK-29742][BUILD] Update checkstyle plugin's check dir scope
### What changes were proposed in this pull request?
The current checkstyle configuration does not cover all source folders.
To support multiple Hive versions, the Hive code is split across several folders.
Those folders should be checked too.

### Why are the changes needed?
Fix build bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
NO

Closes #26385 from AngersZhuuuu/SPARK-29742.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-04 09:08:47 -08:00
Kent Yao 44b8fbcc58 [SPARK-29663][SQL] Support sum with interval type values
### What changes were proposed in this pull request?

Make the sum aggregate function support interval values.

### Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark

### Does this PR introduce any user-facing change?

Yes, sum can now aggregate interval values (see the sketch below).
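
A minimal sketch of the new behavior, assuming a SparkSession named `spark` (the data and the exact rendering of the interval result are illustrative only):

```python
# Aggregate interval values with sum(); before this change the analyzer rejected it.
spark.sql("""
    SELECT sum(i) AS total
    FROM VALUES (interval 1 day), (interval 2 days) AS t(i)
""").show(truncate=False)
```
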
### How was this patch tested?

Added unit tests.

Closes #26325 from yaooqinn/SPARK-29663.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-05 01:05:07 +08:00
Terry Kim d4ea211187 [SPARK-29678][SQL] ALTER TABLE (ADD PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add AlterTableAddPartitionStatement and make ALTER TABLE ... ADD PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t ADD PARTITION (id=1) // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running ALTER TABLE ... ADD PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests

Closes #26369 from imback82/spark-29678.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 23:56:47 +08:00
shahid 9023c69db8 [SPARK-29590][WEBUI] JDBC/ODBC tab in the spark UI support hide tables, to make it consistent with other tabs
### What changes were proposed in this pull request?

Currently, the JDBC/ODBC tab in the Web UI doesn't support hiding tables. Other tabs in the Web UI, such as Jobs, Stages, and SQL, support hiding tables (see https://github.com/apache/spark/pull/22592).
This PR adds table-hiding support to the JDBC/ODBC tab as well.

### Why are the changes needed?
Tables in the Spark UI should support hide and show features when they contain many records. Sometimes you do not care about the records of a particular table and just want to see the contents of the next one, but you have to scroll for a long time to reach it.

### Does this PR introduce any user-facing change?
No, apart from the new ability to hide tables.

### How was this patch tested?
Manually tested
 ![Screenshot 2019-11-01 at 12 10 05 PM](https://user-images.githubusercontent.com/23054875/68007364-61aa5d80-fca1-11e9-841e-c5a7382871fa.png)
![Screenshot 2019-11-01 at 12 10 43 PM](https://user-images.githubusercontent.com/23054875/68007355-5a834f80-fca1-11e9-844a-f4ba1a333db7.png)

Closes #26353 from shahidki31/hideTable.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-04 09:44:10 -06:00
Kent Yao 8cf76f8d61 [SPARK-29285][SHUFFLE] Temporary shuffle files should be able to handle disk failures
### What changes were proposed in this pull request?

The `getFile` method in `DiskBlockManager` may return a file under an existing subdirectory. But when a disk failure occurs on that subdirectory, the file becomes inaccessible.
A FileNotFoundException like the following then usually tears down the entire task, which is quite heavy.
```
java.io.FileNotFoundException: /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```
This change pre-touches the temporary file to check whether the parent directory is available. If not, we try another, possibly healthy, disk until we reach the maximum number of attempts (a minimal sketch of the idea follows).
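
A minimal sketch of the pre-touch idea, written in Python purely for illustration (the actual change lives in the Scala shuffle/`DiskBlockManager` code; the directory list and attempt limit here are hypothetical):

```python
import os

def create_temp_file(candidate_dirs, name, max_attempts=3):
    """Pre-touch a temp file to verify its parent directory is healthy;
    on failure, retry on another candidate directory."""
    last_error = None
    for _, directory in zip(range(max_attempts), candidate_dirs):
        path = os.path.join(directory, name)
        try:
            os.makedirs(directory, exist_ok=True)
            with open(path, "a"):   # the "pre-touch": fails fast if the disk/dir is bad
                pass
            return path
        except OSError as e:
            last_error = e          # try the next possibly healthy directory
    raise last_error
```
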
### Why are the changes needed?

Re-running the whole task is much heavier than picking another healthy disk to write the temporary results.

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Added unit tests.

Closes #25962 from yaooqinn/SPARK-29285.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 18:21:57 +08:00
Maxim Gekk 50538600ec [SPARK-29736][TESTS] Improve stability of tests for special datetime values
### What changes were proposed in this pull request?
- Retry the tests for special date-time values on failure. The tests can potentially fail when reference values are taken before midnight and the test code resolves the special values after midnight. Retrying guarantees that the tests run within the same day.
- Simplify getting the current timestamp via `Instant.now()`. This avoids any issues of converting the current local datetime to an instant. For example, the same local time can map to two instants when clocks are turned back one hour on a daylight saving date.
- Extract common code to SQLHelper
- Set the tested zoneId to the session time zone in `DateTimeUtilsSuite`.

### Why are the changes needed?
To make the tests more stable.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `Date`/`TimestampFormatterSuite` and `DateTimeUtilsSuite`.

Closes #26380 from MaxGekk/retry-on-fail.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 16:59:32 +08:00
Dongjoon Hyun c55265cd2d [SPARK-29739][PYSPARK][TESTS] Use java instead of cc in test_pipe_functions
### What changes were proposed in this pull request?

This PR aims to replace `cc` with `java` in `test_pipe_functions` of `test_rdd.py`.

### Why are the changes needed?

Currently, `test_rdd.py` assumes a `cc` installation during the `rdd.pipe` tests.
This requires us to install `gcc` for Python testing. If we use `java` instead, we keep the same test coverage without installing anything extra, because `java` is already available in the PySpark test environment.

This will be helpful when we build a dockerized parallel testing environment.
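
For context, a hedged sketch of the kind of `rdd.pipe` usage these tests exercise, assuming a running SparkContext `sc` (the actual commands used in `test_pipe_functions` may differ):

```python
# Pipe partition elements through an external command; `cat` simply echoes them back.
result = sc.parallelize(["1", "2", "3"]).pipe("cat").collect()
print(result)  # ['1', '2', '3']
```
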

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the existing PySpark tests.

Closes #26383 from dongjoon-hyun/SPARK-29739.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 23:03:38 -08:00
Liang-Chi Hsieh afb055ba19 [SPARK-29353][SQL] Fallback AlterTableAlterColumnStatement to v1 AlterTableChangeColumnCommand
### What changes were proposed in this pull request?

If the resolved table is v1 table, AlterTableAlterColumnStatement fallbacks to v1 AlterTableChangeColumnCommand.

### Why are the changes needed?

To make the catalog/table lookup logic consistent.

### Does this PR introduce any user-facing change?

Yes. An ALTER TABLE ALTER COLUMN command previously failed on v1 tables. After this change, it falls back to the v1 AlterTableChangeColumnCommand.

### How was this patch tested?

Unit test.

Closes #26354 from viirya/SPARK-29353.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 15:02:27 +08:00
Maxim Gekk fb60c2a170 [SPARK-29671][SQL] Simplify string representation of intervals
### What changes were proposed in this pull request?
In this PR, I propose to change `CalendarInterval.toString`:
- to skip the `week` unit
- to render `milliseconds` and `microseconds` as the fractional part of the `seconds` unit.

### Why are the changes needed?
To improve readability.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- By `CalendarIntervalSuite` and `IntervalUtilsSuite`
- `literals.sql`, `datetime.sql` and `interval.sql`

Closes #26367 from MaxGekk/interval-to-string-format.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 22:56:59 -08:00
wangguangxin.cn 83c39d15e1 [SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation
### What changes were proposed in this pull request?
This is somewhat a complement of https://github.com/apache/spark/pull/21853.
A `Sort` without a `Limit` in a `Join` subquery is useless; the same applies under `GroupBy` when the aggregation function is order-irrelevant, such as `count` or `sum`.
This PR removes this kind of `Sort` operator in the SQL optimizer.

### Why are the changes needed?
For example,  `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`.
`select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b` is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.

Removing the useless `Sort` operator can improve performance (a quick way to observe this is sketched below).
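
A hedged way to observe the effect, assuming a SparkSession `spark` and an existing table `test1` (whether the `Sort` node disappears depends on this optimizer rule being enabled):

```python
# With the rule applied, the optimized logical plan should contain no Sort node,
# since the ORDER BY in the subquery cannot change the count.
spark.sql("SELECT count(1) FROM (SELECT a FROM test1 ORDER BY a)").explain(True)
```
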

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a new UT, `RemoveSortInSubquerySuite.scala`.

Closes #26011 from WangGuangxin/remove_sorts.

Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-04 14:52:19 +08:00
Kent Yao 5ba17d09ac [SPARK-29722][SQL] Non reversed keywords should be able to be used in high order functions
### What changes were proposed in this pull request?

Support using non-reserved keywords in higher-order functions.

### Why are the changes needed?

These keywords are non-reserved and should therefore be usable in higher-order functions.

### Does this PR introduce any user-facing change?

Yes, all non-reserved keywords can now be used correctly in higher-order functions (see the sketch below).
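
A hedged example of a query that should now parse, assuming a SparkSession `spark` (`day` is used here as a non-reserved keyword serving as the lambda parameter name):

```python
# The non-reserved keyword `day` is used as a lambda parameter in a higher-order function.
spark.sql("SELECT transform(array(1, 2, 3), day -> day + 1)").show()
```
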

### How was this patch tested?

Added unit tests.

Closes #26366 from yaooqinn/SPARK-29722.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-04 14:52:14 +09:00
Liang-Chi Hsieh e7263242bd Revert "[SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh"
### What changes were proposed in this pull request?

This reverts commit 91d990162f.

### Why are the changes needed?

CRAN check is pretty important for R package, we should enable it.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #26381 from viirya/revert-SPARK-24152.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 15:14:58 -08:00
Sean Owen 19b8c71436 [SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+
### What changes were proposed in this pull request?

Update the version of dropwizard metrics that Spark uses for metrics to 4.1.x, from 3.2.x.

### Why are the changes needed?

This helps JDK 9+ support; see, for example, https://github.com/dropwizard/metrics/pull/1236

### Does this PR introduce any user-facing change?

No, although downstream users with custom metrics may be affected.

### How was this patch tested?

Existing tests.

Closes #26332 from srowen/SPARK-29674.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 15:13:06 -08:00
Maxim Gekk 80a89873b2 [SPARK-29733][TESTS] Fix wrong order of parameters passed to assertEquals
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.

### Why are the changes needed?
A wrong order of assert parameters is confusing when the assert fails and the parameters have a special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual   :interval 5 months 5 days 102 hours
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing tests.

Closes #26377 from MaxGekk/fix-order-in-assert-equals.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 11:21:28 -08:00
Dongjoon Hyun 4bcfe5033c [SPARK-29731][INFRA] Use public JIRA REST API to read-only access
### What changes were proposed in this pull request?

This PR replaces `jira_client` API call for read-only access with public Apache JIRA REST API invocation.

### Why are the changes needed?

This will reduce the number of authenticated API invocations. I hope this will reduce the chance of CAPTCHA challenges from the Apache JIRA site.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual.
```
$ echo 26375 > .github-jira-max
$ dev/github_jira_sync.py
Read largest PR number previously seen: 26375
Retrieved 100 JIRA PR's from Github
1 PR's remain after excluding visted ones
Checking issue SPARK-29731
Writing largest PR number seen: 26376
Build PR dictionary
SPARK-29731
26376
Set 26376 with labels "PROJECT INFRA"
```

Closes #26376 from dongjoon-hyun/SPARK-29731.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 11:17:53 -08:00
Dongjoon Hyun 1ac6bd9f79 [SPARK-29729][BUILD] Upgrade ASM to 7.2
### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 7.2.
- https://issues.apache.org/jira/browse/XBEAN-322 (Upgrade to ASM 7.2)
- https://asm.ow2.io/versions.html

### Why are the changes needed?

This will bring the following patches.
- 317875: Infinite loop when parsing invalid method descriptor
- 317873: Add support for RET instruction in AdviceAdapter
- 317872: Throw an exception if visitFrame used incorrectly
- add support for Java 14

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing UTs.

Closes #26373 from dongjoon-hyun/SPARK-29729.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-03 10:42:38 -08:00
Dongjoon Hyun 91d990162f [SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh
### What changes were proposed in this pull request?

This PR aims to remove `check-cran` from `run-tests.sh`.
We had better add an independent Jenkins job to run `check-cran`.

### Why are the changes needed?

CRAN instability has been a blocker for our daily dev process.
The following simple check causes consecutive failures in 4 of 9 Jenkins
jobs + PR builder.

```
* checking CRAN incoming feasibility ...Error in
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```

- spark-branch-2.4-test-sbt-hadoop-2.6
- spark-branch-2.4-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-2.7
- spark-master-test-sbt-hadoop-3.2
- PRBuilder

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Currently, PR builder is failing due to the above issue. This PR should pass the Jenkins.

Closes #26375 from dongjoon-hyun/SPARK-24152.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-02 21:37:40 -07:00
Eric Meisel be022d9aee [SPARK-29677][DSTREAMS] amazon-kinesis-client 1.12.0
### What changes were proposed in this pull request?
Upgrading the amazon-kinesis-client dependency to 1.12.0.

### Why are the changes needed?
The current amazon-kinesis-client version is 1.8.10. This version depends on the use of `describeStream`, which has a hard limit on an AWS account (10 reqs / second). Versions 1.9.0 and up leverage `listShards`, which has no such limit. For large customers, this can be a major problem.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #26333 from etspaceman/kclUpgrade.

Authored-by: Eric Meisel <eric.steven.meisel@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-02 16:42:49 -05:00
Wenchen Fan 31ae446e9c [SPARK-29623][SQL] do not allow multiple unit TO unit statements in interval literal syntax
### What changes were proposed in this pull request?

Re-arrange the parser rules to make it clear that multiple unit TO unit clauses, like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH`, are not allowed.

### Why are the changes needed?

It is clearly an accident that we supported such a weird syntax in the past. It is not supported by any other DBs, and I can't think of any use case for it. Also, no test covers this syntax in the current codebase.

### Does this PR introduce any user-facing change?

Yes, and a migration guide item is added.

### How was this patch tested?

new tests.

Closes #26285 from cloud-fan/syntax.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-02 21:35:56 +08:00
dengziming 28ccd31aee [SPARK-29611][WEBUI] Sort Kafka metadata by the number of messages
### What changes were proposed in this pull request?

Sort metadata by the number of messages in each Kafka partition

### Why are the changes needed?

Helps find data skew problems.

### Does this PR introduce any user-facing change?

Yes, adds a count column to the metadata and sorts by it.
![image](https://user-images.githubusercontent.com/26023240/67617886-63e06800-f81a-11e9-8718-be3a0100952e.png)

If you set the `minPartitions` configuration with Structured Streaming, which doesn't have the Streaming page, the code changes in `DirectKafkaInputDStream` won't affect the Web UI page, as shown in the following image

![image](https://user-images.githubusercontent.com/26023240/68020762-79520800-fcda-11e9-96cd-f0c64a36f505.png)

### How was this patch tested?

Manual test

Closes #26266 from dengziming/feature_ui_optimize.

Lead-authored-by: dengziming <dengziming@growingio.com>
Co-authored-by: dengziming <swzmdeng@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-01 22:46:34 -07:00
Matt Stillwell 1e1b7302f4 [MINOR][PYSPARK][DOCS] Fix typo in example documentation
### What changes were proposed in this pull request?

I propose changing the example code in the documentation to call the proper function.
For example, under the `foreachBatch` function, the example code mistakenly called the `foreach()` function (a corrected usage sketch follows).
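
For reference, a hedged sketch of the corrected pattern, assuming a streaming DataFrame `df` (the processing inside the function is hypothetical; the point is that `foreachBatch` takes a function of a batch DataFrame and a batch id):

```python
def write_batch(batch_df, batch_id):
    # Handle each micro-batch DataFrame; here we only print its size.
    print(batch_id, batch_df.count())

# The foreachBatch example should call foreachBatch, not foreach.
query = df.writeStream.foreachBatch(write_batch).start()
```
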

### Why are the changes needed?

I suppose it could confuse some people, and it is a typo

### Does this PR introduce any user-facing change?

No, there is no "meaningful" code being changed, only the documentation.

### How was this patch tested?

I made the change on a fork and it still worked

Closes #26299 from mstill3/patch-1.

Authored-by: Matt Stillwell <18670089+mstill3@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-01 11:55:29 -07:00
root1 39fff9258a [SPARK-29452][WEBUI] Improve Storage tab tooltip
### What changes were proposed in this pull request?
Added tooltips for each column in the Storage tab of the Web UI.

### Why are the changes needed?
Tooltips help users understand the columns of the Storage tab.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Manually Tested.

Closes #26226 from iRakson/storage_tooltip.

Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-01 08:27:34 -05:00
DylanGuedes f53be0a05e [SPARK-29109][SQL][TESTS] Port window.sql (Part 3)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql#L564-L911

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?

To ensure compatibility with PostgreSQL.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins, and compare with the PostgreSQL results.

Closes #26274 from DylanGuedes/spark-29109.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-11-01 22:05:40 +09:00
Huaxin Gao 14337f68e3 [SPARK-29643][SQL] ALTER TABLE/VIEW (DROP PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableDropPartitionStatement and make ALTER TABLE/VIEW ... DROP PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t DROP PARTITION (id=1)  // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE/VIEW ... DROP PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26303 from huaxingao/spark-29643.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-01 18:29:04 +08:00
Liu,Linhong a4382f7fe1 [SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds
### What changes were proposed in this pull request?
The current CalendarInterval has 2 fields: months and microseconds. This PR changes it
to 3 fields: months, days and microseconds, because one logical day interval may
have a different number of microseconds (daylight saving).

### Why are the changes needed?
One logical day interval may have a different number of microseconds (daylight saving).
For example, in the PST time zone, there are 25 hours from 2019-11-02 12:00:00 to
2019-11-03 12:00:00 (a hedged illustration follows below).
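
A hedged illustration of why a separate days field matters, assuming a SparkSession `spark` (exact results depend on the session time zone and daylight saving rules):

```python
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
# With a dedicated days field, "interval 1 day" can follow calendar days across the DST
# change, while "interval 24 hours" is a fixed duration, so the two results may differ.
spark.sql("""
    SELECT timestamp'2019-11-02 12:00:00' + interval 1 day    AS plus_one_day,
           timestamp'2019-11-02 12:00:00' + interval 24 hours AS plus_24_hours
""").show(truncate=False)
```
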

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Unit tests and newly added test cases.

Closes #26134 from LinhongLiu/calendarinterval.

Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-01 18:12:33 +08:00
zhengruifeng 8a4378c6f0 [SPARK-29686][ML] LinearSVC should persist instances if needed
### What changes were proposed in this pull request?
Persist the input dataset if needed (i.e. when it is not already cached).

### Why are the changes needed?
Training on a non-cached dataset hurts performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #26344 from zhengruifeng/linear_svc_cache.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-11-01 12:07:07 +08:00
Huaxin Gao ae7450d1c9 [SPARK-29676][SQL] ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add AlterTableRenamePartitionStatement and make ALTER TABLE ... RENAME TO PARTITION go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2) // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RENAME TO PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26350 from huaxingao/spark_29676.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-31 20:28:31 -07:00
Terry Kim 3175f4bf1b [SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala
### What changes were proposed in this pull request?

This PR changes the behavior of `Column.getItem` to call `Column.getItem` on Scala side instead of `Column.apply`.

### Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:
```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```
In Scala:
```Scala
val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//  at org.apache.spark.sql.Column.getItem(Column.scala:856)
//  ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show
```

### Does this PR introduce any user-facing change?

Yes. If the user wants to pass a `Column` object to `getItem`, they now need to use the indexing operator to achieve the previous behavior.

```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id'))].show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```

### How was this patch tested?

Existing tests.

Closes #26351 from imback82/spark-29664.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-01 12:25:48 +09:00
ulysses 8a8ac00271 [SPARK-29687][SQL] Fix JDBC metrics counter data type
### What changes were proposed in this pull request?

Fix JDBC metrics counter data type. Related pull request [26109](https://github.com/apache/spark/pull/26109).

### Why are the changes needed?

Avoid overflow.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26346 from ulysses-you/SPARK-29687.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-11-01 08:35:00 +09:00
ulysses 888cc4601a [SPARK-29675][SQL] Add exception when isolationLevel is Illegal
### What changes were proposed in this pull request?

Currently, when we use the JDBC API and set an illegal isolationLevel option, Spark throws a `scala.MatchError`, which is not friendly to the user. We should throw an IllegalArgumentException instead (see the sketch below).
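
A hedged sketch of where the option is used, assuming a DataFrame `df` (the JDBC URL and table name are hypothetical; after this change an unrecognized value raises an IllegalArgumentException instead of a `scala.MatchError`):

```python
# Valid values include NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host/db")   # hypothetical URL
   .option("dbtable", "target_table")            # hypothetical table
   .option("isolationLevel", "BOGUS_LEVEL")      # illegal value -> clear error after this PR
   .save())
```
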

### Why are the changes needed?

Make the exception user friendly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a UT.

Closes #26334 from ulysses-you/SPARK-29675.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-31 09:02:13 -07:00
Jungtaek Lim (HeartSaVioR) 121510cb7b [SPARK-29604][SQL][FOLLOWUP][test-hadoop3.2] Let SparkSQLEnvSuite to be run in dedicated JVM
### What changes were proposed in this pull request?

This patch addresses a CI build issue in the sbt Hadoop-3.2 Jenkins job: SparkSQLEnvSuite is failing. The likely reason for the test failure is that the test checks listeners registered on the active SparkSession, which can be interfered with by other test suites running concurrently. If we isolate the test suite, the problem should be gone.

### Why are the changes needed?

CI builds for "spark-master-test-sbt-hadoop-3.2" are failing.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I've run the single test suite with below command and it passed 3 times sequentially:

```
build/sbt "hive-thriftserver/testOnly *.SparkSQLEnvSuite" -Phadoop-3.2 -Phive-thriftserver
```

so we expect the test suite will pass if we isolate the test suite.

Closes #26342 from HeartSaVioR/SPARK-29604-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-31 08:34:39 -07:00
Wenchen Fan faf220aad9 [SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown
Bring back https://github.com/apache/spark/pull/25955

### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes #26341 from cloud-fan/back.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-31 08:25:32 -07:00
jiake cd39cd4bce [SPARK-28560][SQL][FOLLOWUP] support the build side to local shuffle reader as far as possible in BroadcastHashJoin
### What changes were proposed in this pull request?
[PR#25295](https://github.com/apache/spark/pull/25295) already implements the rule that converts the shuffle reader to a local reader on the probe side of a `BroadcastHashJoin`. This PR supports converting the shuffle reader to a local reader on the build side as well.

### Why are the changes needed?
Improve performance

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing unit tests

Closes #26289 from JkSelf/supportTwoSideLocalReader.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 21:28:15 +08:00
maryannxue 4d302cb7ed [SPARK-11150][SQL][FOLLOW-UP] Dynamic partition pruning
### What changes were proposed in this pull request?
This is code cleanup PR for https://github.com/apache/spark/pull/25600, aiming to remove an unnecessary condition and to correct a code comment.

### Why are the changes needed?
For code cleanup only.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Passed existing tests.

Closes #26328 from maryannxue/dpp-followup.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 15:43:02 +08:00
Maxim Gekk 5e9a155eba [SPARK-29520][SS] Fix checks of negative intervals
### What changes were proposed in this pull request?
- Added `getDuration()` to calculate interval duration in specified time units assuming provided days per months
- Added `isNegative()`, which returns `true` if the interval duration is less than 0
- Fix checking negative intervals by using `isNegative()` in structured streaming classes
- Fix checking of `year-months` intervals

### Why are the changes needed?
This fixes incorrect checking of negative intervals. An interval is negative when its duration is negative, not when the interval's months **or** microseconds field is negative. This also fixes the check of `year-month` interval support, because the `month` field can be negative.

### Does this PR introduce any user-facing change?
Should not

### How was this patch tested?
- Added tests for the `getDuration()` and `isNegative()` methods to `IntervalUtilsSuite`
- By existing SS tests

Closes #26177 from MaxGekk/interval-is-positive.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 15:35:04 +08:00
Dongjoon Hyun 095f7b05fd Revert "[SPARK-29277][SQL] Add early DSv2 filter and projection pushdown"
This reverts commit cfc80d0eb1.
2019-10-30 23:11:22 -07:00
zhengruifeng bb478706b5 [SPARK-29645][ML][PYSPARK] ML add param RelativeError
### What changes were proposed in this pull request?
1. Add the shared param `relativeError`.
2. Make `Imputer`/`RobustScaler`/`QuantileDiscretizer` extend `HasRelativeError`.

### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` had already added this param, while the other algorithms had not (see the sketch below).
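
For a sense of the parameter, a hedged sketch using `QuantileDiscretizer`, which already exposed it (column names are hypothetical; per this PR, `Imputer` and `RobustScaler` gain the analogous setting):

```python
from pyspark.ml.feature import QuantileDiscretizer

# A smaller relativeError yields more precise quantiles at the cost of more memory.
qd = QuantileDiscretizer(numBuckets=5, inputCol="value", outputCol="bucket",
                         relativeError=0.001)
```
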

### Does this PR introduce any user-facing change?
Yes, the new param is added to `Imputer`/`RobustScaler`.

### How was this patch tested?
Existing test suites.

Closes #26305 from zhengruifeng/add_relative_err.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-10-31 13:52:28 +08:00
Xianyang Liu 1e599e5005 [SPARK-29582][PYSPARK] Support TaskContext.get() in a barrier task from Python side
### What changes were proposed in this pull request?

Add support of `TaskContext.get()` in a barrier task from Python side, this makes it easier to migrate legacy user code to barrier execution mode.

### Why are the changes needed?

In Spark Core, there is a singleton `TaskContext` object. We set a task context instance, which can be TaskContext or BarrierTaskContext, before the task function starts, and unset it after the function ends, so both TaskContext and BarrierTaskContext can be obtained from it. However, we could previously only get a BarrierTaskContext via `BarrierTaskContext.get`; `TaskContext.get` returned `None` in a barrier stage.

This is useful when people switch from normal code to barrier code, and only need a little update.

### Does this PR introduce any user-facing change?

Yes.
Previously:
```python
def func(iterator):
    task_context = TaskContext.get()  # this could be None
    barrier_task_context = BarrierTaskContext.get() # get the BarrierTaskContext instance
    ...

rdd.barrier().mapPartitions(func)
```

Proposed:
```python
def func(iterator):
    task_context = TaskContext.get()  # this can also return the BarrierTaskContext instance, which is the same as barrier_task_context
    barrier_task_context = BarrierTaskContext.get() # get the BarrierTaskContext instance
    ...

rdd.barrier().mapPartitions(func)
```

### How was this patch tested?

New UT tests.

Closes #26239 from ConeyLiu/barrier_task_context.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 13:10:44 +09:00
HyukjinKwon aa3716896f [SPARK-29668][PYTHON] Add a deprecation warning for Python 3.4 and 3.5
### What changes were proposed in this pull request?

This PR proposes to show a warning for deprecated Python 3.4 and 3.5 in Pyspark.

### Why are the changes needed?

It's officially deprecated.

### Does this PR introduce any user-facing change?

Yes, it shows a warning message for Python 3.4 and 3.5:

```
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/.../spark/python/pyspark/context.py:220: DeprecationWarning: Support for Python 2 and Python 3 prior to version 3.6 is deprecated as of Spark 3.0. See also the plan for dropping Python 2 support at https://spark.apache.org/news/plan-for-dropping-python-2-support.html.
  DeprecationWarning)
...
```

### How was this patch tested?

Manually tested.

Closes #26335 from HyukjinKwon/SPARK-29668.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 20:36:45 -07:00
Terry Kim 3a06c129f4 [SPARK-29592][SQL] ALTER TABLE (set partition location) should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Update `AlterTableSetLocationStatement` to store `partitionSpec` and make `ALTER TABLE a.b.c PARTITION(...) SET LOCATION 'loc'` fail with an "unsupported" message if `partitionSpec` is set.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION(...) SET LOCATION 'loc' // report set location with partition spec is not supported.
```
### Does this PR introduce any user-facing change?

Yes. When running ALTER TABLE (set partition location), Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26304 from imback82/alter_table_partition_loc.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 10:47:43 +08:00
Unknown 401a5f7715 [SPARK-29523][SQL] SHOW COLUMNS should do multi-catalog resolution
### What changes were proposed in this pull request?

Add ShowColumnsStatement and make SHOW COLUMNS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW COLUMNS FROM t // report table not found as there is no table t in the session catalog

### Does this PR introduce any user-facing change?

Yes. When running SHOW COLUMNS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26182 from planga82/feature/SPARK-29523_SHOW_COLUMNS_datasourceV2.

Authored-by: Unknown <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 10:13:12 +08:00
Chris Martin c29494377b [SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide
This PR adds some extra documentation for the new Cogrouped map Pandas udfs.  Specifically:

- Updated the usage guide for the new `COGROUPED_MAP` Pandas udfs added in https://github.com/apache/spark/pull/24981
- Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by HyukjinKwon in https://github.com/apache/spark/pull/25939

Closes #26110 from d80tb7/SPARK-29126-cogroup-udf-usage-guide.

Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-31 10:41:57 +09:00
Maxim Gekk 3206a99870 [SPARK-29651][SQL] Fix parsing of interval seconds fraction
### What changes were proposed in this pull request?
In the PR, I propose to extract parsing of the seconds interval unit into the private method `parseNanos` in `IntervalUtils` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the following cases:
- When the fractional part has less than 9 digits
- The seconds unit is negative

### Why are the changes needed?
The changes are needed to fix the issues:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 microseconds
```
The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds`
```sql
spark-sql> select interval '-10.123456789 seconds';
interval -9 seconds -876 milliseconds -544 microseconds
```
but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds.

### Does this PR introduce any user-facing change?
Yes. After changes:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 milliseconds 456 microseconds
spark-sql> select interval '-10.123456789 seconds';
interval -10 seconds -123 milliseconds -456 microseconds
```

### How was this patch tested?
By existing and new tests in `ExpressionParserSuite`.

Closes #26313 from MaxGekk/fix-interval-nanos-parsing.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-31 09:20:46 +08:00
Ryan Blue cfc80d0eb1 [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown
### What changes were proposed in this pull request?

This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.

To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.

This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.

### Why are the changes needed?

This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.

Closes #25955 from rdblue/move-v2-pushdown.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
2019-10-30 18:07:34 -07:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the Maven release repository accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang 155a67d00c [SPARK-29666][BUILD] Fix the publish release failure under dry-run mode
### What changes were proposed in this pull request?

`release-build.sh` fails to publish a release under dry-run mode with the following error message:
```
/opt/spark-rm/release-build.sh: line 429: pushd: spark-repo-g4MBm/org/apache/spark: No such file or directory
```

We need to at least run the `mvn clean install` command once to create the `$tmp_repo` path, but now those steps are all skipped under dry-run mode. This PR fixes the issue.

### How was this patch tested?

Tested locally.

Closes #26329 from jiangxb1987/dryrun.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 14:57:51 -07:00
Xingbo Jiang fd6cfb1be3 [SPARK-29646][BUILD] Allow pyspark version name format ${versionNumber}-preview in release script
### What changes were proposed in this pull request?

Update `release-build.sh`, to allow pyspark version name format `${versionNumber}-preview`, otherwise the release script won't generate pyspark release tarballs.

### How was this patch tested?

Tested locally.

Closes #26306 from jiangxb1987/buildPython.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-30 14:51:50 -07:00