ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kevin Yu	36f8e53cfa	[SPARK-28802][DOC][SQL] Document DESCRIBE DATABASE statement in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE DATABASE statement in SQL Reference ### Why are the changes needed? To complete the SQL Reference ### Does this PR introduce any user-facing change? Yes #### Before There is no documentation for this command in sql reference #### After ![Screen Shot 2019-09-05 at 12 59 32 PM](https://user-images.githubusercontent.com/7550280/64379235-53aec800-cfe3-11e9-8a51-ea55f0455c47.png) ![Screen Shot 2019-09-05 at 12 59 45 PM](https://user-images.githubusercontent.com/7550280/64379247-58737c00-cfe3-11e9-9a51-f12c5c5bc26a.png) ### How was this patch tested? Used jekyll build and serve to verify Closes #25528 from kevinyu98/sql-ref-describe. Lead-authored-by: Kevin Yu <qyu@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-05 16:23:08 -07:00
Huaxin Gao	e4f70023ad	[SPARK-28830][DOC][SQL] Document UNCACHE TABLE statement in SQL Reference ### What changes were proposed in this pull request? Document UNCACHE TABLE statement in SQL Reference ### Why are the changes needed? To complete SQL Reference ### Does this PR introduce any user-facing change? Yes. After change: ![image](https://user-images.githubusercontent.com/13592258/64299133-e04a7f00-cf2c-11e9-8f39-9b288e46c995.png) ### How was this patch tested? Tested using jekyll build --serve Closes #25540 from huaxingao/spark-28830. Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-04 21:42:01 -07:00
Dilip Biswal	f96486b4aa	[SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference ### What changes were proposed in this pull request? Document SHOW FUNCTIONS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. ![image](https://user-images.githubusercontent.com/11567269/64281840-e3cc0f00-cf08-11e9-9784-f01392276130.png) <img width="589" alt="Screen Shot 2019-09-04 at 11 41 44 AM" src="https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png"> <img width="572" alt="Screen Shot 2019-09-04 at 11 41 54 AM" src="https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25539 from dilipbiswal/ref-doc-show-functions. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-04 11:47:10 -07:00
Dilip Biswal	b992160eae	[SPARK-28811][DOCS][SQL] Document SHOW TBLPROPERTIES in SQL Reference ### What changes were proposed in this pull request? Document SHOW TBLPROPERTIES statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After: ![image](https://user-images.githubusercontent.com/11567269/64281442-fdb92200-cf07-11e9-90ba-4699b6e93e23.png) ![Screen Shot 2019-09-04 at 11 32 11 AM](https://user-images.githubusercontent.com/11567269/64281484-188b9680-cf08-11e9-8e42-f130751ca495.png) ### How was this patch tested? Tested using jekyll build --serve Closes #25571 from dilipbiswal/ref-show-tblproperties. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-04 11:36:45 -07:00
Jungtaek Lim (HeartSaVioR)	594c9c5a3e	[SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer ## What changes were proposed in this pull request? This patch pools both Kafka consumers and fetched data. The overall benefits of the patch are the following: * Both pools support eviction of idle objects, which helps close invalid idle objects whose topic or partition is no longer assigned to any task. * It also enables applying different policies to each pool, which helps optimize pooling per pool. * We were concerned about multiple tasks pointing to the same topic partition as well as the same group id; the existing code can't handle this, hence excess seeks and fetches could happen. This patch properly handles the case. * It also makes the code always safe to leverage the cache, hence there is no need to maintain the reuseCache parameter. Moreover, pooling of Kafka consumers is implemented on top of Apache Commons Pool, which also gives a couple of benefits: * We can get rid of synchronization on the KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer. * We can extract the object pool feature out of the class, so that the behavior of the pool can be tested easily. * We can get various statistics for the object pool, and can also enable JMX for the pool. FetchedData instances are pooled by a custom pool implementation instead of Apache Commons Pool, because they have CacheKey as the first key and "desired offset" as the second key, and the "desired offset" keeps changing - I haven't found any general pool implementation supporting this. This patch brings an additional dependency, Apache Commons Pool 2.6.0, into the `spark-sql-kafka-0-10` module. ## How was this patch tested? Existing unit tests as well as new tests for the object pool. Also ran an experiment to demonstrate concurrent access of consumers to the same topic partition. 
* Changed both sides (master and patch) to log whenever a Kafka consumer is created or records are fetched from Kafka. * Branches: * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging * Test query (doing self-join): * https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2 * Ran the query from spark-shell, using `local[]` to maximize the chance of concurrent access. * Collected the count of Kafka consumers created via the command: `grep "creating new Kafka consumer" logfile | wc -l` * Collected the count of fetch requests on Kafka via the command: `grep "fetching data from Kafka consumer" logfile | wc -l` The topic and data distribution is as follows: ``` truck_speed_events_stream_spark_25151_v1:0:99440 truck_speed_events_stream_spark_25151_v1:1:99489 truck_speed_events_stream_spark_25151_v1:2:397759 truck_speed_events_stream_spark_25151_v1:3:198917 truck_speed_events_stream_spark_25151_v1:4:99484 truck_speed_events_stream_spark_25151_v1:5:497320 truck_speed_events_stream_spark_25151_v1:6:99430 truck_speed_events_stream_spark_25151_v1:7:397887 truck_speed_events_stream_spark_25151_v1:8:397813 truck_speed_events_stream_spark_25151_v1:9:0 ``` The experiment used only the 4 smallest partitions (0, 1, 4, 6) so that the query finished earlier. The results of the experiment are: master: 1986 Kafka consumers created, 2837 fetch requests; patch: 8 Kafka consumers created, 1706 fetch requests. Closes #22138 from HeartSaVioR/SPARK-25151. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-04 10:17:38 -07:00
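The idle-eviction idea described in the SPARK-25151 message can be illustrated with a toy keyed pool. This is only a sketch under stated assumptions: `IdleEvictingPool`, `borrow`, `give_back`, and `evict_idle` are hypothetical names invented here, the clock is passed in explicitly to keep the example deterministic, and none of this is the Apache Commons Pool API or Spark's actual implementation.

```python
from collections import defaultdict


class IdleEvictingPool:
    """Toy keyed object pool that evicts objects idle for too long (hypothetical)."""

    def __init__(self, factory, max_idle_seconds):
        self._factory = factory              # called as factory(key) on a pool miss
        self._max_idle = max_idle_seconds
        self._idle = defaultdict(list)       # key -> [(returned_at, obj), ...]
        self.created = 0                     # how many objects the factory built

    def borrow(self, key):
        """Reuse an idle object for key if one exists, else create a new one."""
        if self._idle[key]:
            _, obj = self._idle[key].pop()
            return obj
        self.created += 1
        return self._factory(key)

    def give_back(self, key, obj, now):
        """Return obj to the pool, recording the time it became idle."""
        self._idle[key].append((now, obj))

    def evict_idle(self, now):
        """Drop objects idle longer than max_idle_seconds; return how many were dropped."""
        evicted = 0
        for key in list(self._idle):
            kept = [(t, o) for t, o in self._idle[key] if now - t <= self._max_idle]
            evicted += len(self._idle[key]) - len(kept)
            if kept:
                self._idle[key] = kept
            else:
                del self._idle[key]
        return evicted
```

Keying the pool by something like (topic, partition) means a task that returns its consumer lets the next task on the same partition reuse it, which is the kind of reuse the consumer-creation counts in the experiment reflect.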
yangjie01	a07f795aea	[SPARK-28577][YARN] Resource capability requested for each executor add offHeapMemorySize ## What changes were proposed in this pull request? If MEMORY_OFFHEAP_ENABLED is true, add MEMORY_OFFHEAP_SIZE to the resources requested for each executor to ensure the instance has enough memory to use. This PR adds a helper method `executorOffHeapMemorySizeAsMb` to `YarnSparkHadoopUtil`. ## How was this patch tested? Added 3 new tests for `YarnSparkHadoopUtil#executorOffHeapMemorySizeAsMb`. Closes #25309 from LuciferYang/spark-28577. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-09-04 09:00:12 -05:00
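The sizing rule in the SPARK-28577 message amounts to simple arithmetic, sketched below. The function names and parameters are hypothetical, and including a separate overhead term in the container request is an assumption made for illustration; this is not the code of `YarnSparkHadoopUtil`.

```python
MIB = 1024 * 1024


def executor_offheap_memory_mb(offheap_enabled, offheap_size_bytes):
    """Off-heap memory to add to the request, in MiB (0 when off-heap is disabled)."""
    if not offheap_enabled:
        return 0
    if offheap_size_bytes <= 0:
        raise ValueError("off-heap size must be positive when off-heap is enabled")
    return offheap_size_bytes // MIB


def executor_container_memory_mb(executor_memory_mb, overhead_mb,
                                 offheap_enabled, offheap_size_bytes):
    """Total memory requested per executor container (assumed: heap + overhead + off-heap)."""
    return (executor_memory_mb
            + overhead_mb
            + executor_offheap_memory_mb(offheap_enabled, offheap_size_bytes))
```

For example, with a 4096 MiB heap, 410 MiB of overhead, and a 2 GiB off-heap size, the sketch requests 6554 MiB when off-heap is enabled and 4506 MiB when it is disabled.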
Huaxin Gao	56f2887dc8	[SPARK-28788][DOC][SQL] Document ANALYZE TABLE statement in SQL Reference ### What changes were proposed in this pull request? Document ANALYZE TABLE statement in SQL Reference ### Why are the changes needed? To complete SQL reference ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After: ![image](https://user-images.githubusercontent.com/13592258/64046883-f8339480-cb21-11e9-85da-6617d5c96412.png) ![image](https://user-images.githubusercontent.com/13592258/64209526-9a6eb780-ce55-11e9-9004-53c5c5d24567.png) ![image](https://user-images.githubusercontent.com/13592258/64209542-a2c6f280-ce55-11e9-8624-e7349204ec8e.png) ### How was this patch tested? Tested using jekyll build --serve Closes #25524 from huaxingao/spark-28788. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-03 15:26:12 -07:00
Xiao Li	2856398de9	[SPARK-28961][HOT-FIX][BUILD] Upgrade Maven from 3.6.1 to 3.6.2 ### What changes were proposed in this pull request? This PR is to upgrade the maven dependence from 3.6.1 to 3.6.2. ### Why are the changes needed? All the builds are broken because 3.6.1 is not available. http://ftp.wayne.edu/apache//maven/maven-3/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/485/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10536/ ![image](https://user-images.githubusercontent.com/11567269/64196667-36d69100-ce39-11e9-8f93-40eb333d595d.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25665 from gatorsmile/upgradeMVN. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-03 11:06:57 -07:00
Dilip Biswal	94e66744a7	[SPARK-28805][DOCS][SQL] Document DESCRIBE FUNCTION in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE FUNCTION statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 09 PM" src="https://user-images.githubusercontent.com/14225158/64148193-85534380-cdd7-11e9-9c07-5956b5e8276e.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 29 PM" src="https://user-images.githubusercontent.com/14225158/64148201-8a17f780-cdd7-11e9-93d8-10ad9932977c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 14 42 PM" src="https://user-images.githubusercontent.com/14225158/64148208-8dab7e80-cdd7-11e9-97c5-3a4ce12cac7a.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25530 from dilipbiswal/ref-doc-desc-function. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-03 09:45:58 -07:00
Dilip Biswal	92ae271081	[SPARK-28806][DOCS][SQL] Document SHOW COLUMNS in SQL Reference ### What changes were proposed in this pull request? Document SHOW COLUMNS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-09-02 at 11 07 48 PM" src="https://user-images.githubusercontent.com/14225158/64148033-0fe77300-cdd7-11e9-93ee-e5951c7ed33c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 08 08 PM" src="https://user-images.githubusercontent.com/14225158/64148039-137afa00-cdd7-11e9-8bec-634ea9d2594c.png"> <img width="1234" alt="Screen Shot 2019-09-02 at 11 11 45 PM" src="https://user-images.githubusercontent.com/14225158/64148046-17a71780-cdd7-11e9-91c3-95a9c97e7a77.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25531 from dilipbiswal/ref-doc-show-columns. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-03 09:39:26 -07:00
Huaxin Gao	585954dbed	[SPARK-28790][DOC][SQL] Document CACHE TABLE statement in SQL Reference ### What changes were proposed in this pull request? Document CACHE TABLE statement in SQL Reference ### Why are the changes needed? To complete SQL Reference ### Does this PR introduce any user-facing change? Yes. Here are the screenshots: ![image](https://user-images.githubusercontent.com/13592258/64072307-26f45c80-cc41-11e9-8ab3-dc56fe8ff45f.png) ![image](https://user-images.githubusercontent.com/13592258/64072309-2cea3d80-cc41-11e9-9a4d-8cb9eb63569f.png) ### How was this patch tested? Tested using jekyll build --serve Closes #25532 from huaxingao/spark-28790. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-01 17:08:09 -07:00
Huaxin Gao	b85a554487	[SPARK-28786][DOC][SQL][FOLLOW-UP] Change "Related Statements" to bold ### What changes were proposed in this pull request? Change "Related Statements" to bold ### Why are the changes needed? To make doc look nice and consistent. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using jykyll build --serve Before the change: ![image](https://user-images.githubusercontent.com/13592258/63965303-ae797a00-ca4d-11e9-8a85-71fbfdeaaccb.png) After the change: ![image](https://user-images.githubusercontent.com/13592258/63965316-b76a4b80-ca4d-11e9-9a85-48d7a909f0ef.png) Before the change: ![image](https://user-images.githubusercontent.com/13592258/63988989-7c8b0680-ca93-11e9-9352-a9ec5457b279.png) After the change: ![image](https://user-images.githubusercontent.com/13592258/63988996-87459b80-ca93-11e9-9e51-8cb36a632436.png) Closes #25623 from huaxingao/spark-28786-n. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-31 14:58:41 -07:00
Dilip Biswal	b4d7b30aa6	[SPARK-28803][DOCS][SQL] Document DESCRIBE TABLE in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE TABLE statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-08-31 at 1 53 35 PM" src="https://user-images.githubusercontent.com/14225158/64069071-f556a380-cbf6-11e9-985d-13dd37a32bbb.png"> <img width="1234" alt="Screen Shot 2019-08-31 at 1 53 50 PM" src="https://user-images.githubusercontent.com/14225158/64069073-f982c100-cbf6-11e9-925b-eb2fc85c3341.png"> <img width="1234" alt="Screen Shot 2019-08-31 at 1 54 02 PM" src="https://user-images.githubusercontent.com/14225158/64069076-0ef7eb00-cbf7-11e9-8062-9a9fb8700bb3.png"> <img width="1234" alt="Screen Shot 2019-08-31 at 1 54 15 PM" src="https://user-images.githubusercontent.com/14225158/64069077-0f908180-cbf7-11e9-9a31-9b7f122db2d3.png"> <img width="1234" alt="Screen Shot 2019-08-31 at 1 54 30 PM" src="https://user-images.githubusercontent.com/14225158/64069078-0f908180-cbf7-11e9-96ee-438a7b64c961.png"> <img width="1234" alt="Screen Shot 2019-08-31 at 1 54 42 PM" src="https://user-images.githubusercontent.com/14225158/64069079-0f908180-cbf7-11e9-9bae-734a1994f936.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25527 from dilipbiswal/ref-doc-desc-table. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-31 14:46:55 -07:00
Unknown	d573e4c482	[SPARK-28542][DOCS][WEBUI] Stages Tab ### What changes were proposed in this pull request? New documentation to explain in detail Web UI Stages page. New images are included to better explanation. ![image](https://user-images.githubusercontent.com/12819544/63807320-c05bff80-c91d-11e9-986f-e09d0b8d4bbb.png) ![image](https://user-images.githubusercontent.com/12819544/63807343-cd78ee80-c91d-11e9-9e4a-2cef3ff70577.png) ![image](https://user-images.githubusercontent.com/12819544/63807363-d9fd4700-c91d-11e9-9691-1d39b0e2c69e.png) ![image](https://user-images.githubusercontent.com/12819544/63807384-e41f4580-c91d-11e9-92bd-cb01aced3752.png) ### Does this PR introduce any user-facing change? Only documentation ### How was this patch tested? I have generated it using "jekyll build" to ensure that it's ok Closes #25598 from planga82/feature/SPARK-28542_ImproveWebUIStagesPage. Lead-authored-by: Unknown <soypab@gmail.com> Co-authored-by: Pablo <soypab@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-31 13:33:44 -05:00
Dilip Biswal	a08f33be68	[SPARK-28804][DOCS][SQL] Document DESCRIBE QUERY in SQL Reference ### What changes were proposed in this pull request? Document DESCRIBE QUERY statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-08-29 at 5 47 51 PM" src="https://user-images.githubusercontent.com/14225158/63985609-43e43080-ca85-11e9-8a1a-c9c15d988e24.png"> <img width="1234" alt="Screen Shot 2019-08-29 at 5 48 06 PM" src="https://user-images.githubusercontent.com/14225158/63985610-46468a80-ca85-11e9-882a-7163784f72c6.png"> <img width="1234" alt="Screen Shot 2019-08-29 at 5 48 18 PM" src="https://user-images.githubusercontent.com/14225158/63985617-49da1180-ca85-11e9-9e77-a6d6c7042a85.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25529 from dilipbiswal/ref-doc-desc-query. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-30 16:05:16 -07:00
Dilip Biswal	fb1053d14a	[SPARK-28807][DOCS][SQL] Document SHOW DATABASES in SQL Reference ### What changes were proposed in this pull request? Document SHOW DATABASES statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After. <img width="1234" alt="Screen Shot 2019-08-28 at 11 43 36 PM" src="https://user-images.githubusercontent.com/14225158/63916727-dd600380-c9ed-11e9-8372-789110c9d2dc.png"> <img width="1234" alt="Screen Shot 2019-08-28 at 11 43 57 PM" src="https://user-images.githubusercontent.com/14225158/63916734-e0f38a80-c9ed-11e9-8ad4-d854febeaab8.png"> <img width="1234" alt="Screen Shot 2019-08-28 at 11 44 13 PM" src="https://user-images.githubusercontent.com/14225158/63916740-e4871180-c9ed-11e9-9cfc-199cd8a64852.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25526 from dilipbiswal/ref-doc-show-db. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-29 09:04:27 -07:00
Huaxin Gao	3e09a0fce9	[SPARK-28786][DOC][SQL] Document INSERT statement in SQL Reference ### What changes were proposed in this pull request? Document INSERT statement in SQL Reference ### Why are the changes needed? To complete SQL reference. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Manually checked newly added doc. Here are the screen shots: ![image](https://user-images.githubusercontent.com/13592258/63490232-0a01a180-c469-11e9-82de-cfdc7c2343e7.png) ![image](https://user-images.githubusercontent.com/13592258/63903006-cce56400-c9c0-11e9-9f24-badd586227a2.png) <img width="1100" alt="Screen Shot 2019-08-27 at 5 01 48 PM" src="https://user-images.githubusercontent.com/13592258/63816303-845c7680-c8ec-11e9-8c36-1b8e4d3e6286.png"> <img width="1100" alt="Screen Shot 2019-08-27 at 5 03 22 PM" src="https://user-images.githubusercontent.com/13592258/63816347-ac4bda00-c8ec-11e9-9470-fa99522e6f14.png"> ![image](https://user-images.githubusercontent.com/13592258/63817393-fc2ca000-c8f0-11e9-9d66-dd9b22a9d900.png) <img width="1102" alt="Screen Shot 2019-08-27 at 5 05 13 PM" src="https://user-images.githubusercontent.com/13592258/63816423-ea48fe00-c8ec-11e9-8f66-5b226a1ff693.png"> ![image](https://user-images.githubusercontent.com/13592258/63903080-0e760f00-c9c1-11e9-966a-f45b0b1c1ea6.png) <img width="1100" alt="Screen Shot 2019-08-27 at 5 07 19 PM" src="https://user-images.githubusercontent.com/13592258/63816494-37c56b00-c8ed-11e9-88e1-27a9101eb09d.png"> ![image](https://user-images.githubusercontent.com/13592258/63816712-131dc300-c8ee-11e9-8ee7-d83b8ad07bf2.png) ![image](https://user-images.githubusercontent.com/13592258/63817479-5a598300-c8f1-11e9-8789-adae7df5535a.png) ![image](https://user-images.githubusercontent.com/13592258/63817900-4adb3980-c8f3-11e9-94fe-d60f7d61c4b4.png) ![image](https://user-images.githubusercontent.com/13592258/63903155-4da46000-c9c1-11e9-88dd-609d4fe685a9.png) 
![image](https://user-images.githubusercontent.com/13592258/63817157-d652cb80-c8ef-11e9-944c-99391cf2fb0a.png) ![image](https://user-images.githubusercontent.com/13592258/63903259-aa077f80-c9c1-11e9-982f-b8590ce0270d.png) ![image](https://user-images.githubusercontent.com/13592258/63903270-b1c72400-c9c1-11e9-85c6-6d8e8cd7f006.png) Closes #25525 from huaxingao/spark-28786. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-29 09:00:42 -07:00
Dilip Biswal	74527868b2	[SPARK-28789][DOCS][SQL] Document ALTER DATABASE command ### What changes were proposed in this pull request? Document ALTER DATABASE statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After: <img width="1234" alt="Screen Shot 2019-08-28 at 1 51 13 PM" src="https://user-images.githubusercontent.com/14225158/63891854-fc817580-c99a-11e9-918e-6b305edf92e6.png"> <img width="1234" alt="Screen Shot 2019-08-28 at 1 51 27 PM" src="https://user-images.githubusercontent.com/14225158/63891869-0acf9180-c99b-11e9-91a4-04d870474a40.png"> ### How was this patch tested? Tested using jekyll build --serve Closes #25523 from dilipbiswal/ref-doc-alterdb. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-28 15:30:38 -07:00
Yuming Wang	1b404b9b99	[SPARK-28890][SQL] Upgrade Hive Metastore Client to the 3.1.2 for Hive 3.1 ### What changes were proposed in this pull request? Hive 3.1.2 has been released. This PR upgrades the Hive Metastore Client to 3.1.2 for Hive 3.1. Hive 3.1.2 release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843 ### Why are the changes needed? This is an improvement to support the newly released 3.1.2. Otherwise, Spark throws `UnsupportedOperationException` if the user sets `spark.sql.hive.metastore.version=3.1.2`: ```scala Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported Hive Metastore version (3.1.2). Please set spark.sql.hive.metastore.version with a valid version. at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:109) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT Closes #25604 from wangyum/SPARK-28890. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-28 09:16:54 -07:00
zhengruifeng	3e7b0e1dd6	[SPARK-28539][WEBUI][DOC] Document Executors page ### What changes were proposed in this pull request? 1, add a basic doc for executor page 2, btw, move the version number in the document of SQL page outside ### Why are the changes needed? Spark web UIs are being used to monitor the status and resource consumption of your Spark applications and clusters. However, we do not have the corresponding document. It is hard for end users to use and understand them. ### Does this PR introduce any user-facing change? yes, the doc is changed ### How was this patch tested? locally build <img width="468" alt="图片" src="https://user-images.githubusercontent.com/7322292/63758724-d2727980-c8ee-11e9-8380-cbae51453629.png"> Closes #25596 from zhengruifeng/doc_ui_exe. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-28 08:34:24 -05:00
WeichenXu	7f605f5559	[SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true ### What changes were proposed in this pull request? Make `spark.sql.crossJoin.enabled` default value true ### Why are the changes needed? For an implicit cross join, we can set up a watchdog to cancel it if it runs for a long time. When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in the logical plan stage, it may generate mismatching errors which may confuse end users: * it's done in the logical phase, so we may fail queries that can be executed via broadcast join, which is very fast. * if we move the check to the physical phase, then a query may succeed at the beginning, and begin to fail when the table size gets larger (other people insert data into the table). This can be quite confusing. * the CROSS JOIN syntax doesn't work well if join reorder happens. * some non-equi-joins will generate a plan using cartesian product, but `CheckCartesianProducts` does not detect it and raise an error. So, in order to address this in a simpler way, we can turn off showing this cross-join error by default. For reference, I list some cases raising mismatching errors here: Providing: ``` spark.range(2).createOrReplaceTempView("sm1") // can be broadcast spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast ``` 1) Some joins could be converted to broadcast nested loop join, but CheckCartesianProducts raises an error. e.g. ``` select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id ``` 2) Some joins will be run by CartesianJoin but CheckCartesianProducts DOES NOT raise an error. e.g. ``` select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id ``` ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #25520 from WeichenXu123/SPARK-28621. 
Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:53:37 +08:00
cyq89051127	4cf81285da	[SPARK-28871][MINOR][DOCS] WaterMark doc fix ### What changes were proposed in this pull request? The code style in the 'Policy for handling multiple watermarks' in structured-streaming-programming-guide.md ### Why are the changes needed? Making it look friendly to user. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? cd docs SKIP_API=1 jekyll build Closes #25580 from cyq89051127/master. Authored-by: cyq89051127 <chaiyq@asiainfo.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-27 08:13:39 -05:00
Yuming Wang	02a0cdea13	[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile ### What changes were proposed in this pull request? This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`. Hive 2.3.6 release notes: - [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader) - [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations) - [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version ### Why are the changes needed? Make Spark support JDK 11. ### Does this PR introduce any user-facing change? Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details. ### How was this patch tested? Existing unit test and manual test. Closes #25443 from wangyum/test-on-jenkins. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-23 21:34:30 -07:00
zhengruifeng	bdef7125b7	[SPARK-28540][WEBUI] Document Environment page ## What changes were proposed in this pull request? Document Environment page ## How was this patch tested? locally building ![图片](https://user-images.githubusercontent.com/7322292/63237759-e3c7e000-c275-11e9-8e1f-57ed1b0e86e8.png) Closes #25430 from zhengruifeng/doc_ui_conf. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-21 10:48:48 -05:00
zhengruifeng	c4257b18a1	[SPARK-28541][WEBUI] Document Storage page ## What changes were proposed in this pull request? add an example for storage tab ## How was this patch tested? locally building Closes #25445 from zhengruifeng/doc_ui_storage. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-20 20:05:13 -05:00
Dhruve Ashar	a50959a7f6	[SPARK-27937][CORE] Revert partial logic for auto namespace discovery ## What changes were proposed in this pull request? This change reverts the logic which was introduced as a part of SPARK-24149 and a subsequent followup PR. With existing logic: - Spark fails to launch with HDFS federation enabled while trying to get a path to a logical nameservice. - It gets tokens for unrelated namespaces if they are used in HDFS Federation - Automatic namespace discovery is supported only if these are on the same cluster. Rationale for change: - For accessing data from related namespaces, viewfs should handle getting tokens for spark - For accessing data from unrelated namespaces, the user explicitly specifies them using existing configs, as these could be on the same or a different cluster. Revert the changes. ## How was this patch tested? Ran a few manual tests and unit tests. Closes #24785 from dhruve/bug/SPARK-27937. Authored-by: Dhruve Ashar <dhruveashar@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-08-20 12:42:35 -07:00
Jungtaek Lim (HeartSaVioR)	b37c8d5cea	[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter # What changes were proposed in this pull request? This patch modifies the explanation of the guarantee for ForeachWriter, as it doesn't guarantee the same output for `(partitionId, epochId)`. Refer to the description of [SPARK-28650](https://issues.apache.org/jira/browse/SPARK-28650) for more details. Spark itself still guarantees the same output for the same epochId (batch) if the preconditions are met: 1) the source always provides the same input records for the same offset request. 2) the query is idempotent overall (non-deterministic calculations like now(), random() can break this). Treating broken preconditions as an exceptional case (the preconditions were implicitly required even before), we can still describe the guarantee with `epochId`, though it will be harder to leverage the guarantee: 1) ForeachWriter should implement a feature to track whether all the partitions are written successfully for a given `epochId` 2) There's little chance to leverage the fact, as the chance for Spark to successfully write all partitions and then fail to checkpoint the batch is small. Credit to zsxwing for discovering the broken guarantee. ## How was this patch tested? This is just a documentation change, both in the javadoc and the guide doc. Closes #25407 from HeartSaVioR/SPARK-28650. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-08-20 00:56:53 -07:00
Dilip Biswal	a5df5ff0fd	[SPARK-28734][DOC] Initial table of content in the left hand side bar for SQL doc ## What changes were proposed in this pull request? This is an initial PR that creates the table of contents for the SQL reference guide. The left sidebar will display additional menu items corresponding to supported SQL constructs. Once this PR is merged, we will fill in the content incrementally. Additionally, this PR contains a minor change to make the left sidebar scrollable. Currently it is not possible to scroll in the left hand side window. ## How was this patch tested? Used jekyll build and serve to verify. Closes #25459 from dilipbiswal/ref-doc. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-18 23:17:50 -07:00
Yizhong Zhang	c097c555ac	[SPARK-21067][DOC] Fix Thrift Server - CTAS fail with Unable to move source ## What changes were proposed in this pull request? This PR aims to fix CTAS fails after we closed a session of ThriftServer. - sql-distributed-sql-engine.md ![image](https://user-images.githubusercontent.com/25916266/62509628-6f854980-b83e-11e9-9bea-daaf76c8f724.png) It seems the simplest way to fix [[SPARK-21067]](https://issues.apache.org/jira/browse/SPARK-21067). For example : If we use HDFS, we can set the following property in hive-site.xml. `<property>` ` <name>fs.hdfs.impl.disable.cache</name>` ` <value>true</value>` `</property>` ## How was this patch tested Manual. Closes #25364 from Deegue/fix_add_doc_file_system. Authored-by: Yizhong Zhang <zyzzxycj@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-18 15:55:43 -05:00
Dongjoon Hyun	123eb58d61	[MINOR][DOC] Use `Java 8` instead of `Java 8+` as a running environment ## What changes were proposed in this pull request? After Apache Spark 3.0.0 supports JDK11 officially, people will try JDK11 on old Spark releases (especially 2.4.4/2.3.4) in the same way because our document says `Java 8+`. We had better avoid that misleading situation. This PR aims to remove `+` from `Java 8+` in the documentation (master/2.4/2.3). Especially, 2.4.4 release and 2.3.4 release (cc kiszk ) On master branch, we will add JDK11 after [SPARK-24417.](https://issues.apache.org/jira/browse/SPARK-24417) ## How was this patch tested? This is a documentation only change. <img width="923" alt="java8" src="https://user-images.githubusercontent.com/9700541/63116589-e1504800-bf4e-11e9-8904-b160ec7a42c0.png"> Closes #25466 from dongjoon-hyun/SPARK-DOC-JDK8. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-15 11:22:57 -07:00
Steve Loughran	2ac6163a5d	[SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2] This patch adds the binding classes to enable spark to switch dataframe output to using the S3A zero-rename committers shipping in Hadoop 3.1+. It adds a source tree into the hadoop-cloud-storage module which only compiles with the hadoop-3.2 profile, and contains a binding for normal output and a specific bridge class for Parquet (as the parquet output format requires a subclass of `ParquetOutputCommitter`). Commit algorithms are a critical topic. There's no formal proof of correctness, but the algorithms are documented and analysed in [A Zero Rename Committer](https://github.com/steveloughran/zero-rename-committer/releases). This also reviews the classic v1 and v2 algorithms, IBM's swift committer and the one from EMRFS which they admit was based on the concepts implemented here. Test-wise * There's a public set of scala test suites [on github](https://github.com/hortonworks-spark/cloud-integration) * We have run integration tests against Spark on Yarn clusters. * This code has been shipping for ~12 months in HDP-3.x. Closes #24970 from steveloughran/cloud/SPARK-23977-s3a-committer. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-08-15 09:39:26 -07:00
Unknown	3f35440304	[SPARK-28543][DOCS][WEBUI] Document Spark Jobs page ## What changes were proposed in this pull request? New documentation that explains the Web UI Jobs page in detail and links it to the monitoring page. New images are included for better explanation ![image](https://user-images.githubusercontent.com/12819544/62898145-2741bc00-bd55-11e9-89f7-175a4fd81009.png) ![image](https://user-images.githubusercontent.com/12819544/62898187-39235f00-bd55-11e9-9f03-a4d179e197fe.png) ## How was this patch tested? This pull request contains only documentation. I have generated it using "jekyll build" to ensure that it's ok Closes #25424 from planga82/feature/SPARK-28543_ImproveWebUIDocs. Lead-authored-by: Unknown <soypab@gmail.com> Co-authored-by: Pablo <soypab@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-15 08:52:23 -05:00
Dilip Biswal	331f2657d9	[SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively ## What changes were proposed in this pull request? Here is the problem description from the JIRA. ``` When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results. SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('1')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('infinity')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('-infinity'), ('infinity')) v(x); The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way. Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html ``` In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner. ## How was this patch tested? Added tests in CastSuite and modified existing test suites. Closes #25331 from dilipbiswal/double_infinity. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 16:48:30 -07:00
zhengruifeng	ae4edd5489	[SPARK-28538][UI] Document SQL page ## What changes were proposed in this pull request? 1. Add a basic doc for each page; 2. Document the SQL page with an example. ## How was this patch tested? locally built ![image](https://user-images.githubusercontent.com/7322292/62421626-86f5f280-b6d7-11e9-8057-8be3a4afb611.png) ![image](https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png) Closes #25349 from zhengruifeng/doc_ui_sql. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-12 08:36:01 -05:00
Kousuke Saruta	31ef268bae	[SPARK-28639][CORE][DOC] Configuration doc for Barrier Execution Mode ## What changes were proposed in this pull request? SPARK-24817 and SPARK-24819 introduced three new non-internal properties for barrier execution mode, but they are not documented. So I've added a section to configuration.md for barrier execution mode. ## How was this patch tested? Built using jekyll and confirmed the layout in a browser. Closes #25370 from sarutak/barrier-exec-mode-conf-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-11 08:13:19 -05:00
wuyi	cbad616d4c	[SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone ## What changes were proposed in this pull request? In this PR, we implements a complete process of GPU-aware resources scheduling in Standalone. The whole process looks like: Worker sets up isolated resources when it starts up and registers to master along with its resources. And, Master picks up usable workers according to driver/executor's resource requirements to launch driver/executor on them. Then, Worker launches the driver/executor after preparing resources file, which is created under driver/executor's working directory, with specified resource addresses(told by master). When driver/executor finished, their resources could be recycled to worker. Finally, if a worker stops, it should always release its resources firstly. For the case of Workers and Drivers in client mode run on the same host, we introduce a config option named `spark.resources.coordinate.enable`(default true) to indicate whether Spark should coordinate resources for user. If `spark.resources.coordinate.enable=false`, user should be responsible for configuring different resources for Workers and Drivers when use resourcesFile or discovery script. If true, Spark would help user to assign different resources for Workers and Drivers. The solution for Spark to coordinate resources among Workers and Drivers is: Generally, use a shared file named ____allocated_resources____.json to sync allocated resources info among Workers and Drivers on the same host. After a Worker or Driver found all resources using the configured resourcesFile and/or discovery script during launching, it should filter out available resources by excluding resources already allocated in ____allocated_resources____.json and acquire resources from available resources according to its own requirement. After that, it should write its allocated resources along with its process id (pid) into ____allocated_resources____.json. 
The pid (proposed by tgravescs) is used here to check whether the allocated resources are still valid in case a Worker or Driver crashes and doesn't release its resources properly. When a Worker or Driver finishes normally, it always cleans up its own allocated resources in ____allocated_resources____.json. Note that we always acquire a file lock before any access to the file ____allocated_resources____.json and release the lock afterwards. Furthermore, we appended resources info in `WorkerSchedulerStateResponse` to work around master change behaviour in HA mode. ## How was this patch tested? Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite. Manually tested with client/cluster mode (e.g. multiple workers) in a single node Standalone. Closes #25047 from Ngone51/SPARK-27371. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-08-09 07:49:03 -05:00
Anton Yanchenko	bda5b51576	[SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` ## What changes were proposed in this pull request? Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`. ## How was this patch tested? Doctests / unittests / manual tests. Unpatched version: ``` In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() Out[23]: [Row(x=None)] ``` Patched: ``` In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-5-c1740fcadbf9> in <module> ----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 689 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 690 else: --> 691 rdd, schema = self._createFromLocal(map(prepare, data), schema) 692 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 693 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 405 # make sure data could consumed multiple times 406 if not isinstance(data, list): --> 407 data = list(data) 408 409 if schema is None or isinstance(schema, (list, tuple)): /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj) 671 672 def prepare(obj): --> 673 verify_func(obj) 674 return obj 675 elif isinstance(schema, DataType): /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj) 
1397 if isinstance(obj, dict): 1398 for f, verifier in verifiers: -> 1399 verifier(obj.get(f)) 1400 elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): 1401 # the order in obj could be different than dataType.fields /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj) 1356 if obj < -9223372036854775808 or obj > 9223372036854775807: 1357 raise ValueError( -> 1358 new_msg("object of LongType out of range, got: %s" % obj)) 1359 1360 verify_value = verify_long ValueError: field x: object of LongType out of range, got: 18446744073709551616 ``` Closes #25117 from simplylizz/master. Authored-by: Anton Yanchenko <simplylizz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-08 11:47:25 +09:00
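The range check described in this commit can be sketched in plain Python without Spark. This is only an illustration: `verify_long`, `LONG_MIN` and `LONG_MAX` are hypothetical stand-alone names modeled on the traceback above, not the actual PySpark internals.

```python
# Bounds of a signed 64-bit integer, the same limits that appear in the traceback.
LONG_MIN = -9223372036854775808
LONG_MAX = 9223372036854775807


def verify_long(obj, field="x"):
    """Hypothetical helper: raise ValueError if obj does not fit in a 64-bit long."""
    if obj < LONG_MIN or obj > LONG_MAX:
        raise ValueError(
            "field %s: object of LongType out of range, got: %s" % (field, obj))
    return obj
```

Without such a check, a value like `1 << 64` is accepted on the Python side and surfaces as `None` in the resulting row, as the unpatched output shows.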
Wenchen Fan	6fb79af48c	[SPARK-28344][SQL] detect ambiguous self-join and fail the query ## What changes were proposed in this pull request? This is an alternative solution of https://github.com/apache/spark/pull/24442 . It fails the query if ambiguous self join is detected, instead of trying to disambiguate it. The problem is that, it's hard to come up with a reasonable rule to disambiguate, the rule proposed by #24442 is mostly a heuristic. ### background of the self-join problem: This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list. A typical example: ``` val df1 = … val df2 = df1.filter(...) df1.join(df2, df1("a") > df2("a")) // returns empty result ``` The root cause is, `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset at anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference` . Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator reserves the output `AttributeReference` of its child. This means, `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false. ### The rule to detect ambiguous column reference caused by self join: We can reuse the infra in #24442 : 1. each Dataset has a globally unique id. 2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position(e.g. 3rd column of the Dataset) via metadata. 3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag` When self-join happens, the analyzer asks the right side plan of join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is: 1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan. 2. 
for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference. 3. get the corresponding output attribute of the sub-plan by the col position in the column reference. 4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined. ## How was this patch tested? existing tests and new test cases Closes #25107 from cloud-fan/new-self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-06 10:06:36 +08:00
Dongjoon Hyun	4856c0e33a	[SPARK-28609][DOC] Fix broken styles/links and make up-to-date ## What changes were proposed in this pull request? This PR aims to fix the broken styles/links and make the doc up-to-date for Apache Spark 2.4.4 and 3.0.0 release. - `building-spark.md` ![Screen Shot 2019-08-02 at 10 33 51 PM](https://user-images.githubusercontent.com/9700541/62407962-a248ec80-b575-11e9-8a16-532e9bc421f8.png) - `configuration.md` ![Screen Shot 2019-08-02 at 10 34 52 PM](https://user-images.githubusercontent.com/9700541/62407969-c7d5f600-b575-11e9-9b1a-a76c6cc095c5.png) - `sql-pyspark-pandas-with-arrow.md` ![Screen Shot 2019-08-02 at 10 36 14 PM](https://user-images.githubusercontent.com/9700541/62407979-18e5ea00-b576-11e9-99af-7ad9264656ae.png) - `streaming-programming-guide.md` ![Screen Shot 2019-08-02 at 10 37 11 PM](https://user-images.githubusercontent.com/9700541/62407981-213e2500-b576-11e9-8bc5-a925df7e98a7.png) - `structured-streaming-programming-guide.md` (1/2) ![Screen Shot 2019-08-02 at 10 38 20 PM](https://user-images.githubusercontent.com/9700541/62408001-49c61f00-b576-11e9-9519-f699775ceecd.png) - `structured-streaming-programming-guide.md` (2/2) ![Screen Shot 2019-08-02 at 10 40 05 PM](https://user-images.githubusercontent.com/9700541/62408017-7f6b0800-b576-11e9-9341-52664ba6b460.png) - `submitting-applications.md` ![Screen Shot 2019-08-02 at 10 41 13 PM](https://user-images.githubusercontent.com/9700541/62408027-b2ad9700-b576-11e9-910e-8f22173e1251.png) ## How was this patch tested? Manual. Build the doc. ``` SKIP_API=1 jekyll build ``` Closes #25345 from dongjoon-hyun/SPARK-28609. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-04 09:42:47 -07:00
Jungtaek Lim (HeartSaVioR)	7ffc00ccc3	[MINOR][DOC][SS] Correct description of minPartitions in Kafka option ## What changes were proposed in this pull request? `minPartitions` has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value. `d67b98ea01/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala (L32-L46)` This patch makes clear the configuration is a hint, and actual partitions could be less or more. ## How was this patch tested? Just a documentation change. Closes #25332 from HeartSaVioR/MINOR-correct-kafka-structured-streaming-doc-minpartition. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-02 09:12:54 -07:00
Sean Owen	b148bd5ccb	[SPARK-28519][SQL] Use StrictMath log, pow functions for platform independence ## What changes were proposed in this pull request? See discussion on the JIRA (and dev). At heart, we find that math.log and math.pow can actually return slightly different results across platforms because of hardware optimizations. For the actual SQL log and pow functions, I propose that we should use StrictMath instead to ensure the answers are already the same. (This should have the benefit of helping tests pass on aarch64.) Further, the atanh function (which is not part of java.lang.Math) can be implemented in a slightly different and more accurate way. ## How was this patch tested? Existing tests (which will need to be changed). Some manual testing locally to understand the numeric issues. Closes #25279 from srowen/SPARK-28519. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-02 10:55:44 -05:00
Nick Karpov	6d32deeecc	[SPARK-28475][CORE] Add regex MetricFilter to GraphiteSink ## What changes were proposed in this pull request? Today all registered metric sources are reported to GraphiteSink with no filtering mechanism, although the codahale project does support it. GraphiteReporter (ScheduledReporter) from the codahale project requires you implement and supply the MetricFilter interface (there is only a single implementation by default in the codahale project, MetricFilter.ALL). Propose to add an additional regex config to match and filter metrics to the GraphiteSink ## How was this patch tested? Included a GraphiteSinkSuite that tests: 1. Absence of regex filter (existing default behavior maintained) 2. Presence of `regex=<regexexpr>` correctly filters metric keys Closes #25232 from nkarpov/graphite_regex. Authored-by: Nick Karpov <nick@nickkarpov.com> Signed-off-by: jerryshao <jerryshao@tencent.com>	2019-08-02 17:50:15 +08:00
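The idea of a regex-based metric filter can be sketched in Python as follows. This is not the GraphiteSink or codahale API: `make_metric_filter` is a hypothetical helper, and matching the pattern anywhere in the metric name is an assumption made for illustration.

```python
import re


def make_metric_filter(regex=None):
    """Hypothetical helper: build a predicate over metric names.

    With no regex, every metric passes, which mirrors the existing default
    behavior of reporting all registered metrics.
    """
    if regex is None:
        return lambda name: True
    pattern = re.compile(regex)
    # Assumption: a metric is kept when the pattern is found anywhere in its name.
    return lambda name: pattern.search(name) is not None


# Example metric names are made up for this sketch.
metrics = ["app.driver.jvm.heap.used", "app.driver.BlockManager.memory.maxMem_MB"]
keep = make_metric_filter(r"jvm\.")
reported = [m for m in metrics if keep(m)]
```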
zhengruifeng	b29829e2ab	[SPARK-25584][ML][DOC] datasource for libsvm user guide ## What changes were proposed in this pull request? It seems that the doc for the libsvm datasource was not added in https://github.com/apache/spark/pull/22675. This PR adds it. ## How was this patch tested? doc built locally ![image](https://user-images.githubusercontent.com/7322292/62044350-4ad51480-b235-11e9-8f09-cbcbe9d3b7f9.png) Closes #25286 from zhengruifeng/doc_libsvm_data_source. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-01 09:15:42 -05:00
gengjiaan	d03ec65f01	[SPARK-27924][SQL] Support ANSI SQL Boolean-Predicate syntax ## What changes were proposed in this pull request? This PR aims to support ANSI SQL `Boolean-Predicate` syntax. ```sql expression IS [NOT] TRUE expression IS [NOT] FALSE expression IS [NOT] UNKNOWN ``` Several mainstream databases support this syntax. - PostgreSQL: https://www.postgresql.org/docs/9.1/functions-comparison.html - Hive: https://issues.apache.org/jira/browse/HIVE-13583 - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html - Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm For example: ```sql spark-sql> select null is true, null is not true; false true spark-sql> select false is true, false is not true; false true spark-sql> select true is true, true is not true; true false spark-sql> select null is false, null is not false; false true spark-sql> select false is false, false is not false; true false spark-sql> select true is false, true is not false; false true spark-sql> select null is unknown, null is not unknown; true false spark-sql> select false is unknown, false is not unknown; false true spark-sql> select true is unknown, true is not unknown; false true ``` Note: A null input is treated as the logical value "unknown". ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #25074 from beliefer/ansi-sql-boolean-test. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-30 23:59:50 -07:00
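The truth table in this commit can be reproduced with a small Python sketch, assuming SQL `NULL` is modeled as Python `None`. The function names are hypothetical and exist only for this illustration; the key point is that these predicates always return a plain boolean, never "unknown".

```python
def is_true(v):
    """expression IS TRUE: only a real True qualifies; None (unknown) does not."""
    return v is True


def is_false(v):
    """expression IS FALSE: only a real False qualifies."""
    return v is False


def is_unknown(v):
    """expression IS UNKNOWN: a null input is treated as the logical value unknown."""
    return v is None


def negate(pred):
    """Build the IS NOT variant of a predicate."""
    return lambda v: not pred(v)


is_not_true = negate(is_true)
is_not_false = negate(is_false)
is_not_unknown = negate(is_unknown)
```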
gengjiaan	dba4375359	[MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs at http://spark.apache.org/docs/latest/configuration.html contain the following description: spark.ui.showConsoleProgress \| true \| Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. -- \| -- \| -- But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So I think there is a small mistake in the docs that may confuse readers. ## How was this patch tested? No need UT. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-31 12:17:44 +09:00
zhengruifeng	44c28d7515	[SPARK-28399][ML][PYTHON] implement RobustScaler ## What changes were proposed in this pull request? Implement `RobustScaler` Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers. ## How was this patch tested? existing and added tests Closes #25160 from zhengruifeng/robust_scaler. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-30 10:24:33 -05:00
Junjie Chen	780d176136	[SPARK-28042][K8S] Support using volume mount as local storage ## What changes were proposed in this pull request? This PR adds support for using hostPath/PV volume mounts as local storage. In KubernetesExecutorBuilder.scala, the LocalDirsFeatureStep is built before MountVolumesFeatureStep, which means we cannot use any volumes mounted later. This PR adjusts the order of the feature building steps, moving localDirsFeature last, so that we can check whether the directories in SPARK_LOCAL_DIRS are set to mounted volumes such as hostPath, PV, or others. ## How was this patch tested? Unit tests Closes #24879 from chenjunjiedada/SPARK-28042. Lead-authored-by: Junjie Chen <jimmyjchen@tencent.com> Co-authored-by: Junjie Chen <cjjnjust@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-29 10:44:17 -07:00
Lee Dongjin	d98aa2a184	[MINOR] Trivial cleanups These are what I found during working on #22282. - Remove unused value: `UnsafeArraySuite#defaultTz` - Remove redundant new modifier to the case class, `KafkaSourceRDDPartition` - Remove unused variables from `RDD.scala` - Remove trailing space from `structured-streaming-kafka-integration.md` - Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default. - Remove leading empty line: `UnsafeRow` - Remove trailing empty line: `KafkaTestUtils` - Remove unthrown exception type: `UnsafeMapData` - Replace unused declarations: `expressions` - Remove duplicated default parameter: `AnalysisErrorSuite` - `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable Closes #25251 from dongjinleekr/cleanup/201907. Authored-by: Lee Dongjin <dongjin@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 23:38:02 +09:00
Luca Canali	f2a2d980ed	[SPARK-25285][CORE] Add startedTasks and finishedTasks to the metrics system in the executor instance ## What changes were proposed in this pull request? The motivation for these additional metrics is to help in troubleshooting and monitoring task execution workload when running on a cluster. Currently available metrics include executor threadpool metrics for task completed and for active tasks. The addition of the threadpool taskStarted metric will allow, for example, collecting info on the (approximate) number of failed tasks by computing the difference thread started - (active threads + completed tasks and/or successfully finished tasks). The proposed metric finishedTasks is also intended for this type of troubleshooting. The difference between finishedTasks and threadpool.completeTasks is that the latter is a (dropwizard library) gauge taken from the threadpool, while the former is a (dropwizard) counter computed in the [[Executor]] class, when a task successfully finishes, together with several other task metrics counters. Note, there are similarities with some of the metrics introduced in SPARK-24398, however there are key differences, coming from the fact that this PR concerns the executor source, therefore providing metric values per executor, and the metric values do not need to pass through the listener bus in this case. ## How was this patch tested? Manually tested on a YARN cluster Closes #22290 from LucaCanali/AddMetricExecutorStartedTasks. Lead-authored-by: Luca Canali <luca.canali@cern.ch> Co-authored-by: LucaCanali <luca.canali@cern.ch> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-26 14:03:57 -07:00
Douglas R Colkitt	8fc5cb6285	[SPARK-28473][DOC] Stylistic consistency of build command in README ## What changes were proposed in this pull request? Change the format of the build command in the README to start with a `./` prefix ./build/mvn -DskipTests clean package This increases stylistic consistency across the README- all the other commands have a `./` prefix. Having a visible `./` prefix also makes it clear to the user that the shell command requires the current working directory to be at the repository root. ## How was this patch tested? README.md was reviewed both in raw markdown and in the Github rendered landing page for stylistic consistency. Closes #25231 from Mister-Meeseeks/master. Lead-authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com> Co-authored-by: Mister-Meeseeks <douglas.colkitt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-23 16:29:46 -07:00
HyukjinKwon	e3f7ca37db	[SPARK-28321][DOCS][FOLLOW-UP] Update migration guide by 0-args Java UDF's internal behaviour change ## What changes were proposed in this pull request? This PR proposes to add a note in the migration guide. See https://github.com/apache/spark/pull/25108#issuecomment-513526585 ## How was this patch tested? N/A Closes #25224 from HyukjinKwon/SPARK-28321-doc. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-22 16:33:31 +08:00
Dongjoon Hyun	c97f06de94	[SPARK-25705][DOC][FOLLOWUP] Recover links to structured-streaming-kafka-integration ## What changes were proposed in this pull request? This PR is a follow-up PR to recover three links from [the previous commit](https://github.com/apache/spark/pull/22703/files#diff-21245da8f8dbfef6401c5500f559f0bc). Currently, those three are broken. ``` $ git grep structured-streaming-kafka-0-10-integration structured-streaming-programming-guide.md: - Kafka source - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-0-10-integration.html) for more details. structured-streaming-programming-guide.md: See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a>. structured-streaming-programming-guide.md: <td>See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a></td> ``` It's because we have `structured-streaming-kafka-integration.html` instead of `structured-streaming-kafka-0-10-integration.html`. ``` $ find . -name structured-streaming-kafka-0-10-integration.md $ find . -name structured-streaming-kafka-integration.md ./structured-streaming-kafka-integration.md ``` ## How was this patch tested? Manual. Closes #25221 from dongjoon-hyun/SPARK-25705. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-22 11:22:06 +09:00
Arun Pandian	a0a58cf2ef	[SPARK-28464][DOC][SS] Document Kafka source minPartitions option Adding doc for the kafka source minPartitions option to "Structured Streaming + Kafka Integration Guide" The text is based on the content in https://docs.databricks.com/spark/latest/structured-streaming/kafka.html#configuration Closes #25219 from arunpandianp/SPARK-28464. Authored-by: Arun Pandian <apandian@groupon.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-21 13:13:30 -07:00
HyukjinKwon	0512af1668	[SPARK-28389][SQL][FOLLOW-UP] Use one example in 'add_months' behavior change at migration guide ## What changes were proposed in this pull request? This PR proposes to add one example to describe 'add_months' behaviour change by https://github.com/apache/spark/pull/25153. Spark 2.4: ```sql select add_months(DATE'2019-02-28', 1) ``` ``` +--------------------------------+ \|add_months(DATE '2019-02-28', 1)\| +--------------------------------+ \| 2019-03-31\| +--------------------------------+ ``` Current master: ```sql select add_months(DATE'2019-02-28', 1) ``` ``` +--------------------------------+ \|add_months(DATE '2019-02-28', 1)\| +--------------------------------+ \| 2019-03-28\| +--------------------------------+ ``` ## How was this patch tested? Manually tested on Spark 2.4.1 and the current master. Closes #25199 from HyukjinKwon/SPARK-28389. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-19 14:29:16 +09:00
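The two `add_months` behaviours shown in this commit can be contrasted with a short Python sketch built on the standard library. The function and its `adjust_last_day` flag are hypothetical and only model the difference: with the flag on, a last-day-of-month input is moved to the last day of the target month (the Spark 2.4 output above); with it off, the day of month is kept and only clamped when the target month is shorter (the current master output above).

```python
import calendar
from datetime import date


def add_months(d, n, adjust_last_day=False):
    """Hypothetical model of add_months; adjust_last_day=True mimics Spark 2.4."""
    # Convert to a zero-based month count so that negative offsets also work.
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day_target = calendar.monthrange(year, month)[1]
    input_is_last_day = d.day == calendar.monthrange(d.year, d.month)[1]
    if adjust_last_day and input_is_last_day:
        day = last_day_target
    else:
        # Keep the day of month, clamped to the length of the target month.
        day = min(d.day, last_day_target)
    return date(year, month, day)
```

For the example in this commit, `add_months(date(2019, 2, 28), 1, adjust_last_day=True)` gives 2019-03-31, while the default call gives 2019-03-28.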
Marcelo Vanzin	2ddeff97d7	[SPARK-27963][CORE] Allow dynamic allocation without a shuffle service. This change adds a new option that enables dynamic allocation without the need for a shuffle service. This mode works by tracking which stages generate shuffle files, and keeping executors that generate data for those shuffles alive while the jobs that use them are active. A separate timeout is also added for shuffle data; so that executors that hold shuffle data can use a separate timeout before being removed because of being idle. This allows the shuffle data to be kept around in case it is needed by some new job, or allow users to be more aggressive in timing out executors that don't have shuffle data in active use. The code also hooks up to the context cleaner so that shuffles that are garbage collected are detected, and the respective executors not held unnecessarily. Testing done with added unit tests, and also with TPC-DS workloads on YARN without a shuffle service. Closes #24817 from vanzin/SPARK-27963. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 16:37:38 -07:00
Thomas Graves	43d68cd4ff	[SPARK-27959][YARN] Change YARN resource configs to use .amount ## What changes were proposed in this pull request? we are adding in generic resource support into spark where we have suffix for the amount of the resource so that we could support other configs. Spark on yarn already had added configs to request resources via the configs spark.yarn.{executor/driver/am}.resource=<some amount>, where the <some amount> is value and unit together. We should change those configs to have a `.amount` suffix on them to match the spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes so if we want the user to be able to pass those from spark at some point having a suffix makes sense. it would allow for a spark.yarn.{executor/driver/am}.resource.{resource}.tag= type config. ## How was this patch tested? Tested via unit tests and manually on a yarn 3.x cluster with GPU resources configured on. Closes #24989 from tgravescs/SPARK-27959-yarn-resourceconfigs. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 10:56:07 -07:00
Maxim Gekk	f241fc7776	[SPARK-28389][SQL] Use Java 8 API in add_months ## What changes were proposed in this pull request? In the PR, I propose to use the `plusMonths()` method of `LocalDate` to add months to a date. This method adds the specified amount to the months field of `LocalDate` in three steps: 1. Add the input months to the month-of-year field 2. Check if the resulting date would be invalid 3. Adjust the day-of-month to the last valid day if necessary The difference between current behavior and propose one is in handling the last day of month in the original date. For example, adding 1 month to `2019-02-28` will produce `2019-03-28` comparing to the current implementation where the result is `2019-03-31`. The proposed behavior is implemented in MySQL and PostgreSQL. ## How was this patch tested? By existing test suites `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`. Closes #25153 from MaxGekk/add-months. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-15 20:49:39 +08:00
Peter Toth	1a26126d8c	[SPARK-28228][SQL] Fix substitution order of nested WITH clauses ## What changes were proposed in this pull request? This PR adds compatible handling of a `WITH` clause within another `WITH` clause. Before this PR these queries returned `1` while after this PR they return `2` as PostgreSQL does: ``` WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t ) SELECT * FROM t2 ``` ``` WITH t AS (SELECT 1) SELECT ( WITH t AS (SELECT 2) SELECT * FROM t ) ``` As this is an incompatible change, the PR introduces the `spark.sql.legacy.cte.substitution.enabled` flag as an option to restore old behaviour. ## How was this patch tested? Added new UTs. Closes #25029 from peter-toth/SPARK-28228. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-12 07:17:33 -07:00
Gabor Somogyi	f83000597f	[SPARK-23472][CORE] Add defaultJavaOptions for driver and executor. ## What changes were proposed in this pull request? This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options). ## How was this patch tested? Existing + additional unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24804 from gaborgsomogyi/SPARK-23472. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:37:26 -07:00
Gabor Somogyi	d47c219f94	[SPARK-28055][SS][DSTREAMS] Add delegation token custom AdminClient configurations. ## What changes were proposed in this pull request? At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations. In this PR I've added a similar possibility to `AdminClient`. ## How was this patch tested? Existing + added unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24875 from gaborgsomogyi/SPARK-28055. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:36:24 -07:00
Zhu, Lipeng	b89c3de1a4	[SPARK-28310][SQL] Support (FIRST_VALUE\|LAST_VALUE)(expr[ (IGNORE\|RESPECT) NULLS]?) syntax ## What changes were proposed in this pull request? According to the ANSI SQL 2011 ![image](https://user-images.githubusercontent.com/698621/60855327-d01c6900-a235-11e9-9a1b-d438615a4673.png) Below are Teradata, Oracle, Redshift which already support this grammar. - Teradata - https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA - Oracle - https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC - Redshift – https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html - Postgresql didn't implement this grammar: https://www.postgresql.org/docs/devel/functions-window.html >The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead, lag, first_value, last_value, and nth_value. This is not implemented in PostgreSQL: the behavior is always the same as the standard's default, namely RESPECT NULLS. ## How was this patch tested? UT. Closes #25082 from lipzhu/SPARK-28310. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-10 07:41:05 -07:00
Dongjoon Hyun	bbc2be4f42	[SPARK-28294][CORE] Support `spark.history.fs.cleaner.maxNum` configuration ## What changes were proposed in this pull request? Up to now, Apache Spark maintains the given event log directory by time policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues. 1. Some file systems have a limit on the maximum number of files in a single directory. For example, HDFS `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default. https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml 2. Spark is sometimes unable to clean up some old log files due to permission issues (mainly, security policy). To handle both (1) and (2), this PR aims to support an additional policy configuration for the maximum number of files in the event log directory, `spark.history.fs.cleaner.maxNum`. Spark will try to keep the number of files in the event log directory according to this policy. ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #25072 from dongjoon-hyun/SPARK-28294. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-10 07:19:47 -07:00
Yuming Wang	90c64ea419	[SPARK-28267][DOC] Update building-spark.md(support build with hadoop-3.2) ## What changes were proposed in this pull request? Since [SPARK-23710](https://issues.apache.org/jira/browse/SPARK-23710), Hadoop 3.x can support Hive. This PR add _build with `hadoop-3.2`_ to building-spark.md. ## How was this patch tested? manual tests ``` cd docs SKIP_API=1 jekyll build ``` ![image](https://user-images.githubusercontent.com/5399861/60942057-cf5a0480-a313-11e9-9534-4765520e799f.png) Closes #25063 from wangyum/SPARK-28267. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-10 08:51:08 -05:00
HyukjinKwon	cdbc30213b	[SPARK-28226][PYTHON] Document Pandas UDF mapInPandas ## What changes were proposed in this pull request? This PR proposes to document `MAP_ITER` with `mapInPandas`. ## How was this patch tested? Manually checked the documentation. ![Screen Shot 2019-07-05 at 1 52 30 PM](https://user-images.githubusercontent.com/6477701/60698812-26cf2d80-9f2c-11e9-8295-9c00c28f5569.png) ![Screen Shot 2019-07-05 at 1 48 53 PM](https://user-images.githubusercontent.com/6477701/60698710-ac061280-9f2b-11e9-8521-a4f361207e06.png) Closes #25025 from HyukjinKwon/SPARK-28226. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-07 09:07:52 +09:00
Yuming Wang	4caf81a48f	[SPARK-28093][SQL][FOLLOW-UP] Update trim function behavior changes to migration guide ## What changes were proposed in this pull request? We changed our non-standard syntax for `trim` function in #24902 from `TRIM(trimStr, str)` to `TRIM(str, trimStr)` to be compatible with other databases. This pr update the migration guide. I checked various databases(PostgreSQL, Teradata, Vertica, Oracle, DB2, SQL Server 2019, MySQL, Hive, Presto) and it seems that only PostgreSQL and Presto support this non-standard syntax. PostgreSQL: ```sql postgres=# select substr(version(), 0, 16), trim('yxTomxx', 'x'); substr \| btrim -----------------+------- PostgreSQL 11.3 \| yxTom (1 row) ``` Presto: ```sql presto> select trim('yxTomxx', 'x'); _col0 ------- yxTom (1 row) ``` ## How was this patch tested? manual tests Closes #24948 from wangyum/SPARK-28093-FOLLOW-UP-DOCS. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-05 17:55:54 -07:00
zhengruifeng	443b158182	[SPARK-26970][DOC][FOLLOWUP] link doc & example of Interaction ## What changes were proposed in this pull request? link doc & example of Interaction ## How was this patch tested? existing tests Closes #25027 from zhengruifeng/py_doc_interaction. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-02 17:30:57 -05:00
gengjiaan	832ff87918	[SPARK-28077][SQL] Support ANSI SQL OVERLAY function. ## What changes were proposed in this pull request? The `OVERLAY` function is part of ANSI SQL. For example: ``` SELECT OVERLAY('abcdef' PLACING '45' FROM 4); SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5); SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0); SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4); ``` The results of the above four SQL statements are: ``` abc45f yabadaba yabadabadoo bubba ``` Note: If the input string is null, then the result is null too. Several mainstream databases support this syntax. PostgreSQL: https://www.postgresql.org/docs/11/functions-string.html Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/OVERLAY.htm?zoom_highlight=overlay Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/UTL_RAW.html#GUID-342E37E7-FE43-4CE1-A0E9-7DAABD000369 DB2: https://www.ibm.com/support/knowledgecenter/SSGMCP_5.3.0/com.ibm.cics.rexx.doc/rexx/overlay.html Below is the output of this PR in my production environment.
``` spark-sql> SELECT OVERLAY('abcdef' PLACING '45' FROM 4); abc45f Time taken: 6.385 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5); yabadaba Time taken: 0.191 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0); yabadabadoo Time taken: 0.186 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4); bubba Time taken: 0.151 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING '45' FROM 4); NULL Time taken: 0.22 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'daba' FROM 5); NULL Time taken: 0.157 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'daba' FROM 5 FOR 0); NULL Time taken: 0.254 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'ubb' FROM 2 FOR 4); NULL Time taken: 0.159 seconds, Fetched 1 row(s) ``` ## How was this patch tested? Existing UTs and new UTs. Closes #24918 from beliefer/ansi-sql-overlay. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-06-28 19:13:08 +09:00
Josh Rosen	d83f84a122	[SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles ## What changes were proposed in this pull request? Spark's `InMemoryFileIndex` contains two places where `FileNotFound` exceptions are caught and logged as warnings (during [directory listing](`bcd3b61c4b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (L274)`) and [block location lookup](`bcd3b61c4b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (L333)`)). This logic was added in #15153 and #21408. I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-27676)). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's still beneficial to throw exceptions and fail-fast on the subset of inconsistencies / races that we _can_ detect because that increases the likelihood that an end user will notice the problem and investigate further. There may be some cases where users _do_ want to ignore missing files, but I think that should be an opt-in behavior via the existing `spark.sql.files.ignoreMissingFiles` flag (the current behavior is itself race-prone because a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that _does_ respect `ignoreMissingFiles`)). This PR updates `InMemoryFileIndex` to guard the log-and-ignore-FileNotFoundException behind the existing `spark.sql.files.ignoreMissingFiles` flag. Note: this is a change of default behavior, so I think it needs to be mentioned in release notes.
## How was this patch tested? New unit tests to simulate file-deletion race conditions, tested with both values of the `ignoreMissingFiles` flag. Closes #24668 from JoshRosen/SPARK-27676. Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@stripe.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-26 09:11:28 +09:00
Gabor Somogyi	1a915bf20f	[MINOR][SQL][DOCS] failOnDataLoss has effect on batch queries so fix the doc ## What changes were proposed in this pull request? According to the [Kafka integration document](https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html) `failOnDataLoss` has effect only on streaming queries. While I was implementing the DSv2 Kafka batch sources I've realized it's not true. This feature is covered in [KafkaDontFailOnDataLossSuite](`54da3bbfb2/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala (L180)`). In this PR I've updated the doc to reflect this behavior. ## How was this patch tested? ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24932 from gaborgsomogyi/failOnDataLoss. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-23 19:23:57 -05:00
Dongjoon Hyun	47f54b1ec7	[SPARK-28118][CORE] Add `spark.eventLog.compression.codec` configuration ## What changes were proposed in this pull request? Event logs are different from the other data in terms of the lifetime. It would be great to have a new configuration for Spark event log compression like `spark.eventLog.compression.codec` . This PR adds this new configuration as an optional configuration. So, if `spark.eventLog.compression.codec` is not given, `spark.io.compression.codec` will be used. ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24921 from dongjoon-hyun/SPARK-28118. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-06-21 00:43:38 +00:00
Yuming Wang	fe5145ede2	[SPARK-28109][SQL] Fix TRIM(type trimStr FROM str) returns incorrect value ## What changes were proposed in this pull request? [SPARK-28093](https://issues.apache.org/jira/browse/SPARK-28093) fixed `TRIM/LTRIM/RTRIM('str', 'trimStr')` returns an incorrect value, but that fix introduced a new bug, `TRIM(type trimStr FROM str)` returns an incorrect value. This pr fix this issue. ## How was this patch tested? unit tests and manual tests: Before this PR: ```sql spark-sql> SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); Tom z spark-sql> SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); bar spark-sql> SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); test xyz spark-sql> SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); testxyz spark-sql> SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); XxyLAST WORD spark-sql> SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); test xy spark-sql> SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); xyztest spark-sql> SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 'TURNERyxXxy'); TURNERyxX ``` After this PR: ```sql spark-sql> SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); Tom Tom spark-sql> SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); bar bar spark-sql> SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); test test spark-sql> SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); testxyz testxyz spark-sql> SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); XxyLAST WORD XxyLAST WORD spark-sql> SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); test test spark-sql> SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); xyztest xyztest spark-sql> SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 
'TURNERyxXxy'); TURNERyxX TURNERyxX ``` And PostgreSQL: ```sql postgres=# SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); btrim \| btrim -------+------- Tom \| Tom (1 row) postgres=# SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); btrim \| btrim -------+------- bar \| bar (1 row) postgres=# SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); ltrim \| ltrim -------+------- test \| test (1 row) postgres=# SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); ltrim \| ltrim ---------+--------- testxyz \| testxyz (1 row) postgres=# SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); ltrim \| ltrim --------------+-------------- XxyLAST WORD \| XxyLAST WORD (1 row) postgres=# SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); rtrim \| rtrim -------+------- test \| test (1 row) postgres=# SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); rtrim \| rtrim ---------+--------- xyztest \| xyztest (1 row) postgres=# SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 'TURNERyxXxy'); rtrim \| rtrim -----------+----------- TURNERyxX \| TURNERyxX (1 row) ``` Closes #24911 from wangyum/SPARK-28109. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-19 12:47:18 -07:00
Xiangrui Meng	1b2448bc10	[SPARK-28056][PYTHON] add doc for SCALAR_ITER Pandas UDF ## What changes were proposed in this pull request? Add docs for `SCALAR_ITER` Pandas UDF. cc: WeichenXu123 HyukjinKwon ## How was this patch tested? Tested example code manually. Closes #24897 from mengxr/SPARK-28056. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-06-17 20:51:36 -07:00
Bryan Cutler	90f80395af	[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 ## What changes were proposed in this pull request? This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed. ## How was this patch tested? Existing Tests Closes #24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-18 09:10:58 +09:00
Mellacheruvu Sandeep	b7b4452553	[SPARK-24898][DOC] Adding spark.checkpoint.compress to the docs ## What changes were proposed in this pull request? Adding spark.checkpoint.compress configuration parameter to the documentation ![](https://user-images.githubusercontent.com/3538013/59580409-a7013080-90ee-11e9-9b2c-3d29015f597e.png) ## How was this patch tested? Checked the Jekyll HTML docs locally. Also validated the HTML for any issues. Closes #24883 from sandeepvja/SPARK-24898. Authored-by: Mellacheruvu Sandeep <mellacheruvu.sandeep@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-16 22:54:08 -07:00
Takuya UESHIN	5ae1a6bf0d	[SPARK-28052][SQL] Make `ArrayExists` follow the three-valued boolean logic. ## What changes were proposed in this pull request? Currently `ArrayExists` always returns boolean values (if the arguments are not `null`), but it should follow the three-valued boolean logic: - `true` if the predicate holds at least one `true` - otherwise, `null` if the predicate holds `null` - otherwise, `false` This behavior change is made to match Postgres' equivalent function `ANY/SOME (array)`'s behavior: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174 ## How was this patch tested? Modified tests and existing tests. Closes #24873 from ueshin/issues/SPARK-28052/fix_exists. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-15 10:48:06 -07:00
Sean Owen	15462e1a8f	[SPARK-28004][UI] Update jquery to 3.4.1 ## What changes were proposed in this pull request? We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. Update jquery to 3.4.1; update jquery blockUI and mustache to latest ## How was this patch tested? Manual testing of docs build (except R docs), worker/master UI, spark application UI. Note: this really doesn't guarantee it works, as our tests can't test javascript, and this is merely anecdotal testing, although I clicked about every link I could find. There's a risk this breaks a minor part of the UI; it does seem to work fine in the main. Closes #24843 from srowen/SPARK-28004. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-14 22:19:20 -07:00
Yesheng Ma	3ddc77d9ac	[SPARK-21136][SQL] Disallow FROM-only statements and show better warnings for Hive-style single-from statements Current Spark SQL parser can have pretty confusing error messages when parsing an incorrect SELECT SQL statement. The proposed fix has the following effect. BEFORE: ``` spark-sql> SELECT * FROM test WHERE x NOT NULL; Error in query: mismatched input 'FROM' expecting {<EOF>, 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LATERAL', 'LIMIT', 'ORDER', 'MINUS', 'SORT', 'UNION', 'WHERE', 'WINDOW'}(line 1, pos 9) == SQL == SELECT * FROM test WHERE x NOT NULL ---------^^^ ``` where in fact the error message should be hinted to be near `NOT NULL`. AFTER: ``` spark-sql> SELECT * FROM test WHERE x NOT NULL; Error in query: mismatched input 'NOT' expecting {<EOF>, 'AND', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LIMIT', 'OR', 'ORDER', 'MINUS', 'SORT', 'UNION', 'WINDOW'}(line 1, pos 27) == SQL == SELECT * FROM test WHERE x NOT NULL ---------------------------^^^ ``` In fact, this problem is brought by some problematic Spark SQL grammar. There are two kinds of SELECT statements that are supported by Hive (and thereby supported in SparkSQL): * `FROM table SELECT blahblah SELECT blahblah` * `SELECT blah FROM table` Reference [HiveQL single-from stmt grammar](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g) It is fine when these two SELECT syntaxes are supported separately. However, since we are currently supporting these two kinds of syntaxes in a single ANTLR rule, this can be problematic and therefore leading to confusing parser errors. This is because when a SELECT clause was parsed, it can't tell whether the following FROM clause actually belongs to it or is just the beginning of a new `FROM table SELECT *` statement. ## What changes were proposed in this pull request? 1. Modify ANTLR grammar to fix the above-mentioned problem. 
This fix is important because the previous problematic grammar does affect a lot of real-world queries. Due to the previous problematic and messy grammar, we refactored the grammar related to `querySpecification`. 2. Modify `AstBuilder` to have separate visitors for `SELECT ... FROM ...` and `FROM ... SELECT ...` statements. 3. Drop the `FROM table` statement, which is supported by accident and is actually parsed in the wrong code path. Both Hive and Presto do not support this syntax. ## How was this patch tested? Existing UTs and new UTs. Closes #24809 from yeshengm/parser-refactor. Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-06-11 18:30:56 -07:00
Zhu, Lipeng	3b37bfde2a	[SPARK-27949][SQL] Support SUBSTRING(str FROM n1 [FOR n2]) syntax ## What changes were proposed in this pull request? Currently, the usage of function `substr/substring` is `substring(string_expression, n1 [,n2])`. But ANSI SQL defines the pattern for substr/substring as `SUBSTRING(str FROM n1 [FOR n2])`. This gap causes some inconvenience when switching to Spark SQL. - ANSI SQL-92: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt Below are the main DB engines that support the ANSI standard for substring. - PostgreSQL https://www.postgresql.org/docs/9.1/functions-string.html - MySQL https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring - Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html - Teradata https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/XnePye0Cwexw6Pny_qnxVA Oracle, SQL Server, Hive, Presto don't have this additional syntax. ## How was this patch tested? Pass the Jenkins with the updated test cases. Closes #24802 from lipzhu/SPARK-27949. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-10 09:05:10 -07:00
Yuming Wang	2926890ffb	[SPARK-27970][SQL] Support Hive 3.0 metastore ## What changes were proposed in this pull request? It seems that some users are using Hive 3.0.0. This pr makes it support Hive 3.0 metastore. ## How was this patch tested? unit tests Closes #24688 from wangyum/SPARK-26145. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-07 15:24:07 -07:00
Thomas Graves	d30284b5a5	[SPARK-27760][CORE] Spark resources - change user resource config from .count to .amount ## What changes were proposed in this pull request? Change the resource config spark.{executor/driver}.resource.{resourceName}.count to .amount to allow future usage of containing both a count and a unit. Right now we only support counts - # of gpus for instance, but in the future we may want to support units for things like memory - 25G. I think making the user only have to specify a single config .amount is better than making them specify 2 separate configs of a .count and then a .unit. Change it now since it's a user-facing config. Amount also matches how the spark on yarn configs are set up. ## How was this patch tested? Unit tests and manually verified on yarn and local cluster mode Closes #24810 from tgravescs/SPARK-27760-amount. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-06-06 14:16:05 -05:00
Jules Damji	b71abd654d	[MINOR][DOC] Avro data source documentation change ## What changes were proposed in this pull request? This is a minor documentation change whereby the https://spark.apache.org/docs/latest/sql-data-sources-avro.html mentions "The date type and naming of record fields should match the input Avro data or Catalyst data," The term Catalyst data is confusing. It should instead say, Spark's internal data type such as String Type or IntegerType. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) There are no code changes; only doc changes. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24787 from dmatrix/br-orc-ds.doc.changes. Authored-by: Jules Damji <dmatrix@comcast.net> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-04 16:17:53 -07:00
Luca Canali	adf72e26d9	[SPARK-27773][FOLLOWUP][DOC] Add numCaughtExceptions metric to monitoring doc ## What changes were proposed in this pull request? SPARK-27773 has introduced a new metric (counter) numCaughtExceptions to the Spark Dropwizard monitoring system. This PR adds an entry in the monitoring documentation to document this. Closes #24790 from LucaCanali/addDocFollowingSPARK27773. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-04 08:40:32 -07:00
HyukjinKwon	d1f3c994c7	[SPARK-27942][DOCS][PYTHON] Note that Python 2.7 is deprecated in Spark documentation ## What changes were proposed in this pull request? This PR adds deprecation notes in Spark documentation. ## How was this patch tested? git grep -r "python 2.6" git grep -r "python 2.6" git grep -r "python 2.7" git grep -r "python 2.7" Closes #24789 from HyukjinKwon/SPARK-27942. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-04 07:59:25 -07:00
HyukjinKwon	db48da87f0	[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations ## What changes were proposed in this pull request? `spark.sql.execution.arrow.enabled` was added when we add PySpark arrow optimization. Later, in the current master, SparkR arrow optimization was added and it's controlled by the same configuration `spark.sql.execution.arrow.enabled`. There look two issues about this: 1. `spark.sql.execution.arrow.enabled` in PySpark was added from 2.3.0 whereas SparkR optimization was added 3.0.0. The stability is different so it's problematic when we change the default value for one of both optimization first. 2. Suppose users want to share some JVM by PySpark and SparkR. They are currently forced to use the optimization for all or none if the configuration is set globally. This PR proposes two separate configuration groups for PySpark and SparkR about Arrow optimization: - Deprecate `spark.sql.execution.arrow.enabled` - Add `spark.sql.execution.arrow.pyspark.enabled` (fallback to `spark.sql.execution.arrow.enabled`) - Add `spark.sql.execution.arrow.sparkr.enabled` - Deprecate `spark.sql.execution.arrow.fallback.enabled` - Add `spark.sql.execution.arrow.pyspark.fallback.enabled ` (fallback to `spark.sql.execution.arrow.fallback.enabled`) Note that `spark.sql.execution.arrow.maxRecordsPerBatch` is used within JVM side for both. Note that `spark.sql.execution.arrow.fallback.enabled` was added due to behaviour change. We don't need it in SparkR - SparkR side has the automatic fallback. ## How was this patch tested? Manually tested and some unittests were added. Closes #24700 from HyukjinKwon/separate-sparkr-arrow. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-03 10:01:37 +09:00
gengjiaan	8feb80ad86	[SPARK-27811][CORE][DOCS] Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. ## What changes were proposed in this pull request? The docs for `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead` are somewhat ambiguous. For example, the original docs for `spark.driver.memoryOverhead` start with `The amount of off-heap memory to be allocated per driver in cluster mode`. But `MemoryManager` also manages a memory area named off-heap, which is used to allocate memory in tungsten mode, so the description of `spark.driver.memoryOverhead` is easily confused with it. `spark.executor.memoryOverhead` has the same problem. ## How was this patch tested? Existing UT. Closes #24671 from beliefer/improve-docs-of-overhead. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-01 08:19:50 -05:00
Thomas Graves	1277f8fa92	[SPARK-27362][K8S] Resource Scheduling support for k8s ## What changes were proposed in this pull request? Add the ability to map the spark resource configs spark.{executor/driver}.resource.{resourceName} to the kubernetes Container builder so that we request resources (GPUs, FPGAs, etc.) from kubernetes. Note that the spark configs will overwrite any resource configs users put into a pod template. I added a generic vendor config which is only used by kubernetes right now. I intentionally didn't put it into the kubernetes config namespace just to avoid adding more config prefixes. I will add more documentation for this under jira SPARK-27492. I think it will be easier to do it all at once to get a cohesive story. ## How was this patch tested? Unit tests and manual testing on a k8s cluster. Closes #24703 from tgravescs/SPARK-27362. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-31 15:26:14 -05:00
Marcelo Vanzin	09ed64d795	[SPARK-27868][CORE] Better default value and documentation for socket server backlog. First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors. Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies. So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem. Closes #24732 from vanzin/SPARK-27868. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-29 14:56:36 -07:00
Yuanjian Li	8949bc7a3c	[SPARK-27665][CORE] Split fetch shuffle blocks protocol from OpenBlocks ## What changes were proposed in this pull request? In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks protocol to describe the fetch request for shuffle blocks, which makes extension work for shuffle fetching like #19788 and #24110 very awkward. In this PR, we split the fetch request for shuffle blocks out of OpenBlocks into a new message named FetchShuffleBlocks. It is loosely bound to ShuffleBlockId and can easily be extended by adding new fields to this protocol. ## How was this patch tested? Existing and newly added UT. Closes #24565 from xuanyuanking/SPARK-27665. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-27 22:19:31 +08:00
DB Tsai	a12de29c1a	[SPARK-27838][SQL] Support user provided non-nullable avro schema for nullable catalyst schema without any null record ## What changes were proposed in this pull request? When the data is read from the sources, the catalyst schema is always nullable. Since Avro uses Union type to represent nullable, when any non-nullable avro file is read and then written out, the schema will always be changed. This PR provides a solution for users to keep the Avro schema without being forced to use Union type. ## How was this patch tested? One test is added. Closes #24682 from dbtsai/avroNull. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-05-24 21:47:14 +00:00
HyukjinKwon	cc0b9d41cd	[MINOR][DOCS][R] Use actual version in SparkR Arrow guide for copy-and-paste ## What changes were proposed in this pull request? To address https://github.com/apache/spark/pull/24506#discussion_r280964509 ## How was this patch tested? N/A Closes #24701 from HyukjinKwon/minor-arrow-r-doc. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-24 10:38:26 -07:00
Gabor Somogyi	4e7908f2e7	[MINOR][DOC] ForeachBatch doc fix. ## What changes were proposed in this pull request? ForeachBatch doc is wrongly formatted. This PR formats it. ## How was this patch tested? ``` cd docs SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24698 from gaborgsomogyi/foreachbatchdoc. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-25 00:03:59 +09:00
Thomas Graves	74e5e41eeb	[SPARK-27488][CORE] Driver interface to support GPU resources ## What changes were proposed in this pull request? Added the driver functionality to get the resources. The user interface is: SparkContext.resources - I called it this to match the TaskContext.resources api proposed in the other PR. Originally it was going to be called SparkContext.getResources but changed to be consistent, if people have strong feelings I can change it. There are 2 ways the driver can discover what resources it has. 1) user specifies a discoveryScript, this is similar to the executors and is meant for yarn and k8s where they don't tell you what you were allocated but you are running in isolated environment. 2) read the config spark.driver.resource.resourceName.addresses. The config is meant to be used with standalone mode where the Worker will have to assign what GPU addresses the Driver is allowed to use by setting that config. When the user runs a spark application, if they want the driver to have GPU's they would specify the conf spark.driver.resource.gpu.count=X where x is the number they want. If they are running on yarn or k8s they will also have to specify the discoveryScript as specified above, if they are on standalone mode and cluster is setup properly they wouldn't have to specify anything else. We could potentially get rid of the spark.driver.resources.gpu.addresses config which is really meant to be an internal config for worker to set if the standalone mode Worker wanted to write a discoveryScript out and set that for the user. I'll wait for the jira that implements that to decide if we can remove. - This PR also has changes to be consistent about using resourceName everywhere. - change the config names from POSTFIX to SUFFIX to be more consistent with other areas in Spark - Moved the config checks around a bit since now used by both executor and driver. 
Note those might overlap a bit with https://github.com/apache/spark/pull/24374 so we will have to figure out which one should go in first. ## How was this patch tested? Unit tests and manually test the interface. Closes #24615 from tgravescs/SPARK-27488. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-05-23 11:46:13 -07:00
Stavros Kontopoulos	5e74570c8f	[SPARK-23153][K8S] Support client dependencies with a Hadoop Compatible File System ## What changes were proposed in this pull request? - solves the current issue with --packages in cluster mode (there is no ticket for it). Also note of some [issues](https://issues.apache.org/jira/browse/SPARK-22657) of the past here when hadoop libs are used at the spark submit side. - supports spark.jars, spark.files, app jar. It works as follows: Spark submit uploads the deps to the HCFS. Then the driver serves the deps via the Spark file server. No hcfs uris are propagated. The related design document is [here](https://docs.google.com/document/d/1peg_qVhLaAl4weo5C51jQicPwLclApBsdR1To2fgc48/edit). the next option to add is the RSS but has to be improved given the discussion in the past about it (Spark 2.3). ## How was this patch tested? - Run integration test suite. - Run an example using S3: ``` ./bin/spark-submit \ ... --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.memory=1G \ --conf spark.kubernetes.namespace=spark \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \ --conf spark.driver.memory=1G \ --conf spark.executor.instances=2 \ --conf spark.sql.streaming.metricsEnabled=true \ --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.container.image=skonto/spark:k8s-3.0.0 \ --conf spark.kubernetes.file.upload.path=s3a://fdp-stavros-test \ --conf spark.hadoop.fs.s3a.access.key=... \ --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ --conf spark.hadoop.fs.s3a.fast.upload=true \ --conf spark.kubernetes.executor.deleteOnTermination=false \ --conf spark.hadoop.fs.s3a.secret.key=... 
\ --conf spark.files=client:///...resolv.conf \ file:///my.jar ** ``` Added integration tests based on [Ceph nano](https://github.com/ceph/cn). Looks very [active](http://www.sebastien-han.fr/blog/2019/02/24/Ceph-nano-is-getting-better-and-better/). Unfortunately minio needs hadoop >= 2.8. Closes #23546 from skonto/support-client-deps. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Erik Erlandson <eerlands@redhat.com>	2019-05-22 16:15:42 -07:00
Sean Owen	6c5827c723	[SPARK-27794][R][DOCS] Use https URL for CRAN repo ## What changes were proposed in this pull request? Use https URL for CRAN repo (and for a Scala download in a Dockerfile) ## How was this patch tested? Existing tests. Closes #24664 from srowen/SPARK-27794. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-22 14:28:21 -07:00
Wenchen Fan	03c9e8adee	[SPARK-24586][SQL] Upcast should not allow casting from string to other types ## What changes were proposed in this pull request? When turning a Dataset into another Dataset, Spark will up cast the fields in the original Dataset to the type of the corresponding fields in the target Dataset. However, the current upcast behavior is a little weird: we don't allow up casting from string to numeric, but we do allow non-numeric types as the target, like boolean, date, etc. As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throws an NPE during execution. The motivation of the up cast is to prevent things like a runtime NPE, so it's more reasonable to make up cast stricter. This PR does 2 things: 1. rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types 2. remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast` Note that the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here. ## How was this patch tested? new tests Closes #21586 from cloud-fan/cast. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-22 11:35:51 +08:00
Yuming Wang	6cd1efd0ae	[SPARK-27737][SQL] Upgrade to Hive 2.3.5 for Hive Metastore Client and Hadoop-3.2 profile ## What changes were proposed in this pull request? This PR aims to upgrade to Hive 2.3.5 for Hive Metastore Client and Hadoop-3.2 profile. Release Notes - Hive - Version 2.3.5 - [[HIVE-21536](https://issues.apache.org/jira/browse/HIVE-21536)] - Backport HIVE-17764 to branch-2.3 - [[HIVE-21585](https://issues.apache.org/jira/browse/HIVE-21585)] - Upgrade branch-2.3 to ORC 1.3.4 - [[HIVE-21639](https://issues.apache.org/jira/browse/HIVE-21639)] - Spark test failed since HIVE-10632 - [[HIVE-21680](https://issues.apache.org/jira/browse/HIVE-21680)] - Backport HIVE-17644 to branch-2 and branch-2.3 https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345394&styleName=Text&projectId=12310843 ## How was this patch tested? This PR is tested in two ways. - Pass the Jenkins with the default configuration for `Hive Metastore Client` testing. - Pass the Jenkins with `test-hadoop3.2` configuration for `Hadoop 3.2` testing. Closes #24620 from wangyum/SPARK-27737. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-22 10:24:17 +09:00
williamwong	8442d94fb1	[SPARK-27248][SQL] `refreshTable` should recreate cache with same cache name and storage level If we refresh a cached table, the table cache will first be uncached and then recached (lazily). Currently, the logic is embedded in the CatalogImpl.refreshTable method. The current implementation does not preserve the cache name and storage level. As a result, the cache name and storage level could be changed after a REFRESH. IMHO, this is not what a user would expect. I would like to fix this behavior by first saving the cache name and storage level for recaching the table. Two unit tests are added to make sure the cache name is unchanged upon table refresh. Before applying this patch, the test created for the qualified case would fail. Closes #24221 from William1104/feature/SPARK-27248. Lead-authored-by: williamwong <william1104@gmail.com> Co-authored-by: William Wong <william1104@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-21 11:37:16 -07:00
Sean Owen	eed6de1a65	[MINOR][DOCS] Tighten up some key links to the project and download pages to use HTTPS ## What changes were proposed in this pull request? Tighten up some key links to the project and download pages to use HTTPS ## How was this patch tested? N/A Closes #24665 from srowen/HTTPSURLs. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-21 10:56:42 -07:00
Arun Mahadevan	1a8c09334d	[SPARK-27754][K8S] Introduce additional config (spark.kubernetes.driver.request.cores) for driver request cores for spark on k8s ## What changes were proposed in this pull request? Spark on k8s supports config for specifying the executor cpu requests (spark.kubernetes.executor.request.cores) but a similar config is missing for the driver. Instead, currently `spark.driver.cores` value is used for integer value. Although `pod spec` can have `cpu` for the fine-grained control like the following, this PR proposes additional configuration `spark.kubernetes.driver.request.cores` for driver request cores. ``` resources: requests: memory: "64Mi" cpu: "250m" ``` ## How was this patch tested? Unit tests Closes #24630 from arunmahadevan/SPARK-27754. Authored-by: Arun Mahadevan <arunm@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-18 21:28:46 -07:00
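As a rough illustration of SPARK-27754 above, the choice of the driver pod's CPU request can be modelled as a lookup with a fallback. The precedence shown (the new key wins when set, otherwise `spark.driver.cores` is used) and the default of `"1"` are assumptions for this sketch, and `driver_cpu_request` is a hypothetical helper, not a Spark function.

```python
# Hypothetical helper modelling how the driver CPU request could be chosen.
# Assumption: spark.kubernetes.driver.request.cores takes precedence when set;
# otherwise the value of spark.driver.cores is used (assumed default "1").
def driver_cpu_request(conf):
    """Return the CPU request string for the driver pod."""
    fine_grained = conf.get("spark.kubernetes.driver.request.cores")
    if fine_grained is not None:
        return fine_grained  # may be fractional, e.g. "250m"
    return conf.get("spark.driver.cores", "1")


with_request = driver_cpu_request({
    "spark.driver.cores": "2",
    "spark.kubernetes.driver.request.cores": "250m",
})
without_request = driver_cpu_request({"spark.driver.cores": "2"})
```

The point of the separate key in this model is that it can carry a fractional value such as `250m`, matching the `cpu: "250m"` pod spec fragment quoted in the commit message.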
Gabor Somogyi	efa303581a	[SPARK-27687][SS] Rename Kafka consumer cache capacity conf and document caching ## What changes were proposed in this pull request? Kafka-related Spark parameters have to start with `spark.kafka.` and not with `spark.sql.`. Because of this I've renamed `spark.sql.kafkaConsumerCache.capacity`. Since Kafka consumer caching was not documented, I've added documentation for it as well. ## How was this patch tested? Existing + added unit test. ``` cd docs SKIP_API=1 jekyll build ``` and manual webpage check. Closes #24590 from gaborgsomogyi/SPARK-27687. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-15 10:42:09 -07:00


2519 commits