ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Hyukjin Kwon	3d158f9c91	[SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation ### What changes were proposed in this pull request? This PR proposes to port Koalas documentation to PySpark documentation as its initial step. It ports almost as is except these differences: - Renamed import from `databricks.koalas` to `pyspark.pandas`. - Renamed `to_koalas` -> `to_pandas_on_spark` - Renamed `(Series\|DataFrame).koalas` -> `(Series\|DataFrame).pandas_on_spark` - Added a `ps_` prefix in the RST file names of Koalas documentation Other than that, - Excluded `python/docs/build/html` in linter - Fixed GA dependency installation ### Why are the changes needed? To document pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it adds new documentation. ### How was this patch tested? Manually built the docs and checked the output. Closes #32726 from HyukjinKwon/SPARK-35587. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 11:11:09 +09:00
fornaix	878527d9fa	[SPARK-35612][SQL] Support LZ4 compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support LZ4 compression in the ORC data source. ### Why are the changes needed? Apache ORC supports LZ4 compression, but we cannot set LZ4 compression in the ORC data source BEFORE ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") java.lang.IllegalArgumentException: Codec [lz4] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none, zstd. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") ``` ```bash $ orc-tools meta /tmp/lz4 Processing data file file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc [length: 222] Structure for file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc File Version: 0.12 with ORC_517 Rows: 10 Compression: LZ4 Compression size: 262144 Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 File Statistics: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 Stripes: Stripe: offset: 3 data: 7 rows: 10 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 222 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Pass the newly added test case. Closes #32751 from fornaix/spark-35612. Authored-by: fornaix <foxnaix@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-03 14:07:26 -07:00
itholic	e0bccc1831	[SPARK-35528][DOCS] Add more options at Data Source Options pages ### What changes were proposed in this pull request? This PR proposes adding more methods for setting data source options to the `Data Source Option` page for each data source. For example, the Data Source Option page for JSON looks as below: - Before <img width="322" alt="Screen Shot 2021-06-03 at 10 51 54 AM" src="https://user-images.githubusercontent.com/44108233/120574245-eb13aa00-c459-11eb-9f81-0b356023bcb5.png"> - After <img width="470" alt="Screen Shot 2021-06-03 at 10 52 21 AM" src="https://user-images.githubusercontent.com/44108233/120574253-ed760400-c459-11eb-9008-1f075e0b9267.png"> ### Why are the changes needed? To provide users various options when they set options for data source. ### Does this PR introduce _any_ user-facing change? Yes, now the document provides more methods for setting options than before, as in the screen captures above. ### How was this patch tested? Manually built the docs and checked them one by one. Closes #32757 from itholic/SPARK-35528. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 12:49:10 +09:00
itholic	48252bac95	[SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move the missing JDBC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JDBC data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "JDBC To Other Databases" page <img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png"> - Python ![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png) - Scala ![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png) - Java ![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png) ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32723 from itholic/SPARK-35583. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 14:21:16 +09:00
Max Gekk	a59063d544	[SPARK-35581][SQL] Support special datetime values in typed literals only ### What changes were proposed in this pull request? In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` - midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` Similarly, the following special date values are supported only in typed date literals: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? In the current implementation, Spark supports the special date/timestamp value in any input strings casted to dates/timestamps that leads to the following problems: - If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed. - The special values play the role of distributed non-deterministic functions though users might think of the values as constants. ### Does this PR introduce _any_ user-facing change? Yes but the probability should be small. ### How was this patch tested? 
By running existing test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql" $ build/sbt "test:testOnly *DateTimeUtilsSuite" ``` Closes #32714 from MaxGekk/remove-datetime-special-values. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-01 15:29:05 +03:00
itholic	73d4f67145	[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move CSV data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for CSV data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "CSV Files" page <img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png"> - Python <img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png"> - Scala <img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png"> - Java <img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png"> ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32658 from itholic/SPARK-35433. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:58:49 +09:00
Shiqi Sun	8c69e9cd94	[SPARK-35562][DOC] Fix docs about Kubernetes and Yarn Fixed some places in cluster-overview that are obsolete (i.e. not mentioning Kubernetes), and also fixed the Yarn spark-submit sample command in submitting-applications. ### What changes were proposed in this pull request? This is to fix the docs in "Cluster Overview" and "Submitting Applications" for places where Kubernetes is missing (mostly due to obsolete docs that haven't been updated) and where the Yarn sample spark-submit command is incorrectly written. ### Why are the changes needed? To help Spark users who use Kubernetes as the cluster manager get a correct picture when reading the "Cluster Overview" doc page. Also to make the sample spark-submit command for Yarn actually runnable in the "Submitting Applications" doc page, by removing the invalid comment after the line continuation char `\`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No test, as this is a doc fix. Closes #32701 from huskysun/doc-fix. Authored-by: Shiqi Sun <s.sun@salesforce.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-31 02:43:58 -07:00
Dongjoon Hyun	6c4b60f3b3	[SPARK-31168][BUILD] Upgrade Scala to 2.12.14 ### What changes were proposed in this pull request? This PR is the 4th try to upgrade Scala 2.12.x in order to see the feasibility. - https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum ) - https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya ) - https://github.com/apache/spark/pull/31223 (Upgrade Scala to 2.12.13, dongjoon-hyun ) Note that Scala 2.12.14 has the following fix for Apache Spark community. - Fix cyclic error in runtime reflection (protobuf), a regression that prevented Spark upgrading to 2.12.13 REQUIREMENTS: - [x] `silencer` library is released via https://github.com/ghik/silencer/pull/66 - [x] `genjavadoc` library is released via https://github.com/lightbend/genjavadoc/issues/282 ### Why are the changes needed? Apache Spark was stuck to 2.12.10 due to the regression in Scala 2.12.11/2.12.12/2.12.13. This will bring all the bug fixes. - https://github.com/scala/scala/releases/tag/v2.12.14 - https://github.com/scala/scala/releases/tag/v2.12.13 - https://github.com/scala/scala/releases/tag/v2.12.12 - https://github.com/scala/scala/releases/tag/v2.12.11 ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug-fixed version. ### How was this patch tested? Pass the CIs. Closes #32697 from dongjoon-hyun/SPARK-31168. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-30 16:08:13 -07:00
itholic	79a6b0cc8a	[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move text data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for text data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "Text Files" page <img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png"> - Python <img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png"> - Scala <img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png"> - Java <img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png"> ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32660 from itholic/SPARK-35509. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 17:12:49 +09:00
Jungtaek Lim	a57afd442c	[SPARK-29223][SQL][SS] New option to specify timestamp on all subscribing topic-partitions in Kafka source ### What changes were proposed in this pull request? This patch is a follow-up of SPARK-26848 (#23747). In SPARK-26848, we decided to open the possibility of letting end users set an individual timestamp per partition. But in many cases, specifying a timestamp represents the intention to go back to a specific timestamp and reprocess records, which should be applied to all topics and partitions. This patch proposes a way to set a global timestamp across the topic-partitions which the source is subscribing to, so that end users can easily set all offsets by a specific timestamp. To make the timestamp easier to configure, the new options only receive "a" timestamp for the start/end timestamp. New options introduced in this PR: * startingTimestamp * endingTimestamp Both options receive the timestamp as a string. There are priorities among the options regarding starting/ending offsets, as we will have three options for start offsets and another three options for end offsets. The priorities are as follows: * starting offsets: startingTimestamp -> startingOffsetsByTimestamp -> startingOffsets * ending offsets: endingTimestamp -> endingOffsetsByTimestamp -> endingOffsets ### Why are the changes needed? The existing option to specify a timestamp as an offset is quite verbose if there are a lot of partitions across topics. Suppose there are 100s of partitions in a topic; the JSON would have to repeat the same timestamp 100s of times. Also, the number of partitions can change, which requires either: * fixing the code if the JSON is statically created * introducing a dependency on the Kafka client and dealing with the Kafka API to craft the JSON programmatically Neither approach is "acceptable" if we're dealing with an ad-hoc query; nobody wants to write code more complicated than the query itself. 
Flink [provides the option](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-consumers-start-position-configuration) to specify a timestamp for all topic-partitions like this PR, and does not even provide the option to specify the timestamp per topic-partition. With this PR, end users are only required to provide a single timestamp value. End users no longer need to know the structure of a complicated JSON format. ### Does this PR introduce _any_ user-facing change? Yes, this PR introduces two new options, described in the section above. Doc changes are as follows: ![Screenshot 2021-05-21 12 01 02 PM](https://user-images.githubusercontent.com/1317309/119076244-3034e680-ba2d-11eb-8323-0e227932d2e5.png) ![Screenshot 2021-05-21 12 01 12 PM](https://user-images.githubusercontent.com/1317309/119076255-35923100-ba2d-11eb-9d79-538a7f9ee738.png) ![Screenshot 2021-05-21 12 01 24 PM](https://user-images.githubusercontent.com/1317309/119076264-39be4e80-ba2d-11eb-8265-ac158f55c360.png) ![Screenshot 2021-05-21 12 06 01 PM](https://user-images.githubusercontent.com/1317309/119076271-3d51d580-ba2d-11eb-98ea-35fd72b1bbfc.png) ### How was this patch tested? New UTs covering the new functionality. Also manually tested via simple batch & streaming queries. Closes #32609 from HeartSaVioR/SPARK-29223-v2. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-05-25 21:43:49 +09:00
Kousuke Saruta	6bd6e46aec	[SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0 ### What changes were proposed in this pull request? This PR upgrades Dropwizard metrics to 4.2.0. I also modified the corresponding links in `docs/monitoring.md`. ### Why are the changes needed? The latest version was released last week and it contains some improvements. https://github.com/dropwizard/metrics/releases/tag/v4.2.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Build succeeds and all the modified links are reachable. Closes #32628 from sarutak/upgrade-dropwizard-4.2.0. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-21 22:53:32 -07:00
itholic	d2bdd6595e	[SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move Parquet data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for Parquet data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "Parquet Files" page ![Screen Shot 2021-05-21 at 1 35 08 PM](https://user-images.githubusercontent.com/44108233/119082866-e7375f00-ba39-11eb-9ade-a931a5957b34.png) - Python ![Screen Shot 2021-05-21 at 1 38 27 PM](https://user-images.githubusercontent.com/44108233/119082879-eef70380-ba39-11eb-9e8e-ee50eed98dbe.png) - Scala ![Screen Shot 2021-05-21 at 1 36 52 PM](https://user-images.githubusercontent.com/44108233/119082884-f1595d80-ba39-11eb-98d5-966657df65f7.png) - Java ![Screen Shot 2021-05-21 at 1 37 19 PM](https://user-images.githubusercontent.com/44108233/119082888-f4544e00-ba39-11eb-8bf8-47ce78ec0b01.png) ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32161 from itholic/SPARK-34491. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:49 +09:00
itholic	419ddcb2a4	[SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move JSON data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JSON data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "JSON Files" page <img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png"> - Python <img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png"> - Scala <img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png"> - Java <img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png"> ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32204 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:13 +09:00
itholic	0fe65b5365	[SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes to move ORC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for ORC data source options has been split across separate pages in each language's API documents. This makes managing many options inconvenient, so it is more efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown as below after this change: - "ORC Files" page ![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png) - Python ![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png) - Scala ![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png) - Java ![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png) ### How was this patch tested? Manually built the docs and confirmed the page. Closes #32546 from itholic/SPARK-35395. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:03:57 +09:00
Vinod KC	bdd8e1dbb1	[SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory ### What changes were proposed in this pull request? CTAS with location clause acts as an insert overwrite. This can cause problems when there are subdirectories within a location directory. This causes some users to accidentally wipe out directories with very important data. We should not allow CTAS with location to a non-empty directory. ### Why are the changes needed? Hive already handled this scenario: HIVE-11319 Steps to reproduce: ```scala sql("""create external table `demo_CTAS`( `comment` string) PARTITIONED BY (`col1` string, `col2` string) STORED AS parquet location '/tmp/u1/demo_CTAS'""") sql("""INSERT OVERWRITE TABLE demo_CTAS partition (col1='1',col2='1') VALUES ('abc')""") sql("select* from demo_CTAS").show sql("""create table ctas1 location '/tmp/u2/ctas1' as select * from demo_CTAS""") sql("select* from ctas1").show sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""") ``` Before the fix: Both create table operations will succeed. But values in table ctas1 will be replaced by ctas2 accidentally. After the fix: `create table ctas2...` will throw `AnalysisException`: ``` org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true. ``` ### Does this PR introduce _any_ user-facing change? Yes, if the location directory is not empty, CTAS with location will throw AnalysisException ``` sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""") ``` ``` org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true. 
``` `CREATE TABLE AS SELECT` with non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `true`; the default value is `false`. Updated SQL migration guide. ### How was this patch tested? Test case added in SQLQuerySuite.scala Closes #32411 from vinodkc/br_fixCTAS_nonempty_dir. Authored-by: Vinod KC <vinod.kc.in@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-20 06:13:18 +00:00
Kousuke Saruta	7b942d523c	[SPARK-35425][BUILD] Pin jinja2 in `spark-rm/Dockerfile` and add as a required dependency in the release README.md ### What changes were proposed in this pull request? The following two things are done in this PR. * Add a note about Jinja2 as a required dependency for the document build. * Add the Jinja2 dependency for the document build to `spark-rm/Dockerfile` ### Why are the changes needed? SPARK-35375(#32509) confined the version of Jinja to <3.0.0. So it's good to note this in `docs/README.md` and add the dependency to `spark-rm/Dockerfile`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed that `make html` succeeds under `python/docs` after installing dependencies with the following command. ``` sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0' ``` Closes #32573 from sarutak/required-module-for-python-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-05-18 16:48:23 +09:00
Chris Thomas	ceb8122c40	[SPARK-35399][DOCUMENTATION] State is still needed in the event of executor failure ### What changes were proposed in this pull request? Fix incorrect statement that state is no longer needed in the event of executor failure and document that it is needed in the case of a flaky app causing occasional executor failure. SO [discussion](https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of/67507439#67507439). ### Why are the changes needed? To fix the documentation and guide users as to additional use case for the Shuffle Service. ### Does this PR introduce _any_ user-facing change? Documentation only. ### How was this patch tested? N/A. Closes #32538 from chrisheaththomas/shuffle-service-and-executor-failure. Authored-by: Chris Thomas <chrisheaththomas@hotmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-17 08:58:46 -05:00
Oleksandr Shevchenko	d2fbf0dce4	[SPARK-35405][DOC] Submitting Applications documentation has outdated information about K8s client mode support ### What changes were proposed in this pull request? [Submitting Applications doc](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) has outdated information about K8s client mode support. It still says "Client mode is currently unsupported and will be supported in future releases". ![image](https://user-images.githubusercontent.com/31073930/118268920-b5b51580-b4c6-11eb-8eed-975be8d37964.png) However, it is already supported, and the [Running Spark on Kubernetes doc](https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode) says that it has been supported since 2.4.0 and has all the needed information. ![image](https://user-images.githubusercontent.com/31073930/118268947-bd74ba00-b4c6-11eb-98d5-37961327642f.png) Changes: ![image](https://user-images.githubusercontent.com/31073930/118269179-12b0cb80-b4c7-11eb-8a37-d9d301bbda53.png) JIRA: https://issues.apache.org/jira/browse/SPARK-35405 ### Why are the changes needed? Outdated information in the doc is misleading ### Does this PR introduce _any_ user-facing change? Documentation changes ### How was this patch tested? Documentation changes Closes #32551 from o-shevchenko/SPARK-35405. Authored-by: Oleksandr Shevchenko <oleksandr.shevchenko@datarobot.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-14 11:26:35 -07:00
Kent Yao	d424771ec2	[MINOR][DOC] ADD toc for monitoring page ### What changes were proposed in this pull request? Add toc tag on monitoring.md ### Why are the changes needed? fix doc ### Does this PR introduce _any_ user-facing change? yes, the table of contents of the monitoring page will be shown on the official doc site. ### How was this patch tested? pass GA doc build Closes #32545 from yaooqinn/minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>	2021-05-14 14:19:15 +08:00
Pablo Langa	9ea55fe771	[SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types) ### What changes were proposed in this pull request? Generally, we would expect that x = y => hash( x ) = hash( y ). However +-0 hash to different values for floating point types. ``` scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show +-------------------------+--------------------------+ \|hash(CAST(0.0 AS DOUBLE))\|hash(CAST(-0.0 AS DOUBLE))\| +-------------------------+--------------------------+ \| -1670924195\| -853646085\| +-------------------------+--------------------------+ scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show +--------------------------------------------+ \|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))\| +--------------------------------------------+ \| true\| +--------------------------------------------+ ``` Here is an extract from IEEE 754: > The two zeros are distinguishable arithmetically only by either division-by-zero ( producing appropriately signed infinities ) or else by the CopySign function recommended by IEEE 754 /854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases From this, I deduce that the hash function must produce the same result for 0 and -0. ### Why are the changes needed? It is a correctness issue ### Does this PR introduce _any_ user-facing change? This change only affects the hash function applied to the -0 value in float and double types ### How was this patch tested? Unit testing and manual testing Closes #32496 from planga82/feature/spark35207_hashnegativezero. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-14 12:40:36 +08:00
Gabor Somogyi	b6a0a7ea53	[SPARK-35311][SS][UI][DOCS] Structured Streaming Web UI state information documentation ### What changes were proposed in this pull request? In this PR I'm adding Structured Streaming Web UI state information documentation. ### Why are the changes needed? Missing documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` cd docs/ SKIP_API=1 bundle exec jekyll build ``` Manual webpage check. Closes #32433 from gaborgsomogyi/SPARK-35311. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-05-14 10:40:12 +09:00
Luca Canali	ae0579a945	[SPARK-35369][DOC] Document ExecutorAllocationManager metrics ### What changes were proposed in this pull request? This proposes to document the available metrics for ExecutorAllocationManager in the Spark monitoring documentation. ### Why are the changes needed? The ExecutorAllocationManager is instrumented with metrics using the Spark metrics system. The relevant work is in SPARK-7007 and SPARK-33763 ExecutorAllocationManager metrics are currently undocumented. ### Does this PR introduce _any_ user-facing change? This PR adds documentation only. ### How was this patch tested? na Closes #32500 from LucaCanali/followupMetricsDocSPARK33763. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-05-12 13:02:00 -07:00
Ludovic Henry	b52d47a920	[SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0 ### What changes were proposed in this pull request? Bump to `dev.ludovic.netlib:2.0` which provides JNI-based wrappers for BLAS, ARPACK, and LAPACK. These do not depend on GPL or LGPL libraries, which allows providing out-of-the-box support for hardware acceleration when a native library is present (it is still up to the end user to install such a library on their system, like OpenBLAS, Intel MKL, and libarpack2). ### Why are the changes needed? Great performance improvement for ML-related workloads on vanilla distributions of Spark. ### Does this PR introduce _any_ user-facing change? Users now take advantage of hardware acceleration as long as a native library is installed (like OpenBLAS, Intel MKL and libarpack2). ### How was this patch tested? Spark test-suite + dev.ludovic.netlib testsuite. #### JDK8: ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.F2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 220 226 6 454.9 2.2 1.0X [info] java 221 228 5 451.9 2.2 1.0X [info] native 209 215 5 478.7 2.1 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 121 125 3 823.3 1.2 1.0X [info] java 121 125 3 824.3 1.2 1.0X [info] native 101 105 3 988.4 1.0 1.2X [info] [info] dcopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
----------------------------------------------------------------------------------------------- [info] f2j 212 219 6 470.9 2.1 1.0X [info] java 208 212 4 481.0 2.1 1.0X [info] native 209 215 5 478.5 2.1 1.0X [info] [info] scopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 114 119 3 878.9 1.1 1.0X [info] java 99 105 3 1011.4 1.0 1.2X [info] native 97 103 3 1026.7 1.0 1.2X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 108 111 2 925.9 1.1 1.0X [info] java 71 73 2 1414.9 0.7 1.5X [info] native 54 56 2 1847.0 0.5 2.0X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 96 97 2 1046.8 1.0 1.0X [info] java 47 48 1 2129.8 0.5 2.0X [info] native 29 30 1 3404.7 0.3 3.3X [info] [info] dnrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 139 143 2 718.2 1.4 1.0X [info] java 46 47 1 2171.2 0.5 3.0X [info] native 44 46 2 2261.8 0.4 3.1X [info] [info] snrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 154 157 4 651.0 1.5 1.0X [info] java 40 42 1 2469.3 0.4 3.8X [info] native 26 27 1 3787.6 0.3 5.8X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 185 195 8 541.0 1.8 1.0X [info] java 186 196 7 538.5 1.9 1.0X [info] native 177 
187 7 564.1 1.8 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 98 102 3 1016.2 1.0 1.0X [info] java 98 102 3 1017.8 1.0 1.0X [info] native 87 91 3 1143.2 0.9 1.1X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 68 70 1 1474.7 0.7 1.0X [info] java 51 52 1 1973.0 0.5 1.3X [info] native 30 32 1 3298.8 0.3 2.2X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 96 99 2 1037.9 1.0 1.0X [info] java 50 51 1 1999.6 0.5 1.9X [info] native 30 31 1 3368.1 0.3 3.2X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 59 61 1 1688.7 0.6 1.0X [info] java 41 42 1 2461.9 0.4 1.5X [info] native 15 16 1 6593.0 0.2 3.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 90 92 1 1116.2 0.9 1.0X [info] java 39 40 1 2565.8 0.4 2.3X [info] native 15 16 1 6594.2 0.2 5.9X [info] [info] dger: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 192 202 7 520.5 1.9 1.0X [info] java 203 214 7 491.9 2.0 0.9X [info] native 176 187 7 568.8 1.8 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
----------------------------------------------------------------------------------------------- [info] f2j 59 61 1 846.1 1.2 1.0X [info] java 38 39 1 1313.5 0.8 1.6X [info] native 24 27 1 2047.8 0.5 2.4X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 97 101 3 515.4 1.9 1.0X [info] java 97 101 2 515.1 1.9 1.0X [info] native 88 91 3 569.1 1.8 1.1X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 169 174 3 295.4 3.4 1.0X [info] java 169 174 3 295.4 3.4 1.0X [info] native 160 165 4 312.2 3.2 1.1X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 561 577 13 1782.3 0.6 1.0X [info] java 225 231 4 4446.2 0.2 2.5X [info] native 31 32 3 32473.1 0.0 18.2X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 570 584 9 1754.8 0.6 1.0X [info] java 224 230 4 4457.3 0.2 2.5X [info] native 31 32 1 32493.4 0.0 18.5X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 855 866 6 1169.2 0.9 1.0X [info] java 224 228 3 4466.9 0.2 3.8X [info] native 31 32 1 32395.5 0.0 27.7X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 1328 1344 8 752.8 1.3 1.0X [info] java 224 
230 4 4458.9 0.2 5.9X [info] native 31 32 1 32201.8 0.0 42.8X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 534 541 5 1873.0 0.5 1.0X [info] java 220 224 3 4542.8 0.2 2.4X [info] native 15 16 1 66803.1 0.0 35.7X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 544 551 6 1839.6 0.5 1.0X [info] java 220 224 4 4538.2 0.2 2.5X [info] native 15 16 1 65589.9 0.0 35.7X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 833 845 21 1201.0 0.8 1.0X [info] java 220 224 3 4548.7 0.2 3.8X [info] native 15 16 1 66603.2 0.0 55.5X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 899 907 5 1112.9 0.9 1.0X [info] java 221 224 2 4531.6 0.2 4.1X [info] native 15 16 1 65944.9 0.0 59.3X ``` #### JDK11: ``` [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.F2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 195 200 3 512.2 2.0 1.0X [info] java 197 202 3 507.0 2.0 1.0X [info] native 184 189 4 543.0 1.8 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) 
Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 108 112 3 921.8 1.1 1.0X [info] java 101 105 3 989.4 1.0 1.1X [info] native 87 91 3 1147.1 0.9 1.2X [info] [info] dcopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 187 191 3 535.1 1.9 1.0X [info] java 182 188 3 548.8 1.8 1.0X [info] native 178 182 3 562.2 1.8 1.1X [info] [info] scopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 110 114 3 909.3 1.1 1.0X [info] java 86 93 4 1159.3 0.9 1.3X [info] native 86 90 3 1162.4 0.9 1.3X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 106 108 2 943.6 1.1 1.0X [info] java 70 71 2 1426.8 0.7 1.5X [info] native 54 56 2 1835.4 0.5 1.9X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 96 97 1 1047.1 1.0 1.0X [info] java 43 44 1 2331.9 0.4 2.2X [info] native 29 30 1 3392.1 0.3 3.2X [info] [info] dnrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 114 115 2 880.7 1.1 1.0X [info] java 42 43 1 2398.1 0.4 2.7X [info] native 45 46 1 2233.3 0.4 2.5X [info] [info] snrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 140 143 2 714.6 1.4 1.0X [info] java 28 29 1 3531.0 0.3 4.9X 
[info] native 26 27 1 3820.0 0.3 5.3X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 156 166 7 641.3 1.6 1.0X [info] java 158 167 6 633.2 1.6 1.0X [info] native 150 160 7 664.8 1.5 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 85 88 2 1181.7 0.8 1.0X [info] java 85 88 2 1176.0 0.9 1.0X [info] native 75 78 2 1333.2 0.8 1.1X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 58 59 1 1731.1 0.6 1.0X [info] java 41 43 1 2415.5 0.4 1.4X [info] native 30 31 1 3293.9 0.3 1.9X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 94 96 1 1063.4 0.9 1.0X [info] java 41 42 1 2435.8 0.4 2.3X [info] native 30 30 1 3379.8 0.3 3.2X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 44 45 1 2278.9 0.4 1.0X [info] java 37 38 0 2686.8 0.4 1.2X [info] native 15 16 1 6555.4 0.2 2.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 88 89 1 1142.1 0.9 1.0X [info] java 33 34 1 3010.7 0.3 2.6X [info] native 15 16 1 6553.9 0.2 5.7X [info] [info] dger: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
----------------------------------------------------------------------------------------------- [info] f2j 164 172 4 609.4 1.6 1.0X [info] java 163 172 5 612.6 1.6 1.0X [info] native 150 159 4 667.0 1.5 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 49 50 1 1029.4 1.0 1.0X [info] java 41 42 1 1209.4 0.8 1.2X [info] native 25 27 1 2029.2 0.5 2.0X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 80 85 3 622.2 1.6 1.0X [info] java 80 85 3 622.4 1.6 1.0X [info] native 75 79 3 668.7 1.5 1.1X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 137 142 3 364.1 2.7 1.0X [info] java 139 142 2 360.4 2.8 1.0X [info] native 131 135 3 380.4 2.6 1.0X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 517 525 5 1935.5 0.5 1.0X [info] java 213 216 3 4704.8 0.2 2.4X [info] native 31 31 1 32705.6 0.0 16.9X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 589 601 6 1698.6 0.6 1.0X [info] java 213 217 3 4693.3 0.2 2.8X [info] native 31 32 1 32498.9 0.0 19.1X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 851 865 6 1175.3 0.9 1.0X [info] java 212 216 3 
4717.0 0.2 4.0X [info] native 30 32 1 32903.0 0.0 28.0X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 1301 1316 6 768.4 1.3 1.0X [info] java 212 216 2 4717.4 0.2 6.1X [info] native 31 32 1 32606.0 0.0 42.4X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 454 460 2 2203.0 0.5 1.0X [info] java 208 212 3 4803.8 0.2 2.2X [info] native 15 16 0 66586.0 0.0 30.2X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 529 536 4 1889.7 0.5 1.0X [info] java 208 212 3 4798.6 0.2 2.5X [info] native 15 16 1 66751.4 0.0 35.3X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 830 840 5 1205.1 0.8 1.0X [info] java 208 211 2 4814.1 0.2 4.0X [info] native 15 15 1 67676.4 0.0 56.2X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 894 907 7 1118.7 0.9 1.0X [info] java 208 211 3 4809.6 0.2 4.3X [info] native 15 16 1 66675.2 0.0 59.6X ``` #### JDK16: ``` [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.F2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.VectorBLAS [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
[info] ----------------------------------------------------------------------------------------------- [info] f2j 193 199 3 517.5 1.9 1.0X [info] java 181 186 4 553.2 1.8 1.1X [info] native 181 185 5 553.6 1.8 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 108 112 2 925.1 1.1 1.0X [info] java 88 91 3 1138.6 0.9 1.2X [info] native 87 91 3 1144.2 0.9 1.2X [info] [info] dcopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 184 189 3 542.5 1.8 1.0X [info] java 181 185 3 552.8 1.8 1.0X [info] native 179 183 2 558.0 1.8 1.0X [info] [info] scopy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 97 101 3 1031.6 1.0 1.0X [info] java 86 90 2 1163.7 0.9 1.1X [info] native 85 88 2 1182.9 0.8 1.1X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 107 109 2 932.4 1.1 1.0X [info] java 54 56 2 1846.7 0.5 2.0X [info] native 54 56 2 1846.7 0.5 2.0X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 96 97 1 1043.6 1.0 1.0X [info] java 29 30 1 3439.3 0.3 3.3X [info] native 29 30 1 3423.9 0.3 3.3X [info] [info] dnrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 121 123 2 829.8 1.2 1.0X [info] java 32 32 1 3171.3 0.3 3.8X [info] 
native 45 46 1 2246.2 0.4 2.7X [info] [info] snrm2: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 142 144 2 705.9 1.4 1.0X [info] java 15 16 1 6585.8 0.2 9.3X [info] native 26 27 1 3839.5 0.3 5.4X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 157 165 5 635.6 1.6 1.0X [info] java 151 159 5 664.0 1.5 1.0X [info] native 151 160 5 663.6 1.5 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 85 89 2 1172.3 0.9 1.0X [info] java 75 79 3 1337.3 0.7 1.1X [info] native 75 79 2 1335.5 0.7 1.1X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 58 59 1 1731.5 0.6 1.0X [info] java 28 29 1 3544.2 0.3 2.0X [info] native 30 31 1 3306.2 0.3 1.9X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 90 92 1 1108.3 0.9 1.0X [info] java 28 28 1 3622.5 0.3 3.3X [info] native 30 31 1 3381.3 0.3 3.1X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 44 45 1 2284.7 0.4 1.0X [info] java 14 15 1 7034.0 0.1 3.1X [info] native 15 16 1 6643.7 0.2 2.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
----------------------------------------------------------------------------------------------- [info] f2j 85 86 1 1177.4 0.8 1.0X [info] java 15 15 1 6886.1 0.1 5.8X [info] native 15 16 1 6560.1 0.2 5.6X [info] [info] dger: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 164 173 6 608.1 1.6 1.0X [info] java 148 157 5 675.2 1.5 1.1X [info] native 152 160 5 659.9 1.5 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 61 63 1 815.4 1.2 1.0X [info] java 16 17 1 3104.3 0.3 3.8X [info] native 24 27 1 2071.9 0.5 2.5X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 81 85 2 616.4 1.6 1.0X [info] java 81 85 2 614.7 1.6 1.0X [info] native 75 78 2 669.5 1.5 1.1X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 138 141 3 362.7 2.8 1.0X [info] java 137 140 2 365.3 2.7 1.0X [info] native 131 134 2 382.9 2.6 1.1X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 525 544 8 1906.2 0.5 1.0X [info] java 61 68 3 16358.1 0.1 8.6X [info] native 31 32 1 32623.7 0.0 17.1X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 580 598 12 1724.5 0.6 1.0X [info] java 61 68 4 16302.5 0.1 9.5X 
[info] native 30 32 1 32962.8 0.0 19.1X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 829 838 4 1206.2 0.8 1.0X [info] java 61 69 3 16339.7 0.1 13.5X [info] native 30 31 1 33231.9 0.0 27.6X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 1352 1363 5 739.6 1.4 1.0X [info] java 61 69 3 16347.0 0.1 22.1X [info] native 31 32 1 32740.3 0.0 44.3X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 482 493 7 2073.1 0.5 1.0X [info] java 35 38 2 28315.3 0.0 13.7X [info] native 15 15 1 67579.7 0.0 32.6X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 472 482 4 2119.0 0.5 1.0X [info] java 36 38 2 28138.1 0.0 13.3X [info] native 15 16 1 66616.5 0.0 31.4X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 823 830 5 1215.2 0.8 1.0X [info] java 35 38 2 28681.4 0.0 23.6X [info] native 15 15 1 67908.4 0.0 55.9X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------- [info] f2j 896 908 7 1115.8 0.9 1.0X [info] java 35 38 2 28402.0 0.0 25.5X [info] native 15 16 0 66691.2 0.0 59.8X ``` TODO: - [x] update documentation in `docs/` and `docs/ml-linalg-guide.md` referring to 
`com.github.fommil.netlib` - [ ] merge https://github.com/luhenry/netlib/pull/1 with all feedback from this PR + remove references to snapshot repositories in `pom.xml` and `project/SparkBuild.scala`. Closes #32415 from luhenry/master. Authored-by: Ludovic Henry <git@ludovic.dev> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-12 08:59:36 -05:00
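To read the `Relative` column in the benchmark tables of SPARK-35295, it helps to recompute it. The sketch below assumes that `Relative` is the baseline's best time divided by each case's best time; that reading is an assumption, although the printed numbers are consistent with it. `relative_speedup` is a hypothetical helper, and the inputs are the rounded JDK8 `dgemm[T,T]` best times copied from that commit message.

```python
def relative_speedup(baseline_best_ms: float, best_ms: float) -> float:
    """Speedup of a case over the baseline, computed from best times (assumption)."""
    return baseline_best_ms / best_ms

# Best Time(ms) for dgemm[T,T] on JDK8, as printed in the commit message.
dgemm_tt_jdk8 = {"f2j": 1328, "java": 224, "native": 31}

baseline = dgemm_tt_jdk8["f2j"]
for name, best in dgemm_tt_jdk8.items():
    # Format like the benchmark output, e.g. "native  42.8X".
    print(f"{name:<8}{relative_speedup(baseline, best):.1f}X")
```

Because the table prints best times rounded to whole milliseconds, recomputed ratios for very fast cases may differ in the last digit from the reported values.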
Kousuke Saruta	2b6640a169	[SPARK-35229][WEBUI] Limit the maximum number of items on the timeline view ### What changes were proposed in this pull request? This PR proposes to introduce three new configurations to limit the maximum number of jobs/stages/executors on the timeline view. ### Why are the changes needed? If the number of items on the timeline view grows beyond 1000, rendering can be significantly slow. https://issues.apache.org/jira/browse/SPARK-35229 The maximum number of tasks on the timeline is already limited by `spark.ui.timeline.tasks.maximum`, so I propose to mitigate this issue in the same manner. ### Does this PR introduce _any_ user-facing change? Yes. The maximum number of items shown on the timeline view is limited. I propose a default value of 500 for jobs and stages, and 250 for executors. An executor has at most 2 items (added and removed), so 250 is chosen. ### How was this patch tested? I manually confirmed that this change works with the following procedure. ``` # launch a cluster $ bin/spark-shell --conf spark.ui.retainedDeadExecutors=300 --master "local-cluster[4, 1, 1024]" // Confirm the maximum number of jobs (1 to 1000).foreach { _ => sc.parallelize(List(1)).collect } // Confirm the maximum number of stages var df = sc.parallelize(1 to 2) (1 to 1000).foreach { i => df = df.repartition(i % 5 + 1) } df.collect // Confirm the maximum number of executors (1 to 300).foreach { _ => try sc.parallelize(List(1)).foreach { _ => System.exit(0) } catch { case e => }} ``` Screenshots here. ![jobs_limited](https://user-images.githubusercontent.com/4736016/116386937-3e8c4a00-a855-11eb-8f4c-151cf7ddd3b8.png) ![stages_limited](https://user-images.githubusercontent.com/4736016/116386990-49df7580-a855-11eb-9f71-8e129e3336ab.png) ![executors_limited](https://user-images.githubusercontent.com/4736016/116387009-4f3cc000-a855-11eb-8697-a2eb4c9c99e6.png) Closes #32381 from sarutak/mitigate-timeline-issue. 
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-05-11 20:53:11 +08:00
Max Gekk	335f00b19b	[SPARK-35285][SQL] Parse ANSI interval types in SQL schema ### What changes were proposed in this pull request? 1. Extend Spark SQL parser to support parsing of: - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType` - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType` 2. Assign new names to the ANSI interval types according to the SQL standard so that the Spark SQL parser can parse the names back. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`. ### Why are the changes needed? To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules: ``` <interval type> ::= INTERVAL <interval qualifier> <interval qualifier> ::= <start field> TO <end field> \| <single datetime field> <start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ] <end field> ::= <non-second primary datetime field> \| SECOND [ <left paren> <interval fractional seconds precision> <right paren> ] <primary datetime field> ::= <non-second primary datetime field> \| SECOND <non-second primary datetime field> ::= YEAR \| MONTH \| DAY \| HOUR \| MINUTE <interval fractional seconds precision> ::= <unsigned integer> <interval leading field precision> ::= <unsigned integer> ``` Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`. ### Does this PR introduce _any_ user-facing change? It should not, since the types have not been released yet. ### How was this patch tested? By running the affected tests such as: ``` $ build/sbt "sql/testOnly SQLQueryTestSuite -- -z interval.sql" $ build/sbt "sql/testOnly SQLQueryTestSuite -- -z datetime.sql" $ build/sbt "test:testOnly ExpressionTypeCheckingSuite" $ build/sbt "sql/testOnly SQLQueryTestSuite -- -z windowFrameCoercion.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql" ``` Closes #32409 from MaxGekk/parse-ansi-interval-types. 
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-03 13:50:35 +09:00
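The mapping described in SPARK-35285 can be sketched as a tiny standalone parser. This is an illustration in plain Python, not Spark's SQL parser: `parse_interval_type` and the lookup table are hypothetical, and only the two qualifiers that the commit says are supported are accepted.

```python
import re

# Hypothetical lookup from (start field, end field) to the Spark type names
# mentioned in the commit message.
_QUALIFIER_TO_TYPE = {
    ("YEAR", "MONTH"): "YearMonthIntervalType",
    ("DAY", "SECOND"): "DayTimeIntervalType",
}

# Matches "INTERVAL <start field> TO <end field>", case-insensitively.
_INTERVAL_RE = re.compile(r"^\s*INTERVAL\s+(\w+)\s+TO\s+(\w+)\s*$", re.IGNORECASE)

def parse_interval_type(text: str) -> str:
    """Return the type name for an ANSI interval type string, or raise ValueError."""
    match = _INTERVAL_RE.match(text)
    if match is None:
        raise ValueError(f"not an interval type: {text!r}")
    key = (match.group(1).upper(), match.group(2).upper())
    if key not in _QUALIFIER_TO_TYPE:
        raise ValueError(f"unsupported interval qualifier: {key[0]} TO {key[1]}")
    return _QUALIFIER_TO_TYPE[key]

print(parse_interval_type("INTERVAL YEAR TO MONTH"))
print(parse_interval_type("interval day to second"))
```

Qualifiers that the grammar allows but the commit does not list, such as `HOUR TO MINUTE`, are rejected here rather than guessed at.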
Kousuke Saruta	132cbf0c8c	[SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE commands ### What changes were proposed in this pull request? This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive. ### Why are the changes needed? To make those commands more useful. ### Does this PR introduce _any_ user-facing change? Yes. In the current implementation, those commands can take a path which contains whitespace without enclosing it in either `'` or `"`, but after this change, users need to enclose such paths. I've noted this incompatibility in the migration guide. ### How was this patch tested? New tests. Closes #32205 from sarutak/add-multiple-files. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-04-29 13:58:51 +09:00
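The quoting requirement introduced by SPARK-35105 follows from splitting the argument string on whitespace. The sketch below uses Python's `shlex` module as an analogy only; it is not how Spark parses `ADD FILE/JAR/ARCHIVE`, and `split_paths` and the paths are hypothetical.

```python
import shlex

def split_paths(args: str):
    """Split a whitespace-separated argument string, honoring ' and " quotes."""
    return shlex.split(args)

# Two paths, the second quoted because it contains a space.
print(split_paths("/tmp/a.jar '/tmp/my dir/b.jar'"))

# Without quotes, the same path is read as two separate arguments.
print(split_paths("/tmp/my dir/b.jar"))
```

Once a command accepts several paths, an unquoted space can no longer be told apart from a separator, which is why paths containing whitespace must be quoted.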
Kousuke Saruta	529b875901	[SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources ### What changes were proposed in this pull request? This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows changes to `krb5.conf` to be reflected. ### Why are the changes needed? In the current master, JDBC datasources can't accept `refreshKrb5Config`, which is defined in `Krb5LoginModule`. So even if we change `krb5.conf` after establishing a connection, the change will not be reflected. A similar issue happens when we run multiple `KrbIntegrationSuites` at the same time. `MiniKDC` starts and stops for every KerberosIntegrationSuite, and a different port number is recorded in `krb5.conf`. Because `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, every KerberosIntegrationSuite except the first one to run sees the wrong port, so those suites fail. You can easily confirm this with the following command. ``` build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.KrbIntegrationSuite" ``` ### Does this PR introduce _any_ user-facing change? Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration. ### How was this patch tested? New test. Closes #32344 from sarutak/kerberos-refresh-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-04-29 13:55:53 +09:00
Angerszhuuuu	26a5e339a6	[SPARK-33976][SQL][DOCS][FOLLOWUP] Fix syntax error in select doc page ### What changes were proposed in this pull request? Add doc about `TRANSFORM` and related function. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #32257 from AngersZhuuuu/SPARK-33976-followup. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-04-28 16:47:02 +09:00
Ludovic Henry	5b77ebb57b	[SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib ### What changes were proposed in this pull request? Following https://github.com/apache/spark/pull/30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package. The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focus on accelerating the linear algebra routines in use in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return the exact same results as the reference implementation. Under the hood, it reimplements the necessary algorithms in pure autovectorization-friendly Java 8, as well as takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available. A table summarising which version gets loaded in which case: ``` \| \| BLAS.nativeBLAS \| BLAS.javaBLAS \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| with -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| wrapper for com.github.fommil:all \| (JDK16+, relies on the Vector API, requires \| \| \| 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| relies on the Foreign Linker API, requires \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| `--add-modules=jdk.incubator.foreign \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| -Dforeign.restricted=warn`) \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| 3. 
fails to load, falls back to BLAS.javaBLAS in \| wrapper for com.github.fommil:core \| \| \| org.apache.spark.ml.linalg.BLAS \| \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| without -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| relies on the Foreign Linker API, requires \| (JDK16+, relies on the Vector API, requires \| \| \| `--add-modules=jdk.incubator.foreign \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| -Dforeign.restricted=warn`) \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| 2. fails to load, falls back to BLAS.javaBLAS in \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| org.apache.spark.ml.linalg.BLAS \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| \| wrapper for com.github.fommil:core \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| ``` ### Why are the changes needed? Accelerates linear algebra operations when the pure-java fallback method is in use. Transparently falls back to native implementation (OpenBLAS, MKL) when available. ### Does this PR introduce _any_ user-facing change? No, all changes are transparent to the user. ### How was this patch tested? The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite. 
[1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`: #### JDK8: ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 223 232 8 448.0 2.2 1.0X [info] java 221 228 7 453.0 2.2 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 122 128 4 821.2 1.2 1.0X [info] java 122 128 4 822.3 1.2 1.0X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 112 2 921.4 1.1 1.0X [info] java 70 74 3 1423.5 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 1046.1 1.0 1.0X [info] java 47 49 2 2121.7 0.5 2.0X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 184 195 8 544.3 1.8 1.0X [info] java 185 196 7 539.5 1.9 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 99 104 4 1011.9 1.0 1.0X [info] java 99 104 4 1010.4 1.0 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 947.2 1.1 1.0X [info] java 0 0 0 1584.8 0.6 1.7X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 867.4 1.2 1.0X [info] java 1 1 0 865.0 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 485.9 2.1 1.0X [info] java 1 1 0 486.8 2.1 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1843.0 0.5 1.0X [info] java 0 0 0 2690.6 0.4 1.5X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1214.7 0.8 1.0X [info] java 0 0 0 2536.8 0.4 2.1X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1895.9 0.5 1.0X [info] java 0 0 0 2961.1 0.3 1.6X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) 
Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1223.4 0.8 1.0X [info] java 0 0 0 3091.4 0.3 2.5X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 560 575 20 1787.1 0.6 1.0X [info] java 226 232 5 4432.4 0.2 2.5X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 570 586 23 1755.2 0.6 1.0X [info] java 227 232 4 4410.1 0.2 2.5X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 863 879 17 1158.4 0.9 1.0X [info] java 227 231 3 4407.9 0.2 3.8X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1282 1305 23 780.0 1.3 1.0X [info] java 227 232 4 4413.4 0.2 5.7X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 538 548 8 1858.6 0.5 1.0X [info] java 221 226 3 4521.1 0.2 2.4X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 549 558 10 1819.9 0.5 1.0X [info] java 222 229 7 4503.5 0.2 2.5X [info] 
[info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 838 852 12 1193.0 0.8 1.0X [info] java 222 229 5 4500.5 0.2 3.8X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 905 919 18 1104.8 0.9 1.0X [info] java 221 228 5 4521.3 0.2 4.1X ``` #### JDK11: ``` [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 195 204 10 512.7 2.0 1.0X [info] java 195 202 7 512.4 2.0 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 113 4 923.3 1.1 1.0X [info] java 102 107 4 984.4 1.0 1.1X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 107 110 3 938.1 1.1 1.0X [info] java 69 72 3 1447.1 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 
1046.5 1.0 1.0X [info] java 43 45 2 2317.1 0.4 2.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 155 168 8 644.2 1.6 1.0X [info] java 158 169 8 632.8 1.6 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 90 4 1178.1 0.8 1.0X [info] java 86 90 4 1167.7 0.9 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 1182.1 0.8 1.0X [info] java 0 0 0 1432.1 0.7 1.2X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 898.7 1.1 1.0X [info] java 1 1 0 891.5 1.1 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 495.4 2.0 1.0X [info] java 1 1 0 495.7 2.0 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2271.6 0.4 1.0X [info] java 0 0 0 3648.1 0.3 1.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] 
f2j 1 1 0 1229.3 0.8 1.0X [info] java 0 0 0 2711.3 0.4 2.2X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2677.5 0.4 1.0X [info] java 0 0 0 3288.2 0.3 1.2X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1233.0 0.8 1.0X [info] java 0 0 0 2766.3 0.4 2.2X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 520 536 16 1923.6 0.5 1.0X [info] java 214 221 7 4669.5 0.2 2.4X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 593 612 17 1686.5 0.6 1.0X [info] java 215 219 3 4643.3 0.2 2.8X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 853 870 16 1172.8 0.9 1.0X [info] java 215 218 3 4659.7 0.2 4.0X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1350 1370 23 740.8 1.3 1.0X [info] java 215 219 4 4656.6 0.2 6.3X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 460 468 6 2173.2 0.5 1.0X [info] java 210 213 2 4752.7 0.2 2.2X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 535 544 8 1869.3 0.5 1.0X [info] java 210 215 5 4761.8 0.2 2.5X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 843 853 11 1186.8 0.8 1.0X [info] java 209 214 4 4793.4 0.2 4.0X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 891 904 15 1122.0 0.9 1.0X [info] java 209 214 4 4777.2 0.2 4.3X ``` #### JDK16: ``` [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 194 199 7 515.7 1.9 1.0X [info] java 181 186 3 551.1 1.8 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 115 4 915.0 1.1 1.0X [info] java 88 92 3 1138.8 0.9 1.2X [info] [info] ddot: Best 
Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 110 2 922.6 1.1 1.0X [info] java 54 56 2 1839.2 0.5 2.0X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 97 2 1046.1 1.0 1.0X [info] java 29 30 1 3393.4 0.3 3.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 156 165 5 643.0 1.6 1.0X [info] java 150 159 5 667.1 1.5 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 91 6 1171.0 0.9 1.0X [info] java 75 79 3 1340.6 0.7 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 917.0 1.1 1.0X [info] java 0 0 0 8147.2 0.1 8.9X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 859.3 1.2 1.0X [info] java 1 1 0 859.3 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 482.1 2.1 1.0X [info] java 1 1 0 482.6 2.1 1.0X [info] [info] 
dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2214.2 0.5 1.0X [info] java 0 0 0 7975.8 0.1 3.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1231.4 0.8 1.0X [info] java 0 0 0 8680.9 0.1 7.0X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2684.3 0.4 1.0X [info] java 0 0 0 18527.1 0.1 6.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1235.4 0.8 1.0X [info] java 0 0 0 17347.9 0.1 14.0X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 530 552 18 1887.5 0.5 1.0X [info] java 58 64 3 17143.9 0.1 9.1X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 598 620 17 1671.1 0.6 1.0X [info] java 58 64 3 17196.6 0.1 10.3X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 834 847 14 1199.4 0.8 1.0X [info] 
java 57 63 4 17486.9 0.1 14.6X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1338 1366 22 747.3 1.3 1.0X [info] java 58 63 3 17356.6 0.1 23.2X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 489 501 9 2045.5 0.5 1.0X [info] java 36 38 2 27721.9 0.0 13.6X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 478 488 9 2094.0 0.5 1.0X [info] java 36 38 2 27813.2 0.0 13.3X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 825 837 10 1211.6 0.8 1.0X [info] java 35 38 2 28433.1 0.0 23.5X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 900 918 15 1111.6 0.9 1.0X [info] java 36 38 2 28073.0 0.0 25.3X ``` [2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas Closes #32253 from luhenry/master. Authored-by: Ludovic Henry <git@ludovic.dev> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 14:00:59 -05:00
Daoyuan Wang	26a8d2f908	[SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents ### What changes were proposed in this pull request? Add JindoFS SDK documents link in the cloud integration section of Spark's official document. ### Why are the changes needed? If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? tested the url manually. Closes #32360 from adrian-wang/jindodoc. Authored-by: Daoyuan Wang <daoyuan.wdy@alibaba-inc.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 09:32:47 -05:00
Angerszhuuuu	1db031f158	[SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited ### What changes were proposed in this pull request? Add a note to the migration guide that DayTimeIntervalType/YearMonthIntervalType values are displayed differently between Hive SerDe and row format delimited. ### Why are the changes needed? To add the note. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-26 17:47:47 +03:00
Angerszhuuuu	20d68dc2f4	[SPARK-35159][SQL][DOCS] Extract hive format doc ### What changes were proposed in this pull request? Extract common doc about hive format for `sql-ref-syntax-ddl-create-table-hiveformat.md` and `sql-ref-syntax-qry-select-transform.md` to refer. ![image](https://user-images.githubusercontent.com/46485123/115802193-04641800-a411-11eb-827d-d92544881842.png) ### Why are the changes needed? Improve doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #32264 from AngersZhuuuu/SPARK-35159. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-23 05:47:48 +00:00
Angerszhuuuu	b22d54a58a	[SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS ### What changes were proposed in this pull request? PG and Oracle both support using CUBE/ROLLUP/GROUPING SETS inside a grouping set of GROUPING SETS as syntactic sugar. ![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png) In this PR, we support it in Spark SQL too. ### Why are the changes needed? To stay consistent with PG and Oracle. ### Does this PR introduce _any_ user-facing change? Users can write grouping analytics like ``` SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b)); SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ()); SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ())); ``` ### How was this patch tested? Added tests. Closes #32201 from AngersZhuuuu/SPARK-35026. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-22 13:08:22 +00:00
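The nesting in SPARK-35026 relies on the standard SQL expansion of `ROLLUP` and `CUBE` into plain grouping sets. The sketch below models that expansion in plain Python as an illustration only; it is not Spark's analyzer code, and the helper names `rollup`, `cube`, and `grouping_sets` are hypothetical. A nested construct is represented as a list of grouping sets that gets flattened into its parent.

```python
from itertools import combinations


def rollup(*cols):
    """ROLLUP(a, b) expands to the prefixes (a, b), (a), ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube(*cols):
    """CUBE(a, b) expands to every subset: (a, b), (a), (b), ()."""
    sets = []
    for size in range(len(cols), -1, -1):
        sets.extend(combinations(cols, size))
    return sets


def grouping_sets(*items):
    """Flatten nested expansions (lists) and plain grouping sets (tuples)."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)        # nested ROLLUP / CUBE / GROUPING SETS
        else:
            flat.append(tuple(item))  # an ordinary grouping set
    return flat


# GROUPING SETS(ROLLUP(a, b)) and GROUPING SETS((a, b), (a), ()) agree.
print(grouping_sets(rollup("a", "b")))
print(grouping_sets(("a", "b"), ("a",), ()))
```

Under this model the three queries listed in the commit message all describe the same list of grouping sets.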
Shardul Mahadik	83f753e4e1	[SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode ### What changes were proposed in this pull request? In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths. ### Why are the changes needed? SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at `b26e7b510b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L1280)`. However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find and load the `ivySettings` file. Closes #31591 from shardulm94/SPARK-34472. Authored-by: Shardul Mahadik <smahadik@linkedin.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2021-04-20 13:35:57 -05:00
Angerszhuuuu	9c956abb1d	[SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause ### What changes were proposed in this pull request? Add doc about `TRANSFORM` and related function. ![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png) ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31010 from AngersZhuuuu/SPARK-33976. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-20 10:30:26 +00:00
Dongjoon Hyun	00f06dd267	[SPARK-35131][K8S] Support early driver service clean-up during app termination ### What changes were proposed in this pull request? This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up `Driver Service` resource during app termination. ### Why are the changes needed? The K8s service is one of the important resources and sometimes it's controlled by quota. ``` $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 1 3 ``` Apache Spark creates a service for driver whose lifecycle is the same with driver pod. It means a new Spark job submission fails if the number of completed Spark jobs equals the number of service quota. BEFORE ``` $ k get pod NAME READY STATUS RESTARTS AGE org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver 0/1 Completed 0 31m org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver 0/1 Completed 0 78s $ k get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 80m org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 31m org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 80s $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 3 3 $ bin/spark-submit... Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services. Message: Forbidden! User minikube doesn't have permission. services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden: exceeded quota: service, requested: services=1, used: services=3, limited: services=3. 
``` AFTER ``` $ k get pod NAME READY STATUS RESTARTS AGE org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver 0/1 Completed 0 26s org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver 0/1 Completed 0 67s org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver 0/1 Completed 0 44s $ k get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 172m $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 1 3 ``` ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default. The change is documented at the migration guide. ### How was this patch tested? Pass the CIs. This is tested with K8s IT manually. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. 
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 19 minutes, 9 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #32226 from dongjoon-hyun/SPARK-35131. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-19 12:11:08 -07:00
Max Gekk	1d1ed3eb25	[SPARK-35107][SQL] Parse unit-to-unit interval literals to ANSI intervals ### What changes were proposed in this pull request? Parse the year-month interval literals like `INTERVAL '1-1' YEAR TO MONTH` to values of `YearMonthIntervalType`, and day-time interval literals to `DayTimeIntervalType` values. Currently, Spark SQL supports: - DAY TO HOUR - DAY TO MINUTE - DAY TO SECOND - HOUR TO MINUTE - HOUR TO SECOND - MINUTE TO SECOND All such interval literals are converted to `DayTimeIntervalType`, and `YEAR TO MONTH` to `YearMonthIntervalType`, while losing info about the `from` and `to` units. Note: the new behavior is under the SQL config `spark.sql.legacy.interval.enabled`, which is `false` by default. When the config is set to `true`, the interval literals are parsed to `CalendarIntervalType` values. Closes #32176 ### Why are the changes needed? To conform to the ANSI SQL standard, which assumes conversions of interval literals to year-month or day-time intervals but not to a mixed interval type like Catalyst's `CalendarIntervalType`. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND; 1 days 1 hours 2 minutes 3.123 seconds spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND); interval ``` After: ```sql spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND; 1 01:02:03.123000000 spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND); day-time interval ``` ### How was this patch tested? 1. 
By running the affected test suites: ``` $ ./build/sbt "test:testOnly .ExpressionParserSuite" $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly SQLQueryTestSuite -- -z interval.sql" $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly SQLQueryTestSuite -- -z create_view.sql" $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly SQLQueryTestSuite -- -z date.sql" $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql" ``` 2. PostgreSQL tests are executed with `spark.sql.legacy.interval.enabled` set to `true` to keep compatibility with PostgreSQL output: ```sql > SELECT interval '999' second; 0 years 0 mons 0 days 0 hours 16 mins 39.00 secs ``` Closes #32209 from MaxGekk/parse-ansi-interval-literals. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-19 16:00:59 +03:00
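The day-time literal handling described in SPARK-35107 can be illustrated outside Spark. The following is a minimal sketch in plain Python, not Spark's parser: the helper names `parse_day_to_second` and `format_day_to_second` are hypothetical, only the `DAY TO SECOND` shape is handled, and fractions finer than microseconds are truncated because `timedelta` cannot hold them. It shows how a string such as `'1 01:02:03.123'` maps to a single duration and how that duration renders in the nine-fraction-digit form shown in the "After" output.

```python
import re
from datetime import timedelta

# Hypothetical pattern for the "D HH:MM:SS[.fraction]" body of a
# DAY TO SECOND literal; Spark's real grammar is more permissive.
_DAY_TO_SECOND = re.compile(
    r"^(?P<sign>[+-]?)(?P<days>\d+) (?P<hours>\d{1,2}):(?P<minutes>\d{1,2}):"
    r"(?P<seconds>\d{1,2})(?:\.(?P<fraction>\d{1,9}))?$"
)


def parse_day_to_second(text: str) -> timedelta:
    """Parse the string body of a DAY TO SECOND literal into a timedelta."""
    m = _DAY_TO_SECOND.match(text.strip())
    if m is None:
        raise ValueError(f"not a DAY TO SECOND interval string: {text!r}")
    # Pad the fraction to nanoseconds, then keep microseconds only.
    fraction = (m["fraction"] or "").ljust(9, "0")
    delta = timedelta(
        days=int(m["days"]),
        hours=int(m["hours"]),
        minutes=int(m["minutes"]),
        seconds=int(m["seconds"]),
        microseconds=int(fraction[:6]),
    )
    return -delta if m["sign"] == "-" else delta


def format_day_to_second(delta: timedelta) -> str:
    """Render a timedelta as 'D HH:MM:SS.fffffffff' (nine fraction digits)."""
    sign = "-" if delta < timedelta(0) else ""
    total_micros = abs(delta) // timedelta(microseconds=1)
    seconds, micros = divmod(total_micros, 1_000_000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    # The last three digits are always zero: timedelta has no nanoseconds.
    return f"{sign}{days} {hours:02d}:{minutes:02d}:{seconds:02d}.{micros:06d}000"


interval = parse_day_to_second("1 01:02:03.123")
print(format_day_to_second(interval))
```

Unlike a mixed calendar interval, the value here is one fixed-length duration, which is why it can be compared and rendered without reference to months or years.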
Angerszhuuuu	a74f601040	[SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform ### What changes were proposed in this pull request? Support using ArrayType/MapType/StructType data in no-serde mode script transform. ### Why are the changes needed? Allow users to process array/map/struct data. ### Does this PR introduce _any_ user-facing change? Yes, users can process array/map/struct data in script transform `no-serde` mode. ### How was this patch tested? Added UT Closes #30957 from AngersZhuuuu/SPARK-31937. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-19 17:02:32 +09:00
Sean Owen	2e1e1f83e4	[MINOR][DOCS] Soften security warning and keep it in cluster management docs only ### What changes were proposed in this pull request? Soften security warning and keep it in cluster management docs only, not in the main doc page, where it's not necessarily relevant. ### Why are the changes needed? The statement is perhaps unnecessarily 'frightening' as the first section in the main docs page. It applies to clusters not local mode, anyhow. ### Does this PR introduce _any_ user-facing change? Just a docs change. ### How was this patch tested? N/A Closes #32206 from srowen/SecurityStatement. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-17 08:44:00 -05:00
ulysses-you	345c380778	[SPARK-35083][CORE] Support remote scheduler pool files ### What changes were proposed in this pull request? Use Hadoop FileSystem instead of FileInputStream. ### Why are the changes needed? Make `spark.scheduler.allocation.file` support remote files. When using Spark as a server (e.g. SparkThriftServer), it's hard for users to specify a local path as the scheduler pool file. ### Does this PR introduce _any_ user-facing change? Yes, a minor feature. ### How was this patch tested? Pass `core/src/test/scala/org/apache/spark/scheduler/PoolSuite.scala` and a manual test. After adding the config `spark.scheduler.allocation.file=hdfs:///tmp/fairscheduler.xml`, the configured pool is introduced. ![pool1](https://user-images.githubusercontent.com/12025282/114810037-df065700-9ddd-11eb-8d7a-54b59a07ee7b.jpg) Closes #32184 from ulysses-you/SPARK-35083. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-16 00:18:35 -07:00
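For context on what the remote file in SPARK-35083 contains, here is a minimal sketch that generates a fair scheduler allocation file with Python's standard library. The `allocations`/`pool` layout with `schedulingMode`, `weight`, and `minShare` follows Spark's documented `fairscheduler.xml` format; the pool name `production`, its values, and the helper name `build_fair_scheduler_xml` are illustrative assumptions, and uploading the file to HDFS is left out.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools: dict) -> str:
    """Build the XML text of a fair scheduler allocation file.

    `pools` maps a pool name to its schedulingMode, weight and minShare.
    """
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for tag in ("schedulingMode", "weight", "minShare"):
            ET.SubElement(pool, tag).text = str(settings[tag])
    return ET.tostring(root, encoding="unicode")


# Hypothetical pool definition, for illustration only.
xml_text = build_fair_scheduler_xml(
    {"production": {"schedulingMode": "FAIR", "weight": 1, "minShare": 2}}
)
print(xml_text)

# Settings that would point Spark at the file once it is stored remotely;
# the path is the one used in the commit's manual test.
submit_conf = [
    "--conf", "spark.scheduler.mode=FAIR",
    "--conf", "spark.scheduler.allocation.file=hdfs:///tmp/fairscheduler.xml",
]
print(" ".join(submit_conf))
```

With the change in this commit, the allocation file no longer has to exist on the local disk of the machine that starts the driver.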
Angerszhuuuu	71133e1c2a	[SPARK-35070][SQL] TRANSFORM not support alias in inputs ### What changes were proposed in this pull request? Normal function parameters should not support aliases, and Hive does not support them either. ![image](https://user-images.githubusercontent.com/46485123/114645556-4a7ff400-9d0c-11eb-91eb-bc679ea0039a.png) This PR forbids using aliases in `TRANSFORM`'s inputs. ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #32165 from AngersZhuuuu/SPARK-35070. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-15 06:51:07 +00:00
Angerszhuuuu	0fc97b5bf4	[SPARK-34577][SQL][FOLLOWUP] Add change of `DESC NAMESPACE`'s schema to migration guide ### What changes were proposed in this pull request? Add the change of `DESC NAMESPACE`'s schema to the migration guide. ### Why are the changes needed? Update doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed Closes #32155 from AngersZhuuuu/SPARK-34577-followup. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-14 06:30:45 +00:00
Sean Owen	700aa1769c	[SPARK-35050][DOCS][MESOS] Document deprecation of Apache Mesos in 3.2.0 ### What changes were proposed in this pull request? Deprecate Apache Mesos support for Spark 3.2.0 by adding documentation to this effect. ### Why are the changes needed? Apache Mesos is ceasing development (https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E) ; at some point we'll want to drop support, so, deprecate it now. This doesn't mean it'll go away in 3.3.0. ### Does this PR introduce _any_ user-facing change? No, docs only. ### How was this patch tested? N/A Closes #32150 from srowen/SPARK-35050. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:17:58 +09:00
Gengliang Wang	79e55b44f7	[SPARK-35028][SQL] ANSI mode: disallow group by aliases ### What changes were proposed in this pull request? Disallow group by aliases under ANSI mode. ### Why are the changes needed? As per the ANSI SQL standard section 7.12 <group by clause>: >Each `grouping column reference` shall unambiguously reference a column of the table resulting from the `from clause`. A column referenced in a `group by clause` is a grouping column. By forbidding it, we can avoid ambiguous SQL queries like: ``` SELECT col + 1 as col FROM t GROUP BY col ``` ### Does this PR introduce _any_ user-facing change? Yes, group by aliases are not allowed under ANSI mode. ### How was this patch tested? Unit tests Closes #32129 from gengliangwang/disallowGroupByAlias. Authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-04-13 10:42:57 +08:00
Yuming Wang	e40fce919a	[SPARK-34562][SQL] Add test and doc for Parquet Bloom filter push down ### What changes were proposed in this pull request? This PR adds a test and documentation for Parquet Bloom filter push down. ### Why are the changes needed? Improve the documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Generating docs: ![image](https://user-images.githubusercontent.com/5399861/114327472-c131bb80-9b6b-11eb-87a0-6f9a74eb1097.png) Closes #32123 from wangyum/SPARK-34562. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-12 17:07:35 +03:00
Angerszhuuuu	21232377ba	[SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics ### What changes were proposed in this pull request? Support GROUP BY with separate columns combined with CUBE/ROLLUP. PostgreSQL supports: ``` select a, b, c, count(1) from t group by a, b, cube (a, b, c); select a, b, c, count(1) from t group by a, b, rollup(a, b, c); select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c); select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ()); ``` This PR does two things: 1. Support partial grouping analytics such as `group by a, cube(a, b)` 2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)` Partial Groupings Partial groupings means there are both `group_expression` and `CUBE\|ROLLUP\|GROUPING SETS` in the GROUP BY clause. For example: `GROUP BY warehouse, CUBE(product, location)` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`. `GROUP BY warehouse, ROLLUP(product, location)` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`. `GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`. Concatenated Groupings Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product operation enables even a small number of concatenated groupings to generate a large number of final groups. The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP` clauses, and separating them with commas. 
For example: `GROUP BY GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))` is equivalent to `GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`. `GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to `GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which in turn is equivalent to `GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse), (product, location, size), (product, location), (product), (location, size), (location), ())`. `GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to `GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which in turn is equivalent to `GROUP BY GROUPING SETS( (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product), (order, warehouse, location, size), (order, warehouse, location), (order, warehouse), (order, product, location, size), (order, product, location), (order, product), (order, location, size), (order, location), (order))`. ### Why are the changes needed? Support more flexible grouping analytics ### Does this PR introduce _any_ user-facing change? Users can write SQL like ``` select a, b, c, agg_expr() from table group by a, cube(b, c) ``` ### How was this patch tested? Added UT Closes #30144 from AngersZhuuuu/SPARK-33229. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-04-12 08:23:52 +00:00
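The expansion rules in SPARK-33229 (partial groupings and concatenated groupings) amount to a cross-product over grouping sets. The following is a minimal sketch of that rule in plain Python, not Spark's analyzer: the function names `rollup`, `cube`, `concat_groupings`, and `partial_grouping` are hypothetical, columns are plain strings, and duplicate grouping sets are kept as produced, which may differ from what a real engine does.

```python
from itertools import product


def rollup(*cols):
    """ROLLUP(a, b) -> [(a, b), (a,), ()]: every prefix, longest first."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube(*cols):
    """CUBE(a, b) -> [(a, b), (a,), (b,), ()]: every subset of the columns."""
    return [
        tuple(c for c, keep in zip(cols, mask) if keep)
        for mask in product((True, False), repeat=len(cols))
    ]


def concat_groupings(*groupings):
    """Cross-product of several lists of grouping sets.

    Each result set is the concatenation of one set from every grouping,
    with a column listed only once inside a set.
    """
    result = []
    for combo in product(*groupings):
        merged = []
        for grouping_set in combo:
            for col in grouping_set:
                if col not in merged:
                    merged.append(col)
        result.append(tuple(merged))
    return result


def partial_grouping(plain_cols, grouping):
    """GROUP BY col, CUBE(...) treats the plain columns as one fixed set."""
    return concat_groupings([tuple(plain_cols)], grouping)


# GROUP BY warehouse, CUBE(product, location)
for grouping_set in partial_grouping(["warehouse"], cube("product", "location")):
    print(grouping_set)

# GROUP BY CUBE(warehouse, product), ROLLUP(location, size): 4 * 3 = 12 sets
print(len(concat_groupings(cube("warehouse", "product"), rollup("location", "size"))))
```

Treating a plain grouping column as a single-set grouping is what lets the partial case reuse the same cross-product as the concatenated case.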
hissy	214a46aa88	[MINOR][SS][DOC] Fix wrong Python code sample ### What changes were proposed in this pull request? This patch fixes wrong Python code sample for doc. ### Why are the changes needed? Sample code is wrong. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc only. Closes #32119 from Hisssy/ss-doc-typo-1. Authored-by: hissy <aozora@live.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-10 12:33:30 +03:00
itholic	8e15ac1c67	[SPARK-34493][DOCS] Add "TEXT Files" page for Data Source documents ### What changes were proposed in this pull request? This PR aims to add documentation on how to read and write TEXT files through various APIs such as Scala, Python and Java in Spark to [Data Source documents](https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources). ### Why are the changes needed? Documentation on how Spark handles TEXT files is missing. It should be added to the document for user convenience. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a new page to Data Sources documents. ### How was this patch tested? Manually built the documents and checked the page locally, as below. ![Screen Shot 2021-04-07 at 4 05 01 PM](https://user-images.githubusercontent.com/44108233/113824674-085e2c00-97bb-11eb-91ae-d2cc19dfd369.png) Closes #32053 from itholic/SPARK-34491-TEXT. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-07 17:11:43 +03:00
Kent Yao	7cffacef18	[SPARK-34935][SQL] CREATE TABLE LIKE should respect the reserved table properties ### What changes were proposed in this pull request? CREATE TABLE LIKE should respect the reserved properties of tables and fail if specified, using `spark.sql.legacy.notReserveProperties` to restore. ### Why are the changes needed? Make DDLs consistently treat reserved properties ### Does this PR introduce _any_ user-facing change? YES, this is a breaking change as using `create table like` w/ reserved properties will fail. ### How was this patch tested? new test Closes #32025 from yaooqinn/SPARK-34935. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-04-06 08:52:48 +09:00
Wenchen Fan	39d5677ee3	[SPARK-34932][SQL] deprecate GROUP BY ... GROUPING SETS (...) and promote GROUP BY GROUPING SETS (...) ### What changes were proposed in this pull request? GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream databases. This syntax requires users to repeat the expressions inside `GROUPING SETS (...)` after `GROUP BY`, and has a weird null semantic if `GROUP BY` contains more expressions than `GROUPING SETS (...)`. This PR deprecates this syntax: 1. Do not promote it in the document and only mention it as a Hive-compatible syntax. 2. Simplify the code to only keep it for Hive compatibility. ### Why are the changes needed? Deprecate a weird grammar. ### Does this PR introduce _any_ user-facing change? No breaking change, but it removes a check to simplify the code: `GROUP BY a GROUPING SETS(a, b)` failed before and forced users to also put `b` after `GROUP BY`. Now this works just as `GROUP BY GROUPING SETS(a, b)`. ### How was this patch tested? existing tests Closes #32022 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-04-06 08:49:08 +09:00

3167 commits