Commit graph

27485 commits

Jungtaek Lim (HeartSaVioR) f55f6b569b
[SPARK-31101][BUILD] Upgrade Janino to 3.0.16
### What changes were proposed in this pull request?

This PR (SPARK-31101) proposes to upgrade Janino to 3.0.16, which was released recently.

* Merged pull request janino-compiler/janino#114 "Grow the code for relocatables, and do fixup, and relocate".

Please see the commit log.
- https://github.com/janino-compiler/janino/commits/3.0.16

You can see the changelog at http://janino-compiler.github.io/janino/changelog.html, though the release note for Janino 3.0.16 there is actually incorrect.

### Why are the changes needed?

We got a report of a failure on a user's query where Janino throws an error while compiling the generated code. The issue is here: janino-compiler/janino#113. It contains the generated code, the symptom (error), and an analysis of the bug, so please refer to the link for more details.
Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables Janino to compile the user's query properly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #27932 from HeartSaVioR/SPARK-31101-janino-3.0.16.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 19:10:23 -07:00
Gabor Somogyi bf342bafa8
[SPARK-30541][TESTS] Implement KafkaDelegationTokenSuite with testRetry
### What changes were proposed in this pull request?
`KafkaDelegationTokenSuite` has been ignored because it showed flaky behaviour. In this PR I've changed how the test executes and turned it on again. This PR contains the following:
* The test runs in separate JVM in order to avoid modified security context
* The body of the test runs in `testRetry`, which retries on failure
* Additional logs to analyse possible failures
* Enhanced clean-up code

### Why are the changes needed?
`KafkaDelegationTokenSuite` is currently ignored.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Executed the test in a loop 1k+ times on Jenkins (locally it is much harder to reproduce).

Closes #27877 from gaborgsomogyi/SPARK-30541.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 18:59:29 -07:00
Prashant Sharma 3799d2b9d8
[SPARK-30715][K8S][TESTS][FOLLOWUP] Update k8s client version in IT as well
### What changes were proposed in this pull request?
This is a follow-up for SPARK-30715. It brings the Kubernetes client version in sync between integration-tests and kubernetes/core.

### Why are the changes needed?
More than once, the Kubernetes client version has gone out of sync between the integration tests and kubernetes/core. So this brings them back in sync and adds a comment to save us from needing this additional follow-up in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually.

Closes #27948 from ScrapCodes/follow-up-spark-30715.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 18:26:53 -07:00
Eric Wu 3a48ea1fe0
[SPARK-31184][SQL] Support getTablesByType API of Hive Client
### What changes were proposed in this pull request?
Hive 2.3+ supports the `getTablesByType` API, which provides an efficient way to get Hive tables of a specific type. Now, we have the following mappings when using `HiveExternalCatalog`.
```
CatalogTableType.EXTERNAL  =>  HiveTableType.EXTERNAL_TABLE
CatalogTableType.MANAGED => HiveTableType.MANAGED_TABLE
CatalogTableType.VIEW => HiveTableType.VIRTUAL_VIEW
```
Without this API, we need to achieve the goal by `getTables` + `getTablesByName` + `filter with type`.

This PR adds `getTablesByType` in `HiveShim`. For Hive versions that don't support this API, an `UnsupportedOperationException` will be thrown, and the upper logic should catch the exception and fall back to the filter solution mentioned above.
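A minimal sketch of that fallback pattern, assuming a simplified client interface (the method names `listTablesByType`, `listTables` and `getTablesByName` here are illustrative assumptions, not the exact Spark-internal signatures):

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}

// Stand-in for the Hive client interface; names are assumptions for illustration.
trait SimpleHiveClient {
  def listTablesByType(db: String, pattern: String, tableType: CatalogTableType): Seq[CatalogTable]
  def listTables(db: String, pattern: String): Seq[String]
  def getTablesByName(db: String, tables: Seq[String]): Seq[CatalogTable]
}

def listViews(client: SimpleHiveClient, db: String, pattern: String): Seq[CatalogTable] = {
  try {
    // Hive 2.3+: ask the metastore directly for tables of the given type.
    client.listTablesByType(db, pattern, CatalogTableType.VIEW)
  } catch {
    case _: UnsupportedOperationException =>
      // Older Hive versions: list all names, load the tables, and filter by type.
      val names = client.listTables(db, pattern)
      client.getTablesByName(db, names).filter(_.tableType == CatalogTableType.VIEW)
  }
}
```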

Since the JDK11-related fix in `Hive` is not released yet, manual tests against Hive 2.3.7-SNAPSHOT were done by following the instructions of SPARK-29245.

### Why are the changes needed?
This API provides better usability and performance when we want to get a list of Hive tables of a specific type, for example `HiveTableType.VIRTUAL_VIEW` corresponding to `CatalogTableType.VIEW`.

### Does this PR introduce any user-facing change?
No, this is a support function.

### How was this patch tested?
Added tests in `VersionsSuite` and manually ran the JDK11 test with the following settings:

- Hive 2.3.6 Metastore on JDK8
- Hive 2.3.7-SNAPSHOT library build from source of Hive 2.3 branch
- Spark build with Hive 2.3.7-SNAPSHOT on jdk-11.0.6

Closes #27952 from Eric5553/GetTableByType.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-21 17:41:23 -07:00
yan ma fae981e5f3 [SPARK-30773][ML] Support NativeBlas for level-1 routines
### What changes were proposed in this pull request?
Change BLAS for part of the level-1 routines (axpy, dot, scal(double, denseVector)) from the Java implementation to NativeBLAS when the vector size > 256.

### Why are the changes needed?
In the current ML BLAS.scala, all level-1 routines are fixed to use the Java implementation. But NativeBLAS (Intel MKL, OpenBLAS) can bring up to 11X performance improvement, based on a performance test that applied direct calls against these methods. We should provide a way to allow users to take advantage of NativeBLAS for level-1 routines. Here we do it by switching these methods from f2jBLAS to NativeBLAS.

### Does this PR introduce any user-facing change?
Yes, the level-1 methods axpy, dot, and scal will switch to NativeBLAS when the vector has more than nativeL1Threshold (fixed value 256) elements, and will fall back to f2jBLAS if native BLAS is not properly configured in the system.
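A rough sketch of that dispatch, assuming the threshold and behaviour described above (illustrative, not the actual MLlib `BLAS.scala`):

```scala
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}
import com.github.fommil.netlib.BLAS.{getInstance => NativeBLAS}
import org.apache.spark.ml.linalg.DenseVector

object Level1Blas {
  private val f2jBLAS: NetlibBLAS = new F2jBLAS
  private val nativeL1Threshold = 256 // fixed threshold from the description above

  // NativeBLAS (BLAS.getInstance) itself falls back to the JVM implementation
  // when no native library is configured, matching the behaviour described above.
  private def blasForSize(size: Int): NetlibBLAS =
    if (size > nativeL1Threshold) NativeBLAS else f2jBLAS

  // y += a * x, dispatched to native BLAS only for sufficiently large vectors.
  def axpy(a: Double, x: DenseVector, y: DenseVector): Unit = {
    require(x.size == y.size, "vector sizes must match")
    blasForSize(x.size).daxpy(x.size, a, x.values, 1, y.values, 1)
  }
}
```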

### How was this patch tested?
Perf tests with direct calls to the level-1 routines.

Closes #27546 from yma11/SPARK-30773.

Lead-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Ma Yan <yan.ma@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:32:58 -05:00
Huaxin Gao 9a990133f6 [SPARK-31138][ML] Add ANOVA Selector for continuous features and categorical labels
### What changes were proposed in this pull request?
Add ANOVA Selector

### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuously distributed features.

https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.

### Does this PR introduce any user-facing change?
Yes, add a new Selector.

### How was this patch tested?
add new test suites

Closes #27895 from huaxingao/anova.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-20 10:28:00 -05:00
Kent Yao 88ae6c4481 [SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document
### What changes were proposed in this pull request?

Fix errors and missing parts for datetime pattern document
1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical, so we shouldn't reference either of them in the API docs; instead, link to our own doc.
2. Some pattern letters are missing.
3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N').
4. The second-fraction pattern uses different logic for parsing and formatting.

### Why are the changes needed?

fix and improve doc
### Does this PR introduce any user-facing change?

yes, new and updated doc
### How was this patch tested?

pass Jenkins
viewed locally with `jekyll serve`
![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png)

Closes #27956 from yaooqinn/SPARK-31189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 21:59:26 +08:00
Maxim Gekk b402bc900a [SPARK-31183][SQL][FOLLOWUP] Move rebase tests to AvroSuite and check the rebase flag out of function bodies
### What changes were proposed in this pull request?
1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`.
2. Checking of the `rebaseDateTime` flag is moved out of function bodies.

### Why are the changes needed?
1. The tests are moved because they are not directly related to logical types.
2. Checking the flag outside of function bodies should improve performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running Avro tests via the command `build/sbt avro/test`

Closes #27964 from MaxGekk/rebase-avro-datetime-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-20 19:02:54 +09:00
Maxim Gekk 6a668763b8 [SPARK-31195][SQL] Correct and reuse days rebase functions of DateTimeUtils in DaysWritable
### What changes were proposed in this pull request?
In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`.

### Why are the changes needed?
0. Existing rebasing of days in `DaysWritable` is not correct.
1. To deduplicate code in `DaysWritable`
2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5

### Does this PR introduce any user-facing change?
This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`.

### How was this patch tested?
By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`.

Closes #27962 from MaxGekk/reuse-rebase-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-20 15:57:21 +09:00
Maxim Gekk 4766a36647 [SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier versions in reading/writing dates and timestamps via the **Avro** datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, the Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15, when the hybrid calendar switches between the Gregorian and Julian calendars. The same local date in different calendars is converted to a different number of days since the epoch 1970-01-01. For example, the date 1001-01-01 is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default, meaning the rebasing is not performed unless it is explicitly enabled.

The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types.

### Why are the changes needed?
For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same result. Also, after the changes, users can enable the rebasing on write and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+
```

### How was this patch tested?
1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")

scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> :paste
// Entering paste mode (ctrl-D to finish)

  val timestampSchema = s"""
    |  {
    |    "namespace": "logical",
    |    "type": "record",
    |    "name": "test",
    |    "fields": [
    |      {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
    |    ]
    |  }
    |""".stripMargin

// Exiting paste mode, now interpreting.
scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")

```

2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased dates/timestamps, read them back with rebasing enabled/disabled, and compare the results:
  - `rebasing microseconds timestamps in write`
  - `rebasing milliseconds timestamps in write`
  - `rebasing dates in write`

Closes #27953 from MaxGekk/rebase-avro-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 13:57:49 +08:00
Dongjoon Hyun f1cc86792f [SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases
### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases have some assumptions on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes that assumption explicit on the test-case side.

The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.

### Why are the changes needed?

This makes our test cases more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be a one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 12:28:57 +08:00
Takeshi Yamamuro ca499e9409
[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution
### What changes were proposed in this pull request?

This PR fixes the code to respect a database name for broadcast table hint resolution.
Currently, Spark ignores the database name in multi-part names:
```
scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
```

This PR comes from https://github.com/apache/spark/pull/22198

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #27935 from maropu/SPARK-25121-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 20:11:04 -07:00
Dongjoon Hyun c6a6d5e006
Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir"
This reverts commit 5bc0d76591.

Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 16:08:51 -07:00
Wenchen Fan ac262cb272 [SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26933

Fraction string like "1.23" is definitely not a valid integral format and we should fail to do the cast under the ANSI mode.

### Why are the changes needed?

correct the ANSI cast behavior from string to integral

### Does this PR introduce any user-facing change?

Yes under ANSI mode, but ANSI mode is off by default.

### How was this patch tested?

new test

Closes #27957 from cloud-fan/ansi.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-20 00:52:09 +09:00
Kris Mok a1776288f4 [SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId
### What changes were proposed in this pull request?

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or SQL `EXPLAIN CODEGEN` statement.

The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending.

After this change, the following query:
```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```
will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)

*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```

The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.

### Why are the changes needed?

Minor improvement to aid WSCG debugging.

### Does this PR introduce any user-facing change?

No user-facing change for end-users; minor change for developers who debug WSCG generated code.

### How was this patch tested?

Manually tested the output; all other tests still pass.

Closes #27955 from rednaxelafx/codegen.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 20:53:01 +09:00
Maxim Gekk bb295d80e3 [SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier versions in reading/writing dates and timestamps via the Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, the Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15, when the hybrid calendar switches between the Gregorian and Julian calendars. The same local date in different calendars is converted to a different number of days since the epoch 1970-01-01. For example, the date 1001-01-01 is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

According to the Parquet spec, Parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output types and Parquet dates should be based on the Proleptic Gregorian calendar, but `INT96` timestamps should be stored as Julian days. Since version 3.0, Spark conforms to the spec, but for backward compatibility with previous versions, the PR proposes rebasing from/to the Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.parquet.rebaseDateTime.enabled
```
which is set to `false` by default, meaning the rebasing is not performed unless it is explicitly enabled.

The details of the implementation:
1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in the Proleptic Gregorian calendar, extracts the date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using the `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulting local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does the reverse conversion.
2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in the Proleptic Gregorian calendar, interprets the resulting date as a local date in the hybrid calendar, and gets the number of days since the epoch from the resulting local date. The conversion is performed via the `UTC` time zone because the conversion is independent of time zones, and `UTC` is selected to avoid rounding issues when casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` function does the reverse conversion (a rough sketch follows this list).
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on.
5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`.
6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`.
7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`.
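A rough sketch of the days-rebasing idea from item 2 above, assuming the described approach of reinterpreting the same local date in the hybrid calendar via `java.util.Calendar` in UTC (illustrative only, not the actual `DateTimeUtils` code):

```scala
import java.time.LocalDate
import java.util.{Calendar, TimeZone}

def rebaseGregorianToJulianDays(days: Int): Int = {
  // Interpret the day count in the Proleptic Gregorian calendar.
  val localDate = LocalDate.ofEpochDay(days)
  // Re-build the same year/month/day in the hybrid Julian + Gregorian calendar, in UTC.
  val hybridCal = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
  hybridCal.clear()
  hybridCal.set(localDate.getYear, localDate.getMonthValue - 1, localDate.getDayOfMonth)
  // Convert the resulting instant back to a day count since 1970-01-01.
  Math.toIntExact(Math.floorDiv(hybridCal.getTimeInMillis, 24L * 60 * 60 * 1000))
}
```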

### Why are the changes needed?
- For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
- It fixes the bug of incorrect saving/loading timestamps of the `INT96` type

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 11:32:20.123456|
+--------------------------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
```

### How was this patch tested?
1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date")

scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96")
```
2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes):
```bash
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Read the saved date/timestamp by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```

Closes #27915 from MaxGekk/rebase-parquet-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-19 12:49:51 +08:00
sarthfrey-db 6fd3138e9c [SPARK-30667][CORE] Change BarrierTaskContext allGather method return type
This PR proposes that we change the return type of the `BarrierTaskContext.allGather` method to `Array[String]` instead of `ArrayBuffer[String]` since it is immutable. Based on discussion in #27640. cc zhengruifeng srowen

Closes #27951 from sarthfrey/all-gather-api.

Authored-by: sarthfrey-db <sarth.frey@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-19 12:12:39 +09:00
Burak Yavuz 4237251861 [SPARK-31178][SQL] Prevent V2 exec nodes from executing multiple times
### What changes were proposed in this pull request?

This PR prevents the execution of V2 DataSource exec nodes multiple times when `collect()` is called on them. For V1 DataSources, commands would be executed as a RunnableCommand, which would cache the result as part of the `ExecutedCommandExec` node. We extend `V2CommandExec` for all the data writing commands so that they only get executed once as well.
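A minimal sketch of that execute-once idea (illustrative, not the actual `V2CommandExec` code): the result is computed lazily and reused, so repeated `collect()` calls never re-run the side-effecting command.

```scala
import org.apache.spark.sql.catalyst.InternalRow

abstract class RunOnceExec {
  // The side-effecting work, e.g. writing data or creating a table.
  protected def run(): Seq[InternalRow]

  // Evaluated at most once, then reused by every caller.
  private lazy val result: Seq[InternalRow] = run()

  final def executeCollect(): Array[InternalRow] = result.toArray
}
```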

### Why are the changes needed?

Otherwise, calling `collect()` on a SQL command that inserts data or creates a table causes the command to be executed multiple times.

### Does this PR introduce any user-facing change?

Fixes a bug

### How was this patch tested?

Unit tests

Closes #27941 from brkyvz/doubleInsert.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-03-18 18:07:24 -07:00
Wenchen Fan 8643e5d9c5 [SPARK-31171][SQL][FOLLOWUP] update document
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/27936 to update document.

### Why are the changes needed?

correct document

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27950 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:29:31 +09:00
Kent Yao 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
### What changes were proposed in this pull request?

pattern `''` means literal `'`

```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
0946a9514f missed this case and this PR adds it back.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #27949 from yaooqinn/SPARK-31150-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:27:06 +09:00
Huaxin Gao d22c9f6c0d [SPARK-30933][ML][DOCS] ML, GraphX 3.0 QA: Update user guide for new features & APIs
### What changes were proposed in this pull request?
Change ml-tuning.html.

### Why are the changes needed?
Add descriptions for `MultilabelClassificationEvaluator` and `RankingEvaluator`.

### Does this PR introduce any user-facing change?
Yes

before:
![image](https://user-images.githubusercontent.com/13592258/76437013-2c5ffb80-6376-11ea-8946-f5c2e7379b7c.png)

after:
![image](https://user-images.githubusercontent.com/13592258/76437054-397cea80-6376-11ea-867f-fe8d8fa4e5b3.png)

### How was this patch tested?

Closes #27880 from huaxingao/spark-30933.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-18 13:21:24 -05:00
yi.wu 8bfaa62f2f [SPARK-31175][SQL] Avoid creating reverse comparator for each compare in InterpretedOrdering
### What changes were proposed in this pull request?

Prepend `-` to the compare result instead of creating a new reverse comparator for each compare when sorting in DESC order in `InterpretedOrdering`.
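A minimal illustration of the change, assuming a comparator that returns negative/zero/positive values (a sketch, not the exact `InterpretedOrdering` code):

```scala
// Descending compare without allocating ordering.reverse on every call.
def compareDesc[T](ordering: Ordering[T], left: T, right: T): Int = {
  // Before: ordering.reverse.compare(left, right) created a new Ordering per compare.
  -ordering.compare(left, right)
}
```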

### Why are the changes needed?

Currently, we create a new reverse comparator for each compare in InterpretedOrdering, which can generate lots of small, short-lived objects and hurt the JVM when there is plenty of data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27938 from Ngone51/reverse_comparator.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 23:56:48 +08:00
Kent Yao 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern charactar
### What changes were proposed in this pull request?

The meaning of 'u' was the day number of the week in SimpleDateFormat; it was changed to mean year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' with 'e' internally and using DateTimeFormatter to parse the pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week. e.g.

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all silently become `.... eeee`. Users may be confused about their meanings, so we should mark them as illegal pattern characters to stay the same as before.

This PR moves the method `convertIncompatiblePattern` from `DateTimeUtils` to the `DateTimeFormatterHelper` object, since it is quite specific to the `DateTimeFormatterHelper` class, and adds the 'e' and 'c' character checks in this method.

Besides, `convertIncompatiblePattern` has a bug that loses the last `'` if the pattern ends with it; this PR fixes that too. e.g.

```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S

spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?

avoid vagueness
bug fix

### Does this PR introduce any user-facing change?

no, these are not  exposed yet

### How was this patch tested?

add ut

Closes #27939 from yaooqinn/SPARK-31176.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 20:19:50 +08:00
Kent Yao f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
### What changes were proposed in this pull request?

```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>
```
We currently only support datetime values as the extract source for the `extract` expression, but its alternative function `date_part` supports both datetime and interval.

This PR adds interval value support for the `extract` expression as the extract source.
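A hedged example of the new capability (using common Spark SQL interval literal syntax; exact output omitted):

```scala
// extract with an interval as the extract source, now supported:
spark.sql("SELECT extract(HOUR FROM interval 36 hours 30 minutes)").show()
// date_part already accepted intervals:
spark.sql("SELECT date_part('HOUR', interval 36 hours 30 minutes)").show()
```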

### Why are the changes needed?

For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.

### Does this PR introduce any user-facing change?

yes, in the `extract(abc from xyz)` expression, `xyz` can now be an interval

### How was this patch tested?

add unit tests

Closes #27876 from yaooqinn/SPARK-31119.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 12:29:39 +08:00
Qianyang Yu 6f0b0f1655
[SPARK-30954][ML][R] Make file name the same as class name
This PR solves the same issue as [pr27919](https://github.com/apache/spark/pull/27919), but this one changes the file names based on a comment from the previous PR.

### What changes were proposed in this pull request?

Make some of the file names the same as the class names in the R package.

### Why are the changes needed?

Make the file names consistent.

### Does this PR introduce any user-facing change?

No
### How was this patch tested?

run `./R/run-tests.sh`

Closes #27940 from kevinyu98/spark-30954-r-v2.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 16:15:02 -07:00
manuzhang 4e4e08f372
[SPARK-31047][SQL] Improve file listing for ViewFileSystem
### What changes were proposed in this pull request?
Use `listLocatedStatus` when `InMemoryFileIndex` is listing files from a `ViewFileSystem`, which should delegate the call to the underlying `DistributedFileSystem`.
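A minimal sketch of the listing call in question (illustrative, not the actual `InMemoryFileIndex` code); `FileSystem.listLocatedStatus` returns file statuses together with block locations, which `ViewFileSystem` can delegate to the mounted `DistributedFileSystem`:

```scala
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

def listLeafFiles(fs: FileSystem, dir: Path): Seq[LocatedFileStatus] = {
  val iter = fs.listLocatedStatus(dir) // one call yields statuses plus block locations
  val files = ArrayBuffer.empty[LocatedFileStatus]
  while (iter.hasNext) {
    files += iter.next()
  }
  files.toSeq
}
```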

### Why are the changes needed?
When `ViewFileSystem` is used to manage several `DistributedFileSystem` instances, the change will improve the performance of file listing, especially when there are many files.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27801 from manuzhang/spark-31047.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 14:23:28 -07:00
Holden Karau 57d27e900f
[SPARK-31125][K8S] Terminating pods have a deletion timestamp but they are not yet dead
### What changes were proposed in this pull request?

Change what we consider a deleted pod to not include "Terminating"

### Why are the changes needed?

If we get a new snapshot while a pod is in the process of being cleaned up, we shouldn't delete the executor until it is fully terminated.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This should be covered by the decommissioning tests in that they currently are flaky because we sometimes delete the executor instead of allowing it to decommission all the way.

I also ran this in a loop locally ~80 times with the only failures being the PV suite because of unrelated minikube mount issues.

Closes #27905 from holdenk/SPARK-31125-Processing-state-snapshots-incorrect.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 12:04:06 -07:00
Wenchen Fan dc5ebc2d5b
[SPARK-31171][SQL] size(null) should return null under ansi mode
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.
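A hedged example of the described behavior, assuming the ANSI flag is `spark.sql.ansi.enabled` in this version:

```scala
spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("SELECT size(null)").show() // expected to show NULL under ANSI mode
```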

### Why are the changes needed?

In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:48:54 -07:00
Adam Binford 9f27a5495d
[SPARK-30860][CORE] Use FileSystem.mkdirs to avoid umask at rolling event log folder and appStatusFile creation
### What changes were proposed in this pull request?
This pull request fixes an issue with rolling event logs. The rolling event log directory is created ignoring the dfs umask setting. This allows the history server to prune old rolling logs when run as the group owner of the event log folder.

### Why are the changes needed?
For non-rolling event logs, log files are created ignoring the umask setting by calling setPermission after creating the file. The default umask of 022 currently causes rolling log directories to be created without group write permissions, preventing the history server from pruning logs of applications not run as the same user as the history server. This adds the same behavior for rolling event logs so users don't need to worry about the umask setting causing different behavior.
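A sketch of the approach described above (illustrative, not the exact `EventLogFileWriter` code), assuming the rolling log directory should end up with 770 permissions regardless of the dfs umask:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

def createRollingLogDir(fs: FileSystem, dir: Path): Unit = {
  fs.mkdirs(dir)
  // Set the permission explicitly after creation so the umask cannot strip
  // the group-write bit that the history server needs for pruning.
  fs.setPermission(dir, new FsPermission(Integer.parseInt("770", 8).toShort))
}
```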

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually. The folder is created with the correct 770 permission. The status file is still affected by the umask setting, but that doesn't stop the folder from being deleted by the history server. I'm not sure if that causes any other issues. I'm not sure how to test something involving a Hadoop setting.

Closes #27764 from Kimahriman/bug/rolling-log-permissions.

Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:20:10 -07:00
Kent Yao 5bc0d76591 [SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir
### What changes were proposed in this pull request?

In the Spark CLI, we create a Hive `CliSessionState` and it does not load `hive-site.xml`. So the configurations in `hive-site.xml` will not take effect as they do in other Spark-Hive integration apps.

Also, the warehouse directory is not correctly picked. If the `default` database does not exist, the `CliSessionState` will create one the first time it talks to the metastore. The `Location` of the default DB will be neither the value of `spark.sql.warehouse.dir` nor the user-specified value of `hive.metastore.warehouse.dir`, but the default value of `hive.metastore.warehouse.dir`, which will always be `/user/hive/warehouse`.

### Why are the changes needed?

Fix a bug so that the Spark SQL CLI picks up the right confs.

### Does this PR introduce any user-facing change?

yes, the non-existent default database will be created in the location specified by the user via `spark.sql.warehouse.dir` or `hive.metastore.warehouse.dir`, or the default value of `spark.sql.warehouse.dir` if neither of them is specified

### How was this patch tested?

add cli ut

Closes #27933 from yaooqinn/SPARK-31170.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 23:03:18 +08:00
Kent Yao 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable-length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamps with a 0~6 digit second fraction but fails for >= 7 digits
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use the Java 8 time API to parse and format timestamp values. When we create the `DateTimeFormatter`, we use `appendPattern` to create the builder first, where the 'S..S' part will be parsed with a fixed length (= `'S..S'.length`). This fits the formatting part but is too strict for the parsing part, because the trailing zeros are very likely to be truncated.
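A rough sketch of the parsing-side idea (illustrative, not the actual Spark formatter code): build the fraction with `appendFraction`, which accepts 0 to 6 digits, instead of a fixed-width 'S..S' pattern.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

val formatter = new DateTimeFormatterBuilder()
  .appendPattern("yyyy-MM-dd HH:mm:ss")
  .appendFraction(ChronoField.NANO_OF_SECOND, 0, 6, true) // variable-length fraction
  .toFormatter()

LocalDateTime.parse("2019-10-06 10:11:12.12", formatter) // 2 fraction digits parse fine
LocalDateTime.parse("2019-10-06 10:11:12", formatter)    // and so does no fraction at all
```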

### Why are the changes needed?

improve timestamp parsing and be more compatible with 2.4.x

### Does this PR introduce any user-facing change?

no, the related changes are newly added
### How was this patch tested?

add uts

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 21:53:46 +08:00
Takeshi Yamamuro 124b4ce2e6
[MINOR][SQL] Update the DataFrameWriter.bucketBy comment
### What changes were proposed in this pull request?

This PR intends to update the `DataFrameWriter.bucketBy` comment to clearly describe that the bucketBy scheme follows a Spark-specific one.

I saw questions about the current bucketing compatibility with Hive in [SPARK-31162](https://issues.apache.org/jira/browse/SPARK-31162?focusedCommentId=17060408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17060408) and [SPARK-17495](https://issues.apache.org/jira/browse/SPARK-17495?focusedCommentId=17059847&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17059847) from users, and IMHO the comment is a bit confusing to users about the compatibility.

### Why are the changes needed?

To help users understand it clearly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27930 from maropu/UpdateBucketByComment.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 00:52:45 -07:00
Wenchen Fan 30d95356f1 [SPARK-31134][SQL] optimize skew join after shuffle partitions are coalesced
### What changes were proposed in this pull request?

Run the `OptimizeSkewedJoin` rule after the `CoalesceShufflePartitions` rule.

### Why are the changes needed?

Remove duplicated coalescing code in `OptimizeSkewedJoin`.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #27893 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-17 00:23:16 -07:00
Zhenhua Wang 1369a973cd [SPARK-31164][SQL] Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns
### What changes were proposed in this pull request?

For a bucketed table, when deciding output partitioning, if the output doesn't contain all bucket columns, the result is `UnknownPartitioning`. But when generating rdd, current Spark uses `createBucketedReadRDD` because it doesn't check if the output contains all bucket columns. So the rdd and its output partitioning are inconsistent.

### Why are the changes needed?

To fix a bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Modified existing tests.

Closes #27924 from wzhfy/inconsistent_rdd_partitioning.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Zhenhua Wang <wzh_zju@163.com>
2020-03-17 14:20:16 +08:00
Pedro Rossi ed06d98044
[SPARK-25355][K8S] Add proxy user to driver if present on spark-submit
### What changes were proposed in this pull request?

This PR adds the proxy user from the spark-submit command to the childArgs, so the proxy user can be retrieved and used in the KubernetesApplication to add the proxy user to the driver container args.
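A sketch of the propagation idea (illustrative; the helper and argument layout here are assumptions, not the exact SparkSubmit/KubernetesApplication code):

```scala
// Carry --proxy-user through to the arguments built for the driver container.
def withProxyUser(childArgs: Seq[String], proxyUser: Option[String]): Seq[String] =
  proxyUser.map(user => childArgs ++ Seq("--proxy-user", user)).getOrElse(childArgs)

// withProxyUser(Seq.empty, Some("alice")) => Seq("--proxy-user", "alice")
```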

### Why are the changes needed?

The proxy user, when used with spark-submit, doesn't work in the Kubernetes environment since the `--proxy-user` argument is not added to the driver container; when I added it manually to the Pod definition, it worked just fine.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tests were added

Closes #27422 from PedroRossi/SPARK-25355.

Authored-by: Pedro Rossi <pgrr@cin.ufpe.br>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 21:53:58 -07:00
Wenchen Fan d7b97a1d0d [SPARK-31166][SQL] UNION map<null, null> and other maps should not fail
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.

The reason is, `TypeCoercion` rules and `Cast` think it's illegal to cast null type map key to other types, as it makes the key nullable, but it's actually legal. This PR fixes it.
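A hedged example of the kind of query this fixes (`map()` now has type `map<null, null>`, so the union needs the null-type map key to be cast):

```scala
spark.sql("SELECT map() AS m UNION ALL SELECT map('a', 'b') AS m").show()
```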

### Why are the changes needed?

To avoid breaking queries.

### Does this PR introduce any user-facing change?

Yes, now some queries that work in 2.x can work in 3.0 as well.

### How was this patch tested?

new test

Closes #27926 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 12:01:29 +08:00
zhengruifeng 93088f79cc [SPARK-30776][ML][FOLLOWUP] FValue clean up
### What changes were proposed in this pull request?
remove unused variables;

### Why are the changes needed?
remove unused variables;

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27922 from zhengruifeng/test_cleanup.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-17 11:29:08 +08:00
zero323 01f20394ac [SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions
### What changes were proposed in this pull request?

- Adds following overloaded variants to Scala `o.a.s.sql.functions`:

  - `percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column`
  - `percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column`
  - `percentile_approx(e: Column, percentage: Double, accuracy: Long): Column`
  - `percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column`
  - `percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column` (primarily for
Python interop).
  - `percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column`

- Adds `percentile_approx` to `pyspark.sql.functions`.

- Adds `percentile_approx` function to SparkR.
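A usage sketch for the Scala DSL, assuming the overloads exactly as listed above:

```scala
import org.apache.spark.sql.functions.{col, percentile_approx}

// Approximate median of the id column with an accuracy of 10000,
// using the (Column, Double, Long) overload listed above.
val df = spark.range(1000).toDF("id")
df.agg(percentile_approx(col("id"), 0.5, 10000L)).show()
```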

### Why are the changes needed?

Currently we support `percentile_approx` only as a SQL expression. This is inconvenient and makes the function relatively unknown.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit tests for SparkR and PySpark.

As of now there are no additional tests for the Scala API ‒ `ApproximatePercentile` is well tested, and the Python (including docstrings) and R tests provide additional coverage, so it seems unnecessary.

Closes #27278 from zero323/SPARK-30569.

Lead-authored-by: zero323 <mszymkiewicz@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-17 10:44:21 +09:00
Nicholas Chammas b4748ca0ab [SPARK-31155] Remove pydocstyle tests
### What changes were proposed in this pull request?

As discovered here https://github.com/apache/spark/pull/27910#issuecomment-599027190, pydocstyle tests were not running anywhere (not on Jenkins; not on GitHub).

~This PR enables those tests.~

It also seems like a [large hill to climb](https://github.com/apache/spark/pull/27912#issuecomment-599167117) to enable any meaningful checks, so we're going to just rip pydocstyle out for now.

### Why are the changes needed?

Presumably, we defined those doc style tests because we care about whatever it is they enforce. Since we're not actually testing anything, though, it's better to clear the cruft.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Will check the GitHub workflow logs on this PR.

Closes #27912 from nchammas/SPARK-31155-pydocstyle.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-17 10:41:41 +09:00
yi.wu cb26f636b0
[SPARK-31163][SQL] TruncateTableCommand with acl/permission should handle non-existed path
### What changes were proposed in this pull request?

This fixes #26956.
Wrap `fs.getFileStatus(path)` within the acl/permission handling in a try-catch, in case the path doesn't exist.
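A sketch of that defensive wrapper (illustrative, not the exact `TruncateTableCommand` code):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Only capture acl/permission info when the path actually exists.
def fileStatusIfExists(fs: FileSystem, path: Path): Option[FileStatus] =
  try {
    Some(fs.getFileStatus(path))
  } catch {
    case _: FileNotFoundException => None
  }
```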

### Why are the changes needed?

`truncate table` may fail to re-create the path in case of an interruption or something else. As a result, the next time we `truncate table` on the same table with acl/permission, it will fail due to a `FileNotFoundException`. It also brings a behavior change compared to previous Spark versions, which could still `truncate table` successfully even if the path didn't exist.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #27923 from Ngone51/fix_truncate.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 11:45:25 -07:00
HyukjinKwon 6704103499
[SPARK-31146][SQL] Leverage the helper method for aliasing in built-in SQL expressions
### What changes were proposed in this pull request?

This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable.

- `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default.
- When the naming is different in its implementation, it has to be overwritten for the expression specifically. E.g., `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`,
`FormatString` and `XPathDouble`.

This PR fixes the aliases of the functions below:

| class                    | alias            |
|--------------------------|------------------|
|`Rand`                    |`random`          |
|`Ceil`                    |`ceiling`         |
|`Remainder`               |`mod`             |
|`Pow`                     |`pow`             |
|`Signum`                  |`sign`            |
|`Chr`                     |`char`            |
|`Length`                  |`char_length`     |
|`Length`                  |`character_length`|
|`FormatString`            |`printf`          |
|`Substring`               |`substr`          |
|`Upper`                   |`ucase`           |
|`XPathDouble`             |`xpath_number`    |
|`DayOfMonth`              |`day`             |
|`CurrentTimestamp`        |`now`             |
|`Size`                    |`cardinality`     |
|`Sha1`                    |`sha`             |
|`CallMethodViaReflection` |`java_method`     |

Note: the `EqualTo`, `=` and `==` aliases were excluded because this helper method can't be leveraged for them; the parser should be fixed instead.

Note: this PR also excludes some instances such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive` that needs an explicit name overwritten to make the scope of this PR smaller.

### Why are the changes needed?

To respect expression name.

### Does this PR introduce any user-facing change?

Yes, it will change the output column name.

### How was this patch tested?

Manually tested, and unittests were added.

Closes #27901 from HyukjinKwon/31146.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 11:22:34 -07:00
Huaxin Gao 3ce1dff7ba [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930

Remove ML/MLLIB DeveloperApi annotations.

### Why are the changes needed?

The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.

### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.

### How was this patch tested?
existing tests

Closes #27859 from huaxingao/spark-30930.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-16 12:41:22 -05:00
Tae-kyeom, Kim e736c62764
[SPARK-31116][SQL] Fix nested schema case-sensitivity in ParquetRowConverter
### What changes were proposed in this pull request?

This PR (SPARK-31116) adds a caseSensitive parameter to ParquetRowConverter so that it materializes Parquet data properly with respect to case sensitivity.

### Why are the changes needed?

Since Spark 3.0.0, the statement below throws an IllegalArgumentException in case-insensitive mode because of explicit field index searching in ParquetRowConverter. As we already constructed the Parquet requested schema and Catalyst requested schema during schema clipping in ParquetReadSupport, we just follow that behavior here.

```scala
val path = "/some/temp/path"

spark
  .range(1L)
  .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
  .write.parquet(path)

val caseInsensitiveSchema = new StructType()
  .add(
    "StructColumn",
    new StructType()
      .add("LowerCase", LongType)
      .add("camelcase", LongType))

spark.read.schema(caseInsensitiveSchema).parquet(path).show()
```

### Does this PR introduce any user-facing change?

No. The changes are only in unreleased branches (`master` and `branch-3.0`).

### How was this patch tested?

Passed new test cases that check parquet column selection with respect to schemas and case sensitivities

Closes #27888 from kimtkyeom/parquet_row_converter_case_sensitivity.

Authored-by: Tae-kyeom, Kim <kimtkyeom@devsisters.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 10:31:56 -07:00
jiake 21c02ee5d0 [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
### What changes were proposed in this pull request?
This PR adds the user guide for AQE and the detailed configurations for its three main features.
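
As a rough illustration of what the guide covers (spark-shell style; the configuration keys are the Spark 3.0 AQE settings as I understand them, not quoted from this PR):

```scala
// Sketch: enable AQE and its three main optimizations.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true") // use local shuffle readers when joins are converted
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed join partitions
```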

### Why are the changes needed?
Add the detailed configurations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Doc-only change; no unit tests needed.

Closes #27616 from JkSelf/aqeuserguide.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-16 23:33:56 +08:00
Maxim Gekk 57854c736c [SPARK-31076][SQL][FOLLOWUP] Encapsulate date rebasing in DaysWritable
### What changes were proposed in this pull request?
Move the code related to rebasing days from/to the Julian calendar out of `HiveInspectors` and into the new class `DaysWritable`.
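
For background, a small sketch of why the rebasing exists at all (illustration only, not the `DaysWritable` implementation): the hybrid Julian/Gregorian calendar behind Hive's `DateWritable` day counts and the proleptic Gregorian calendar used by Spark 3.0 disagree for old dates.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Day count of 1000-01-01 in the hybrid calendar (Julian rules apply for this date).
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1000, 0, 1)
val hybridDays = hybrid.getTimeInMillis / (24L * 60 * 60 * 1000)

// Day count of 1000-01-01 in the proleptic Gregorian calendar.
val prolepticDays = LocalDate.of(1000, 1, 1).toEpochDay

// The two counts differ; that difference is exactly what has to be rebased.
println(s"hybrid=$hybridDays proleptic=$prolepticDays diff=${prolepticDays - hybridDays}")
```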

### Why are the changes needed?
To improve maintainability of the `HiveInspectors` trait which is already pretty complex.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `HiveOrcHadoopFsRelationSuite`.

Closes #27890 from MaxGekk/replace-DateWritable-by-DaysWritable.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-16 17:06:15 +08:00
Wenchen Fan 50a29672e0 [SPARK-30958][SQL] do not set default era for DateTimeFormatter
### What changes were proposed in this pull request?

It's not needed at all now that we replace "y" with "u" when there is no "G": the era is either explicitly specified (e.g. "yyyy G") or can be inferred from the signed year (e.g. "uuuu").
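
For reference, a small `java.time` sketch of the "u" vs "y" distinction this relies on (illustration only, not code from this PR):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// "u" is the signed proleptic year, so negative years need no era field.
val d = LocalDate.of(-43, 3, 15) // proleptic year -43, i.e. 44 BC
println(d.format(DateTimeFormatter.ofPattern("uuuu-MM-dd", Locale.US)))   // -0043-03-15
// "y" is year-of-era, which is only unambiguous together with the era pattern "G".
println(d.format(DateTimeFormatter.ofPattern("yyyy-MM-dd G", Locale.US))) // 0044-03-15 BC
```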

### Why are the changes needed?

By default we use "uuuu" as the year pattern, which already encodes the era. If we also set a default era, the two can conflict and fail the parsing.

### Does this PR introduce any user-facing change?

Yes: Spark can now parse dates/timestamps with a negative year via the "yyyy" pattern, which is converted to "uuuu" under the hood.

### How was this patch tested?

new tests

Closes #27707 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-16 16:48:31 +09:00
Gabor Somogyi b0d2956a35
[SPARK-31135][BUILD][TESTS] Upgrade docker-client version to 8.14.1
### What changes were proposed in this pull request?
Upgrade the `docker-client` version.

### Why are the changes needed?
The `docker-client` version Spark uses is very old. Snippet from the project page:
```
Spotify no longer uses recent versions of this project internally.
The version of docker-client we're using is whatever helios has in its pom.xml. => 8.14.1
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.DB2IntegrationSuite test
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite test
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite test
```

Closes #27892 from gaborgsomogyi/docker-client.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-15 23:55:04 -07:00
LantaoJin 08bdc9c9b2 [SPARK-31068][SQL] Avoid IllegalArgumentException in broadcast exchange
### What changes were proposed in this pull request?
Fix the IllegalArgumentException in broadcast exchange when numRows is over 341 million but less than 512 million.

Since the maximum number of keys that `BytesToBytesMap` supports is 1 << 29, and only about 70% of the slots can be used before `HashedRelation` grows, the limit should be 341 million (1 << 29 / 1.5 = 357,913,941) rather than 512 million.
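
A quick check of that arithmetic (sketch only, not the actual Spark code; the 1 << 29 capacity and the 1.5 factor are taken from the description above):

```scala
val maxKeys = 1L << 29              // 536870912, BytesToBytesMap's maximum number of keys
val usable  = (maxKeys / 1.5).toLong
println(usable)                     // 357913941, i.e. about 341 "binary" million (341 * 1024 * 1024 = 357564416)
```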

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually test.

Closes #27828 from LantaoJin/SPARK-31068.

Lead-authored-by: LantaoJin <jinlantao@gmail.com>
Co-authored-by: Alan Jin <jinlantao@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-15 20:20:23 -05:00
beliefer f4cd7495f1 [SPARK-31002][CORE][DOC][FOLLOWUP] Add version information to the configuration of Core
### What changes were proposed in this pull request?
This PR follows up #27847 and https://github.com/apache/spark/pull/27852.

I sorted out the information shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.metrics.namespace | 2.1.0 | SPARK-5847 | 70f846a313061e4db6174e0dc6c12c8c806ccf78#diff-6bdad48cfc34314e89599655442ff210 |
spark.metrics.conf | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-7ea2624e832b166ca27cd4baca8691d9 |  
spark.metrics.executorMetricsSource.enabled | 3.0.0 | SPARK-27189 | 729f43f499f3dd2718c0b28d73f2ca29cc811eac#diff-6bdad48cfc34314e89599655442ff210 |  
spark.metrics.staticSources.enabled | 3.0.0 | SPARK-30060 | 60f20e5ea2000ab8f4a593b5e4217fd5637c5e22#diff-6bdad48cfc34314e89599655442ff210 |  
spark.pyspark.driver.python | 2.1.0 | SPARK-13081 | 7a9e25c38380e6c62080d62ad38a4830e44fe753#diff-6bdad48cfc34314e89599655442ff210 |  
spark.pyspark.python | 2.1.0 | SPARK-13081 | 7a9e25c38380e6c62080d62ad38a4830e44fe753#diff-6bdad48cfc34314e89599655442ff210 |  
spark.history.ui.maxApplications | 2.0.1 | SPARK-17243 | 021aa28f439443cda1bc7c5e3eee7c85b40c1a2d#diff-6bdad48cfc34314e89599655442ff210 |  
spark.io.encryption.enabled | 2.1.0 | SPARK-5682 | 4b4e329e49f8af28fa6301bd06c48d7097eaf9e6#diff-6bdad48cfc34314e89599655442ff210 |  
spark.io.encryption.keygen.algorithm | 2.1.0 | SPARK-5682 | 4b4e329e49f8af28fa6301bd06c48d7097eaf9e6#diff-6bdad48cfc34314e89599655442ff210 |  
spark.io.encryption.keySizeBits | 2.1.0 | SPARK-5682 | 4b4e329e49f8af28fa6301bd06c48d7097eaf9e6#diff-6bdad48cfc34314e89599655442ff210 |  
spark.io.encryption.commons.config.* | 2.1.0 | SPARK-5682 | 4b4e329e49 |  
spark.io.crypto.cipher.transformation | 2.1.0 | SPARK-5682 | 4b4e329e49f8af28fa6301bd06c48d7097eaf9e6#diff-6bdad48cfc34314e89599655442ff210 |  
spark.driver.host | 0.7.0 | None | 02a6761589c35f15f1a6e3b63a7964ba057d3ba6#diff-eaf125f56ce786d64dcef99cf446a751 |  
spark.driver.port | 0.7.0 | None | 02a6761589c35f15f1a6e3b63a7964ba057d3ba6#diff-eaf125f56ce786d64dcef99cf446a751 |  
spark.driver.supervise | 1.3.0 | SPARK-5388 | 6ec0cdc14390d4dc45acf31040f21e1efc476fc0#diff-4d2ab44195558d5a9d5f15b8803ef39d |  
spark.driver.bindAddress | 2.1.0 | SPARK-4563 | 2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42#diff-6bdad48cfc34314e89599655442ff210 |  
spark.blockManager.port | 1.1.0 | SPARK-2157 | 31090e43ca91f687b0bc6e25c824dc25bd7027cd#diff-2b643ea78c1add0381754b1f47eec132 |  
spark.driver.blockManager.port | 2.1.0 | SPARK-4563 | 2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42#diff-6bdad48cfc34314e89599655442ff210 |  
spark.files.ignoreCorruptFiles | 2.1.0 | SPARK-17850 | 47776e7c0c68590fe446cef910900b1aaead06f9#diff-6bdad48cfc34314e89599655442ff210 |  
spark.files.ignoreMissingFiles | 2.4.0 | SPARK-22676 | ed4101d29f50d54fd7846421e4c00e9ecd3599d0#diff-6bdad48cfc34314e89599655442ff210 |  
spark.log.callerContext | 2.2.0 | SPARK-16759 | 3af894511be6fcc17731e28b284dba432fe911f5#diff-6bdad48cfc34314e89599655442ff210 | In branch-2.2 but pom.xml is 2.1.0-SNAPSHOT
spark.files.maxPartitionBytes | 2.1.0 | SPARK-16575 | c8879bf1ee2af9ccd5d5656571d931d2fc1da024#diff-6bdad48cfc34314e89599655442ff210 |  
spark.files.openCostInBytes | 2.1.0 | SPARK-16575 | c8879bf1ee2af9ccd5d5656571d931d2fc1da024#diff-6bdad48cfc34314e89599655442ff210 |  
spark.hadoopRDD.ignoreEmptySplits | 2.3.0 | SPARK-22233 | 0fa10666cf75e3c4929940af49c8a6f6ea874759#diff-6bdad48cfc34314e89599655442ff210 |  
spark.redaction.regex | 2.1.2 | SPARK-18535 and SPARK-19720 | 444cca14d7ac8c5ab5d7e9d080b11f4d6babe3bf#diff-6bdad48cfc34314e89599655442ff210 |  
spark.redaction.string.regex | 2.2.0 | SPARK-20070 | 91fa80fe8a2480d64c430bd10f97b3d44c007bcc#diff-6bdad48cfc34314e89599655442ff210 |  
spark.authenticate.secret | 1.0.0 | SPARK-1189 | 7edbea41b43e0dc11a2de156be220db8b7952d01#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.authenticate.secretBitLength | 1.6.0 | SPARK-11073 | f8d93edec82eedab59d50aec06ca2de7e4cf14f6#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.authenticate | 1.0.0 | SPARK-1189 | 7edbea41b43e0dc11a2de156be220db8b7952d01#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.authenticate.enableSaslEncryption | 1.4.0 | SPARK-6229 | 38d4e9e446b425ca6a8fe8d8080f387b08683842#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |
spark.authenticate.secret.file | 3.0.0 | SPARK-26239 | 57d6fbfa8c803ce1791e7be36aba0219a1fcaa63#diff-6bdad48cfc34314e89599655442ff210 |  
spark.authenticate.secret.driver.file | 3.0.0 | SPARK-26239 | 57d6fbfa8c803ce1791e7be36aba0219a1fcaa63#diff-6bdad48cfc34314e89599655442ff210 |  
spark.authenticate.secret.executor.file | 3.0.0 | SPARK-26239 | 57d6fbfa8c803ce1791e7be36aba0219a1fcaa63#diff-6bdad48cfc34314e89599655442ff210 |  
spark.buffer.write.chunkSize | 2.3.0 | SPARK-21527 | 574ef6c987c636210828e96d2f797d8f10aff05e#diff-6bdad48cfc34314e89599655442ff210 |  
spark.checkpoint.compress | 2.2.0 | SPARK-19525 | 1405862382185e04b09f84af18f82f2f0295a755#diff-6bdad48cfc34314e89599655442ff210 |  
spark.rdd.checkpoint.cachePreferredLocsExpireTime | 3.0.0 | SPARK-29182 | 4ecbdbb6a7bd3908da32c82832e886b4f9f9e596#diff-6bdad48cfc34314e89599655442ff210 |
spark.shuffle.accurateBlockThreshold | 2.2.1 | SPARK-20801 | 81f63c8923416014d5c6bc227dd3c4e2a62bac8e#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.registration.timeout | 2.3.0 | SPARK-20640 | d107b3b910d8f434fb15b663a9db4c2dfe0a9f43#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.registration.maxAttempts | 2.3.0 | SPARK-20640 | d107b3b910d8f434fb15b663a9db4c2dfe0a9f43#diff-6bdad48cfc34314e89599655442ff210 |  
spark.reducer.maxBlocksInFlightPerAddress | 2.2.1 | SPARK-21243 | 88dccda393bc79dc6032f71b6acf8eb2b4b152be#diff-6bdad48cfc34314e89599655442ff210 |  
spark.network.maxRemoteBlockSizeFetchToMem | 3.0.0 | SPARK-26700 | d8613571bc1847775dd5c1945757279234cb388c#diff-6bdad48cfc34314e89599655442ff210 |
spark.taskMetrics.trackUpdatedBlockStatuses | 2.3.0 | SPARK-20923 | 5b5a69bea9de806e2c39b04b248ee82a7b664d7b#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.sort.io.plugin.class | 3.0.0 | SPARK-28209 | abef84a868e9e15f346eea315bbab0ec8ac8e389#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.file.buffer | 1.4.0 | SPARK-7081 | c53ebea9db418099df50f9adc1a18cee7849cd97#diff-ecdafc46b901740134261d2cab24ccd9 |  
spark.shuffle.unsafe.file.output.buffer | 2.3.0 | SPARK-20950 | 565e7a8d4ae7879ee704fb94ae9b3da31e202d7e#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.spill.diskWriteBufferSize | 2.3.0 | SPARK-20950 | 565e7a8d4ae7879ee704fb94ae9b3da31e202d7e#diff-6bdad48cfc34314e89599655442ff210 |  
spark.storage.unrollMemoryCheckPeriod | 2.3.0 | SPARK-21923 | a11db942aaf4c470a85f8a1b180f034f7a584254#diff-6bdad48cfc34314e89599655442ff210 |  
spark.storage.unrollMemoryGrowthFactor | 2.3.0 | SPARK-21923 | a11db942aaf4c470a85f8a1b180f034f7a584254#diff-6bdad48cfc34314e89599655442ff210 |  
spark.yarn.dist.forceDownloadSchemes | 2.3.0 | SPARK-21917 | 8319432af60b8e1dc00f08d794f7d80591e24d0c#diff-6bdad48cfc34314e89599655442ff210 |  
spark.extraListeners | 1.3.0 | SPARK-5411 | 47e4d579eb4a9aab8e0dd9c1400394d80c8d0388#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.shuffle.spill.numElementsForceSpillThreshold | 1.6.0 | SPARK-10708 | f6d06adf05afa9c5386dc2396c94e7a98730289f#diff-3eedc75de4787b842477138d8cc7f150 |  
spark.shuffle.mapOutput.parallelAggregationThreshold | 2.3.0 | SPARK-22537 | efd0036ec88bdc385f5a9ea568d2e2bbfcda2912#diff-6bdad48cfc34314e89599655442ff210 |  
spark.driver.maxResultSize | 1.2.0 | SPARK-3466 | 6181577e9935f46b646ba3925b873d031aa3d6ba#diff-d239aee594001f8391676e1047a0381e |
spark.security.credentials.renewalRatio | 2.4.0 | SPARK-23361 | 5fa438471110afbf4e2174df449ac79e292501f8#diff-6bdad48cfc34314e89599655442ff210 |  
spark.security.credentials.retryWait | 2.4.0 | SPARK-23361 | 5fa438471110afbf4e2174df449ac79e292501f8#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.sort.initialBufferSize | 2.1.0 | SPARK-15958 | bf665a958631125a1670504ef5966ef1a0e14798#diff-a1d00506391c1c4b2209f9bbff590c5b | On branch-2.1, but in pom.xml it is 2.0.0-SNAPSHOT
spark.shuffle.compress | 0.6.0 | None | efc5423210d1aadeaea78273a4a8f10425753079#diff-76170a9c8f67b542bc58240a0a12fe08 |  
spark.shuffle.spill.compress | 0.9.0 | None | c3816de5040e3c48e58ed4762d2f4eb606812938#diff-2b643ea78c1add0381754b1f47eec132 |  
spark.shuffle.mapStatus.compression.codec | 3.0.0 | SPARK-29939 | 456cfe6e4693efd26d64f089d53c4e01bf8150a2#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.spill.initialMemoryThreshold | 1.1.1 | SPARK-4480 | 16bf5f3d17624db2a96c921fe8a1e153cdafb06c#diff-31417c461d8901d8e08167b0cbc344c1 |  
spark.shuffle.spill.batchSize | 0.9.0 | None | c3816de5040e3c48e58ed4762d2f4eb606812938#diff-a470b9812a5ac8c37d732da7d9fbe39a |
spark.shuffle.sort.bypassMergeThreshold | 1.1.1 | SPARK-2787 | 0f2274f8ed6131ad17326e3fff7f7e093863b72d#diff-31417c461d8901d8e08167b0cbc344c1 |  
spark.shuffle.manager | 1.1.0 | SPARK-2044 | 508fd371d6dbb826fd8a00787d347235b549e189#diff-60df49b5d3c59f2c4540fa16a90033a1 |  
spark.shuffle.reduceLocality.enabled | 1.5.0 | SPARK-2774 | 96a7c888d806adfdb2c722025a1079ed7eaa2052#diff-6a9ff7fb74fd490a50462d45db2d5e11 |  
spark.shuffle.mapOutput.minSizeForBroadcast | 2.0.0 | SPARK-1239 | d98dd72e7baeb59eacec4fefd66397513a607b2f#diff-609c3f8c26150ca96a94cd27146a809b |  
spark.shuffle.mapOutput.dispatcher.numThreads | 2.0.0 | SPARK-1239 | d98dd72e7baeb59eacec4fefd66397513a607b2f#diff-609c3f8c26150ca96a94cd27146a809b |  
spark.shuffle.detectCorrupt | 2.2.0 | SPARK-4105 | cf33a86285629abe72c1acf235b8bfa6057220a8#diff-eb30a71e0d04150b8e0b64929852e38b |
spark.shuffle.detectCorrupt.useExtraMemory | 3.0.0 | SPARK-26089 | 688b0c01fac0db80f6473181673a89f1ce1be65b#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.sync | 0.8.0 | None | 31da065b1d08c1fad5283e4bcf8e0ed01818c03e#diff-ad46ed23fcc3fa87f30d05204917b917 |  
spark.shuffle.unsafe.fastMergeEnabled | 1.4.0 | SPARK-7081 | c53ebea9db418099df50f9adc1a18cee7849cd97#diff-642ce9f439435408382c3ac3b5c5e0a0 |  
spark.shuffle.sort.useRadixSort | 2.0.0 | SPARK-14724 | e2b5647ab92eb478b3f7b36a0ce6faf83e24c0e5#diff-3eedc75de4787b842477138d8cc7f150 |  
spark.shuffle.minNumPartitionsToHighlyCompress | 2.4.0 | SPARK-24519 | 39dfaf2fd167cafc84ec9cc637c114ed54a331e3#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.useOldFetchProtocol | 3.0.0 | SPARK-25341 | f725d472f51fb80c6ce1882ec283ff69bafb0de4#diff-6bdad48cfc34314e89599655442ff210 |  
spark.shuffle.readHostLocalDisk | 3.0.0 | SPARK-30812 | 68d7edf9497bea2f73707d32ab55dd8e53088e7c#diff-6bdad48cfc34314e89599655442ff210 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27913 from beliefer/add-version-to-core-config-part-three.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-16 10:08:07 +09:00
Prashant Sharma 3b6da36cd6 [SPARK-31120][BUILD] Support enabling maven profiles for importing via sbt on IntelliJ IDEA.

### What changes were proposed in this pull request?
Read the Maven profiles to enable from the Java system property "sbt.maven.profiles" when importing into IntelliJ IDEA via sbt.
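
A rough sketch of the mechanism (assumed, not the project's actual build code): read the comma-separated profile list from the property, falling back to the existing environment variable.

```scala
// Hypothetical build-definition snippet illustrating the lookup order described above.
val mavenProfiles: Seq[String] =
  sys.props.get("sbt.maven.profiles")
    .orElse(sys.env.get("SBT_MAVEN_PROFILES"))
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
    .getOrElse(Seq.empty)
```

In the IDEA sbt import settings the property would then be passed as a JVM option, e.g. `-Dsbt.maven.profiles=hive,kubernetes` (profile names here are only examples).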

### Why are the changes needed?
Without this change, one needs to set an OS-wide environment variable `SBT_MAVEN_PROFILES`; on macOS this is even trickier (I have not figured out what can be done there).

### Does this PR introduce any user-facing change?
None

### How was this patch tested?
Manually tested by applying multiple profiles or a single profile.
Please see the attached images to see the steps.
<img width="802" alt="Screenshot 2020-03-11 at 4 09 57 PM" src="https://user-images.githubusercontent.com/992952/76411667-46223280-63b8-11ea-9a77-dc014b66d48b.png">
<img width="867" alt="Screenshot 2020-03-11 at 4 18 09 PM" src="https://user-images.githubusercontent.com/992952/76411676-4ae6e680-63b8-11ea-895d-ed9d6cc223c5.png">

Closes #27878 from ScrapCodes/SPARK-31120/idea-load-maven-profiles.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-15 12:39:46 -05:00