Commit graph

26899 commits

Author SHA1 Message Date
gatorsmile 39a6f518cb
[SPARK-29554][SQL][FOLLOWUP] Update Auto-generated Alias Name for Version
### What changes were proposed in this pull request?
The auto-generated alias name of built-in function `version()` is `sparkversion()`. After this PR, it is updated to `version()`.

### Why are the changes needed?
Based on our auto-generated alias name convention for the built-in functions, the alias names should be consistent with the function names.

This built-in function `version` is added in the upcoming Spark 3.0. Thus, we should fix it before the release.

### Does this PR introduce any user-facing change?
Yes. Update the column name in schema if users do not specify the alias.

### How was this patch tested?
Added a test case.

Closes #28131 from gatorsmile/spark-29554followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-05 16:37:03 -07:00
HyukjinKwon 9f58f03857 [SPARK-31231][BUILD] Unset setuptools version in pip packaging test
### What changes were proposed in this pull request?

This PR unsets `setuptools` version in CI. This was fixed in the 0.46.1.2+ `setuptools` - pypa/setuptools#2046. `setuptools` 0.46.1.0 and 0.46.1.1 still have this problem.

### Why are the changes needed?

To test the latest setuptools out to see if users can actually install and use it.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Jenkins will test.

Closes #28111 from HyukjinKwon/SPARK-31231-revert.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-04 08:09:15 +09:00
yi.wu 5d76b12e9b
[SPARK-29154][FOLLOW-UP][CORE] RDD.resourceProfile should not be serialized
### What changes were proposed in this pull request?

Mark `RDD.resourceProfile` as `transient`.

### Why are the changes needed?

`RDD.resourceProfile` should only be used at driver side.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #28108 from Ngone51/spark_29154_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-03 09:41:03 -07:00
Wenchen Fan 6b1ca886c0 [SPARK-31327][SQL] Write Spark version into Avro file metadata
### What changes were proposed in this pull request?

Write Spark version into Avro file metadata

### Why are the changes needed?

The version info is very useful for backward compatibility. This is also done in parquet/orc.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #28102 from cloud-fan/avro.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-03 12:43:33 +00:00
yi.wu a4fc6a6e98 [SPARK-31249][CORE] Fix flaky CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
### What changes were proposed in this pull request?

In `CoarseGrainedSchedulerBackendSuite.RegisterExecutor`, change it to post `SparkListenerExecutorAdded` before `context.reply(true)`.

### Why are the changes needed?

To fix flaky `CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied`.

In this test, though we use `askSync` to register executor but `askSync` could be finished before we posting the 3rd `SparkListenerExecutorAdded` event to the listener bus due to the reason that `context.reply(true)` comes before `listenerBus.post`.

The error can be reproduced if you:
- loop it for 500 times in one turn
- or, insert a `Thread.sleep(1000)` between `post` and `reply`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Loop the flaky tests for 1000 times without any error.

Closes #28100 from Ngone51/fix_spark_31249.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-03 16:05:07 +09:00
Huaxin Gao 4e45c07f5d [SPARK-31326][SQL][DOCS] Create Function docs structure for SQL Reference
### What changes were proposed in this pull request?
Create Function docs structure for SQL Reference...

### Why are the changes needed?
so the Function docs can be added later, also want to get a consensus about what to document for Functions in SQL Reference.

### Does this PR introduce any user-facing change?
Yes
<img width="1050" alt="Screen Shot 2020-04-02 at 12 09 20 AM" src="https://user-images.githubusercontent.com/13592258/78220451-68b6e100-7476-11ea-9a21-733b41652785.png">

<img width="1051" alt="Screen Shot 2020-04-02 at 12 09 44 AM" src="https://user-images.githubusercontent.com/13592258/78220460-6ce2fe80-7476-11ea-887c-defefd55c19d.png">

<img width="1051" alt="Screen Shot 2020-04-02 at 12 10 05 AM" src="https://user-images.githubusercontent.com/13592258/78220463-6f455880-7476-11ea-81fc-fd4137db7c3f.png">

### How was this patch tested?
Manually build and check

Closes #28099 from huaxingao/function.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-03 14:36:03 +09:00
Maxim Gekk 820bb9985a [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
### What changes were proposed in this pull request?
1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis` which returns earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is, and set the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in Java 8 and Java 7 time APIs. So, if I would set the standard zone offsets too, this could change the behavior, and rebasing won't give the same result as Spark 2.4.
2. Fix `rebaseJulianToGregorianMicros()` by changing resulted zoned date-time if `DST_OFFSET` is zero which means the input date-time has passed an autumn daylight savings cutover. So, I take the latest local timestamp out of 2 overlapped timestamps. Otherwise I return a zoned date-time w/o any modification because it is equal to calling the `withEarlierOffsetAtOverlap()` method, so, we can optimize the case.

### Why are the changes needed?
This fixes the bug of loosing of DST offset info in rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see
<img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png">
Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets.
- Run the following code to check that there is no difference between rebased and original micros for modern timestamps:
```scala
    test("rebasing differences") {
      withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
        val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)
        val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)

        var micros = start
        var diff = Long.MaxValue
        var counter = 0
        while (micros < end) {
          val rebased = rebaseGregorianToJulianMicros(micros)
          val curDiff = rebased - micros
          if (curDiff != diff) {
            counter += 1
            diff = curDiff
            val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
            println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes")
          }
          micros += 30 * MICROS_PER_MINUTE
        }
        println(s"counter = $counter")
      }
    }
```
```
local date-time = 0001-01-01T00:00 diff = -2872 minutes
local date-time = 0100-03-01T00:00 diff = -1432 minutes
local date-time = 0200-03-01T00:00 diff = 7 minutes
local date-time = 0300-03-01T00:00 diff = 1447 minutes
local date-time = 0500-03-01T00:00 diff = 2887 minutes
local date-time = 0600-03-01T00:00 diff = 4327 minutes
local date-time = 0700-03-01T00:00 diff = 5767 minutes
local date-time = 0900-03-01T00:00 diff = 7207 minutes
local date-time = 1000-03-01T00:00 diff = 8647 minutes
local date-time = 1100-03-01T00:00 diff = 10087 minutes
local date-time = 1300-03-01T00:00 diff = 11527 minutes
local date-time = 1400-03-01T00:00 diff = 12967 minutes
local date-time = 1500-03-01T00:00 diff = 14407 minutes
local date-time = 1582-10-15T00:00 diff = 7 minutes
local date-time = 1883-11-18T12:22:58 diff = 0 minutes
counter = 15
```
The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds.
- By running the updated benchmark `DateTimeRebaseBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |

Closes #28101 from MaxGekk/fix-local-date-overlap.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-03 04:35:31 +00:00
Takeshi Yamamuro d98df7626b [SPARK-31325][SQL][WEB UI] Control a plan explain mode in the events of SQL listeners via SQLConf
### What changes were proposed in this pull request?

This PR intends to add a new SQL config for controlling a plan explain mode in the events of (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`) SQL listeners. In the current master, the output of `QueryExecution.toString` (this is equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of a SQL tab in a Spark web UI will be changed as follows;

Before this PR:
![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png)

After this PR:
![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png)

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.

### How was this patch tested?

Added unit tests.

Closes #28097 from maropu/SPARK-31325.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-04-02 21:09:16 -07:00
Wenchen Fan 2c39502e84 [SPARK-31253][SQL][FOLLOWUP] Add metrics to AQE shuffle reader
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
  7. If you want to add a new configuration, please read the guideline first for naming configurations in
     'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->
This is a followup of https://github.com/apache/spark/pull/28022, to address three issues:
1. Add an assert in `CustomShuffleReaderExec` to make sure the partitions specs are all `PartialMapperPartitionSpec` or none.
2. Do not use `lazy val` for `partitionDataSizeMetrics` and `skewedPartitionMetrics`, as they will be merged into `metrics`, and `lazy val` will be serialized.
3. mark `metrics` as `transient`, as it's only used at driver-side
4. move `FileUtils.byteCountToDisplaySize` to `logDebug`, to save some calculation if log level is above debug.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
followup improvement

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
no

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
existing tests

Closes #28103 from cloud-fan/ui.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-02 16:02:47 -07:00
Thomas Graves 55dea9be62 [SPARK-29153][CORE] Add ability to merge resource profiles within a stage with Stage Level Scheduling
### What changes were proposed in this pull request?

For the stage level scheduling feature, add the ability to optionally merged resource profiles if they were specified on multiple RDD within a stage.  There is a config to enable this feature, its off by default (spark.scheduler.resourceProfile.mergeConflicts). When the config is set to true, Spark will merge the profiles selecting the max value of each resource (cores, memory, gpu, etc).  further documentation will be added with SPARK-30322.

This also added in the ability to check if an equivalent resource profile already exists. This is so that if a user is running stages and combining the same profiles over and over again we don't get an explosion in the number of profiles.

### Why are the changes needed?

To allow users to specify resource on multiple RDD and not worry as much about if they go into the same stage and fail.

### Does this PR introduce any user-facing change?

Yes, when the config is turned on it now merges the profiles instead of errorring out.

### How was this patch tested?

Unit tests

Closes #28053 from tgravescs/SPARK-29153.

Lead-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-04-02 08:30:18 -05:00
turbofei ec28925236 [SPARK-31179] Fast fail the connection while last connection failed in fast fail time window
## What changes were proposed in this pull request?

For TransportFactory, the requests sent to the same address share a clientPool.
Specially, when the io.numConnectionPerPeer is 1, these requests would share a same client.
When this address is unreachable, the createClient operation would be still timeout.
And these requests would block each other during createClient, because there is a lock for this shared client.
It would cost connectionNum \* connectionTimeOut \* maxRetry to retry, and then fail the task.

It fact, it is expected that this task could fail in connectionTimeOut * maxRetry.

In this PR, I set a fastFail time window for the clientPool, if the last connection failed in this time window, the new connection would fast fail.

## Why are the changes needed?
It can save time for some cases.
## Does this PR introduce any user-facing change?
No.
## How was this patch tested?
Existing UT.

Closes #27943 from turboFei/SPARK-31179-fast-fail-connection.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-04-02 08:18:14 -05:00
beliefer a9260d0349 [SPARK-31315][SQL] SQLQueryTestSuite: Display the total compile time for generated java code
### What changes were proposed in this pull request?
After my investigation, `SQLQueryTestSuite` spent a lot of time compiling the generated java code.
Take `group-by.sql` as an example.
At first, I added some debug log into `SQLQueryTestSuite`.
Please reference 92b6af740c/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L402)
The execution command is as follows:
`build/sbt "~sql/test-only *SQLQueryTestSuite -- -z group-by.sql"`
The output show below:
```
00:56:06.192 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=true. run time: 20604
00:56:13.719 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY. run time: 7526
00:56:18.786 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN. run time: 5066
```
According to the log, we know.

Config | Run time(ms)
-- | --
spark.sql.codegen.wholeStage=true | 20604
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY | 7526
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN | 5066

We should display the total compile time for generated java code.

This PR will add the following to `SQLQueryTestSuite`'s output.
```
=== Metrics of Whole Codegen ===
Total compile time: 80.564516529 seconds
```

Note: At first, I wanted to use `CodegenMetrics.METRIC_COMPILATION_TIME` to do this. After many experiments, I found that `CodegenMetrics.METRIC_COMPILATION_TIME` is only effective for a single test case, and cannot play a role in the whole life cycle of `SQLQueryTestSuite`.
I checked the type of  ` CodegenMetrics.METRIC_COMPILATION_TIME` is `Histogram` and the latter preserves 1028 elements.` Histogram` is a metric which calculates the distribution of a value.

### Why are the changes needed?
Display the total compile time for generated java code.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28081 from beliefer/output-codegen-compile-time.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-02 09:13:22 +00:00
Kent Yao 1ce584f6b7 [SPARK-31321][SQL] Remove SaveMode check in v2 FileWriteBuilder
### What changes were proposed in this pull request?

The `SaveMode` is resolved before we create `FileWriteBuilder` to build `BatchWrite`.

In https://github.com/apache/spark/pull/25876, we removed save mode for DSV2 from DataFrameWriter. So that the `mode` method is never used which makes `validateInputs` fail determinately without `mode` set.

### Why are the changes needed?
rm dead code.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests.

Closes #28090 from yaooqinn/SPARK-31321.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-02 08:34:36 +00:00
beliefer 50e535c431 [SPARK-31295][DOC][FOLLOWUP] Supplement version for configuration appear in doc
### What changes were proposed in this pull request?
This PR supplements version for configuration appear in docs.
I sorted out some information show below.

**docs/sql-performance-tuning.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.coalescePartitions.enabled | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.coalescePartitions.minPartitionNum | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.coalescePartitions.initialPartitionNum | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.advisoryPartitionSizeInBytes | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewJoin.enabled | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewJoin.skewedPartitionFactor | 3.0.0 | SPARK-31037 | 46b7f1796bd0b96977ce9b473601033f397a3b18#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes | 3.0.0 | SPARK-31201 | 8d0800a0803d3c47938bddefa15328d654739bc5#diff-9a6b543db706f1a90f790783d6930a13 |  

**docs/sql-ref-ansi-compliance.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.ansi.enabled | 3.0.0 | SPARK-30125 | d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.storeAssignmentPolicy | 3.0.0 | SPARK-28730 | 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 |

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #28096 from beliefer/supplement-version-of-performance.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-02 16:01:54 +09:00
Mukul Murthy 34abbb677d [SPARK-31324][SS] Include stream ID in the termination timeout error message
### What changes were proposed in this pull request?

This PR (SPARK-31324) aims to include stream ID in the error thrown when a stream does not stop() in time.

### Why are the changes needed?

https://github.com/apache/spark/pull/26771/ added a conf to set a requested timeout for stopping a stream, after which the stop() method throws. From seeing this in a production use case with several streams running, it's helpful to include which stream failed to stop in the error message.

### Does this PR introduce any user-facing change?

If a stream times out when terminating, the error message now includes the stream ID.

Before:
`Stream Execution thread failed to stop within 2000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.`

After:
`Stream Execution thread for stream [id = 8513769d-b9d2-4902-9b36-3668bd022245, runId = 21ed8c35-9bfe-423f-853d-c022d91818bc] failed to stop within 2000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.`

### How was this patch tested?

Updated existing unit test

Closes #28095 from mukulmurthy/31324-id.

Authored-by: Mukul Murthy <mukul.murthy@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-02 12:37:58 +09:00
Wenchen Fan 09f036a14c
[SPARK-31322][SQL] rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries
### What changes were proposed in this pull request?

rename `QueryPlan.collectInPlanAndSubqueries` to `collectWithSubqueries`

### Why are the changes needed?

The old name is too verbose. `QueryPlan` is internal but it's the core of catalyst and we'd better make the API name clearer before we release it.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #28092 from cloud-fan/rename.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-01 12:04:40 -07:00
Kousuke Saruta b9b1b549af
[SPARK-31073][DOC][FOLLOWUP] Add description for Shuffle Write Time metric in StagePage to web-ui.md
### What changes were proposed in this pull request?

This PR adds description for `Shuffle Write Time` to `web-ui.md`.

### Why are the changes needed?

#27837 added `Shuffle Write Time` metric to task metrics summary but it's not documented yet.

### Does this PR introduce any user-facing change?

Yes.
We can see the description for `Shuffle Write Time` in the new `web-ui.html`.
<img width="956" alt="shuffle-write-time-description" src="https://user-images.githubusercontent.com/4736016/78175342-a9722280-7495-11ea-9cc6-62c6f3619aa3.png">

### How was this patch tested?

Built docs by `SKIP_API=1 jekyll build` in `doc` directory and then confirmed `web-ui.html`.

Closes #28093 from sarutak/SPARK-31073-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-01 12:03:41 -07:00
ulysses 2c0e15e1d0
[SPARK-31285][CORE] uppercase schedule mode string at config
### What changes were proposed in this pull request?

In `TaskSchedulerImpl`, Spark will upper schedule mode `SchedulingMode.withName(schedulingModeConf.toUpperCase(Locale.ROOT))`.
But at other place, Spark does not. Such as [AllJobsPage](5945d46c11/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala (L304)).
We should have the same behavior and uppercase schema mode string at config.

### Why are the changes needed?

Before this pr, it's ok to set `spark.scheduler.mode=fair` logically.
But Spark will throw warn log
```
java.util.NoSuchElementException: No value found for 'fair'
	at scala.Enumeration.withName(Enumeration.scala:124)
	at org.apache.spark.ui.jobs.AllJobsPage$$anonfun$22.apply(AllJobsPage.scala:314)
	at org.apache.spark.ui.jobs.AllJobsPage$$anonfun$22.apply(AllJobsPage.scala:314)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:314)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:90)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:90)
	at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
```

### Does this PR introduce any user-facing change?

Almost no.

### How was this patch tested?

Exists Test.

Closes #28049 from ulysses-you/SPARK-31285.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-01 11:46:41 -07:00
Wenchen Fan 783852cc2e [SPARK-31320] Fix release script for 3.0.0
### What changes were proposed in this pull request?

The release script stops working after d5865493ae , as we require `mkdocs 1.0.0`.

This PR upgrades `mkdocs` from 0.1.6.3 to 1.0.4. To do that ruby is also upgraded to 2.5.

This PR also fixes some small issues.

### Why are the changes needed?

to make RC

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

tested by 3.0.0-rc1

Closes #28088 from cloud-fan/rc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-01 16:45:28 +09:00
Max Gekk 91af87d34e [SPARK-31311][SQL][TESTS] Benchmark date-time rebasing in ORC datasource
### What changes were proposed in this pull request?
In the PR, I propose to add new benchmarks to `DateTimeRebaseBenchmark` for saving and loading dates/timestamps to/from ORC files. I extracted common code from the benchmark for Parquet datasource and place it to the methods `caseName()` and `getPath()`. Added benchmarks for ORC save/load dates before and after 1582-10-15 because an implementation may have different performance for dates before the Julian calendar cutover day, see #28067 as an example.

### Why are the changes needed?
To have the base line for future optimizations of `fromJavaDate()`/`toJavaDate()` and `toJavaTimestamp()`/`fromJavaTimestamp()` in `DateTimeUtils`. The methods are used while saving/loading dates/timestamps by ORC datasource.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the updated benchmark `DateTimeRebaseBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |

Closes #28076 from MaxGekk/rebase-benchmark-orc.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-01 07:02:26 +00:00
Maxim Gekk c5323d2e8d [SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
### What changes were proposed in this pull request?
In the PR, I propose to replace the following SQL configs:
1.  `spark.sql.legacy.parquet.rebaseDateTime.enabled` by
    - `spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled` (`false` by default). The config enables rebasing dates/timestamps while saving to Parquet files. If it is set to `true`, dates/timestamps are converted to local date-time in Proleptic Gregorian calendar, date-time fields are extracted, and used in building new local date-time in the hybrid calendar (Julian + Gregorian). The resulted local date-time is converted to days or microseconds since the epoch.
    - `spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled` (`false` by default). The config enables rebasing of dates/timestamps in reading from Parquet files.
2. `spark.sql.legacy.avro.rebaseDateTime.enabled` by
    - `spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled` (`false` by default). It enables dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar via local date/timestamps.
    - `spark.sql.legacy.avro.rebaseDateTimeInRead.enabled` (`false` by default).  It enables rebasing dates/timestamps from the hybrid calendar to Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulted date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

### Why are the changes needed?
This allows to load dates/timestamps saved by Spark 2.4, and save to Parquet/Avro files without rebasing. And the reverse use case - load data saved by Spark 3.0, and save it in the form which is compatible with Spark 2.4.

### Does this PR introduce any user-facing change?
Yes, users have to use new SQL configs. Old SQL configs are removed by the PR.

### How was this patch tested?
By existing test suites `AvroV1Suite`, `AvroV2Suite` and `ParquetIOSuite`.

Closes #28082 from MaxGekk/split-rebase-configs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-01 04:56:05 +00:00
Dongjoon Hyun dba525c997 [SPARK-31313][K8S][TEST] Add m01 node name to support Minikube 1.8.x
### What changes were proposed in this pull request?

This PR aims to add `m01` as a node name additionally to `PVTestsSuite`.

### Why are the changes needed?

minikube 1.8.0 ~ 1.8.2 generate a cluster with a nodename `m01` while all the other versions have `minikube`. This causes `PVTestSuite` failure.
```
$ minikube --vm-driver=hyperkit start --memory 6000 --cpus 8
* minikube v1.8.2 on Darwin 10.15.3
  - MINIKUBE_ACTIVE_DOCKERD=minikube
* Using the hyperkit driver based on user configuration
* Creating hyperkit VM (CPUs=8, Memory=6000MB, Disk=20000MB) ...
* Preparing Kubernetes v1.18.0 on Docker 19.03.6 ...
* Launching Kubernetes ...
* Enabling addons: default-storageclass, storage-provisioner
* Waiting for cluster to come online ...
* Done! kubectl is now configured to use "minikube"

$ kubectl get nodes
NAME   STATUS   ROLES    AGE   VERSION
m01    Ready    master   22s   v1.17.3
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This only adds a new node name. So, K8S Jenkins job should passed.
In addition, `K8s` integration test suite should be tested on `minikube 1.8.2` manually.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 10 minutes, 23 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 20, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

For the above test, Minikube 1.8.2 and K8s v1.18.0 is used.
```
$ minikube version
minikube version: v1.8.2
commit: eb13446e786c9ef70cb0a9f85a633194e62396a1

$ kubectl version --short
Client Version: v1.18.0
Server Version: v1.18.0
```

Closes #28080 from dongjoon-hyun/SPARK-31313.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-04-01 03:42:26 +00:00
Huaxin Gao fd0b228127 [SPARK-31290][R] Add back the deprecated R APIs
### What changes were proposed in this pull request?
Add back the deprecated R APIs removed by https://github.com/apache/spark/pull/22843/ and https://github.com/apache/spark/pull/22815.

These APIs are

- `sparkR.init`
- `sparkRSQL.init`
- `sparkRHive.init`
- `registerTempTable`
- `createExternalTable`
- `dropTempTable`

No need to port the function such as
```r
createExternalTable <- function(x, ...) {
  dispatchFunc("createExternalTable(tableName, path = NULL, source = NULL, ...)", x, ...)
}
```
because this was for the backward compatibility when SQLContext exists before assuming from https://github.com/apache/spark/pull/9192,  but seems we don't need it anymore since SparkR replaced SQLContext with Spark Session at https://github.com/apache/spark/pull/13635.

### Why are the changes needed?
Amend Spark's Semantic Versioning Policy

### Does this PR introduce any user-facing change?
Yes
The removed R APIs are put back.

### How was this patch tested?
Add back the removed tests

Closes #28058 from huaxingao/r.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-01 10:38:03 +09:00
Liang-Chi Hsieh 20fc6fa839
[SPARK-31308][PYSPARK] Merging pyFiles to files argument for Non-PySpark applications
### What changes were proposed in this pull request?

This PR (SPARK-31308) proposed to add python dependencies even it is not Python applications.

### Why are the changes needed?

For now, we add `pyFiles` argument to `files` argument only for Python applications, in SparkSubmit. Like the reason in #21420, "for some Spark applications, though they're a java program, they require not only jar dependencies, but also python dependencies.", we need to add `pyFiles` to `files` even it is not Python applications.

### Does this PR introduce any user-facing change?

Yes. After this change, for non-PySpark applications, the Python files specified by `pyFiles` are also added to `files` like PySpark applications.

### How was this patch tested?

Manually test on jupyter notebook or do `spark-submit` with `--verbose`.

```
Spark config:
...
(spark.files,file:/Users/dongjoon/PRS/SPARK-PR-28077/a.py)
(spark.submit.deployMode,client)
(spark.master,local[*])
```

Closes #28077 from viirya/pyfile.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-31 18:08:55 -07:00
yi.wu 5ec1814e22
[SPARK-31248][CORE][TEST] Fix flaky ExecutorAllocationManagerSuite.interleaving add and remove
### What changes were proposed in this pull request?

This PR (SPARK-31248) uses `ManualClock` to disable `ExecutorAllocationManager.schedule()`  in order to avoid unexpected update of target executors.

### Why are the changes needed?

`ExecutorAllocationManager` will call `schedule` periodically, which may update target executors before we checking 496f6ac860/core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala (L864)

And fail the check:

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 12 did not equal 8
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
	at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$51(ExecutorAllocationManagerSuite.scala:864)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
	at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Update test.

Closes #28084 from Ngone51/spark_31248.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-31 17:26:58 -07:00
Huaxin Gao 1a7f9649b6 [SPARK-31305][SQL][DOCS] Add a page to list all commands in SQL Reference
### What changes were proposed in this pull request?
Add a page to list all commands in SQL Reference...

### Why are the changes needed?
so it's easier for user to find a specific command.

### Does this PR introduce any user-facing change?
before:
![image](https://user-images.githubusercontent.com/13592258/77938658-ec03e700-726a-11ea-983c-7a559cc0aae2.png)

after:
![image](https://user-images.githubusercontent.com/13592258/77937899-d3df9800-7269-11ea-85db-749a9521576a.png)

![image](https://user-images.githubusercontent.com/13592258/77937924-db9f3c80-7269-11ea-9441-7603feee421c.png)

Also move ```use database``` from query category to ddl category.

### How was this patch tested?
Manually build and check

Closes #28074 from huaxingao/list-all.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-01 08:42:15 +09:00
Qianyang Yu e65c21e093 [SPARK-31304][ML][EXAMPLES] Add examples for ml.stat.ANOVATest
### What changes were proposed in this pull request?

Add ANOVATest example for ml.stat.ANOVATest in python/java/scala

### Why are the changes needed?

Improve ML example

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

manually run the example

Closes #28073 from kevinyu98/add-ANOVA-example.

Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-03-31 16:33:26 -05:00
Wenchen Fan 34c7ec8e0c [SPARK-31253][SQL] Add metrics to AQE shuffle reader
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
  7. If you want to add a new configuration, please read the guideline first for naming configurations in
     'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->
Add SQL metrics to the AQE shuffle reader (`CustomShuffleReaderExec`)

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
to be more UI friendly

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
No

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
new test

Closes #28022 from cloud-fan/metrics.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-03-31 13:03:52 -07:00
yi.wu 590b9a0132 [SPARK-31010][SQL][FOLLOW-UP] Add Java UDF suggestion in error message of untyped Scala UDF
### What changes were proposed in this pull request?

Added Java UDF suggestion in the in error message of untyped Scala UDF.

### Why are the changes needed?

To help user migrate their use case from deprecate untyped Scala UDF to other supported UDF.

### Does this PR introduce any user-facing change?

No. It haven't been released.

### How was this patch tested?

Pass Jenkins.

Closes #28070 from Ngone51/spark_31010.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 17:35:26 +00:00
Jungtaek Lim (HeartSaVioR) 2a6aa8e87b [SPARK-31312][SQL] Cache Class instance for the UDF instance in HiveFunctionWrapper
### What changes were proposed in this pull request?

This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now.

It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance.

This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests.

Credit to cloud-fan as he discovered the problem and proposed the solution.

### Why are the changes needed?

Above section describes why it's a bug and how it's fixed.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UTs added.

Closes #28079 from HeartSaVioR/SPARK-31312.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 16:17:26 +00:00
Wenchen Fan 8b01473e8b [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2)
### What changes were proposed in this pull request?

Create statement plans in `DataFrameWriter(V2)`, like the SQL API.

### Why are the changes needed?

It's better to leave all the resolution work to the analyzer.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27992 from cloud-fan/statement.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 23:19:46 +08:00
Yuanjian Li 07c50784d3 [SPARK-31314][CORE] Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
### What changes were proposed in this pull request?
This reverts commit 8cf76f8d61. #25962

### Why are the changes needed?
In SPARK-29285, we change to create shuffle temporary eagerly. This is helpful for not to fail the entire task in the scenario of occasional disk failure. But for the applications that many tasks don't actually create shuffle files, it caused overhead. See the below benchmark:
Env: Spark local-cluster[2, 4, 19968], each queries run 5 round, each round 5 times.
Data: TPC-DS scale=99 generate by spark-tpcds-datagen
Results:
|     | Base                                                                                        | Revert                                                                                      |
|-----|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| Q20 | Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579)  Median 2.722007606  | Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 2.224627274)  Median 2.586498463 |
| Q33 | Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818)  Median 4.568787136 | Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 3.783188024)  Median 4.082311276  |
| Q52 | Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664)  Median 3.225437871 | Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 2.606163423)  Median 3.196025108 |
| Q56 | Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227)  Median 4.609965579 | Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 3.657525982)  Median 4.195202502 |

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #28072 from xuanyuanking/SPARK-29285-revert.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 19:01:08 +08:00
Maxim Gekk bb0b416f0b [SPARK-31297][SQL] Speed up dates rebasing
### What changes were proposed in this pull request?
In the PR, I propose to replace current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` by new one which is based on the fact that difference between Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars was changed only 14 times for entire supported range of valid dates `[0001-01-01, 9999-12-31]`:

| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff|
| ---- | ----|----|----|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|

For the given days since the epoch, the proposed implementation finds the range of days which the input days belongs to, and adds the diff in days between calendars to the input. The result is rebased days since the epoch in the target calendar.

For example, if need to rebase -650000 days from Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls to the bucket [-682944, -646420), the diff associated with the range is -1. To get the rebased days in Julian calendar, we should add -1 to -650000, and the result is -650001.

### Why are the changes needed?
To make dates rebasing faster.

### Does this PR introduce any user-facing change?
No, the results should be the same for valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.

### How was this patch tested?
- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that results of old and new implementation (optimized version) are the same for all supported dates.
- Re-run `DateTimeRebaseBenchmark` on:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28067 from MaxGekk/optimize-rebasing.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 17:38:47 +08:00
HyukjinKwon 4d4c3e76f6 Revert "[SPARK-30879][DOCS] Refine workflow for building docs"
This reverts commit 7892f88f84.
2020-03-31 16:11:59 +09:00
Ben Ryves fa37856710 [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound
### What changes were proposed in this pull request?
A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`.

### Why are the changes needed?
`rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](a1dbcd13a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala (L71))). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0).

### Does this PR introduce any user-facing change?
Only documentation changes.

### How was this patch tested?
Documentation changes only.

Closes #28071 from Smeb/master.

Authored-by: Ben Ryves <benjamin.ryves@getyourguide.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 15:16:17 +09:00
beliefer 47c810f8ae [SPARK-31279][SQL][DOC] Add version information to the configuration of Hive
### What changes were proposed in this pull request?
Add version information to the configuration of `Hive`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.hive.metastore.version | 1.4.0 | SPARK-6908 | 05454fd8aef75b129cbbd0288f5089c5259f4a15#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.version | 1.1.1 | SPARK-3971 | 64945f868443fbc59cb34b34c16d782dda0fb63d#diff-12fa2178364a810b3262b30d8d48aa2d |  
spark.sql.hive.metastore.jars | 1.4.0 | SPARK-6908 | 05454fd8aef75b129cbbd0288f5089c5259f4a15#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreParquet | 1.1.1 | SPARK-2406 | cc4015d2fa3785b92e6ab079b3abcf17627f7c56#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreParquet.mergeSchema | 1.3.1 | SPARK-6575 | 778c87686af0c04df9dfe144b8f744f271a988ad#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertMetastoreOrc | 2.0.0 | SPARK-14070 | 1e886159849e3918445d3fdc3c4cef86c6c1a236#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.convertInsertingPartitionedTable | 3.0.0 | SPARK-28573 | d5688dc732890923c326f272b0c18c329a69459a#diff-842e3447fc453de26c706db1cac8f2c4 |  
spark.sql.hive.convertMetastoreCtas | 3.0.0 | SPARK-25271 | 5ad03607d1487e7ab3e3b6d00eef9c4028ed4975#diff-842e3447fc453de26c706db1cac8f2c4 |  
spark.sql.hive.metastore.sharedPrefixes | 1.4.0 | SPARK-7491 | a8556086d33cb993fab0ae2751e31455e6c664ab#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.metastore.barrierPrefixes | 1.4.0 | SPARK-7491 | a8556086d33cb993fab0ae2751e31455e6c664ab#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.hive.thriftServer.async | 1.5.0 | SPARK-6964 | eb19d3f75cbd002f7e72ce02017a8de67f562792#diff-ff50aea397a607b79df9bec6f2a841db |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Exists UT

Closes #28042 from beliefer/add-version-to-hive-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:35:01 +09:00
beliefer 4fc8ee74fc [SPARK-31295][DOC] Supplement version for configuration appear in doc
### What changes were proposed in this pull request?
This PR supplements version for configuration appear in docs.
I sorted out some information show below.

**docs/spark-standalone.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.deploy.retainedApplications | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.deploy.retainedDrivers | 1.1.0 | None | 7446f5ff93142d2dd5c79c63fa947f47a1d4db8b#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.deploy.spreadOut | 0.6.1 | None | bb2b9ff37cd2503cc6ea82c5dd395187b0910af0#diff-0e7ae91819fc8f7b47b0f97be7116325 |  
spark.deploy.defaultCores | 0.9.0 | None | d8bcc8e9a095c1b20dd7a17b6535800d39bff80e#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.deploy.maxExecutorRetries | 1.6.3 | SPARK-16956 | ace458f0330f22463ecf7cbee7c0465e10fba8a8#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.worker.resource.{resourceName}.amount | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |  
spark.worker.resource.{resourceName}.discoveryScript | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |  
spark.worker.resourcesFile | 3.0.0 | SPARK-27369 | 7cbe01e8efc3f6cd3a0cac4bcfadea8fcc74a955#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |  
spark.shuffle.service.db.enabled | 3.0.0 | SPARK-26288 | 8b0aa59218c209d39cbba5959302d8668b885cf6#diff-6bdad48cfc34314e89599655442ff210 |  
spark.storage.cleanupFilesAfterExecutorExit | 2.4.0 | SPARK-24340 | 8ef167a5f9ba8a79bb7ca98a9844fe9cfcfea060#diff-916ca56b663f178f302c265b7ef38499 |  
spark.deploy.recoveryMode | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |  
spark.deploy.recoveryDirectory | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |  

**docs/sql-data-sources-avro.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  

**docs/sql-data-sources-orc.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13 |  

**docs/sql-data-sources-parquet.md**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.hive.convertMetastoreParquet | 1.1.1 | SPARK-2406 | cc4015d2fa3785b92e6ab079b3abcf17627f7c56#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #28064 from beliefer/supplement-doc-for-data-sources.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:33:46 +09:00
beliefer fc5d67fe22 [SPARK-31282][DOC] Supplement version for configuration appear in security doc
### What changes were proposed in this pull request?
This PR supplements version for configuration appear in security doc.
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.network.crypto.keyLength | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.network.crypto.keyFactoryAlgorithm | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.network.crypto.config.* | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.network.crypto.saslFallback | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.authenticate.enableSaslEncryption | 2.2.0 | SPARK-19139 | 8f3f73abc1fe62496722476460c174af0250e3fe#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.network.sasl.serverAlwaysEncrypt | 1.4.0 | SPARK-6229 | 38d4e9e446b425ca6a8fe8d8080f387b08683842#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.ui.filters | 1.0.0 | SPARK-1189 | 7edbea41b43e0dc11a2de156be220db8b7952d01#diff-f79a5ead735b3d0b34b6b94486918e1c |  
spark.acls.enable | 1.1.0 | SPARK-1890 and SPARK-1891 | e3fe6571decfdc406ec6d505fd92f9f2b85a618c#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.ui.view.acls | 1.0.0 | SPARK-1189 | 7edbea41b43e0dc11a2de156be220db8b7952d01#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.ui.view.acls.groups | 2.0.0 | SPARK-4224 | ae79032dcf160796851ca29116cca146c4d86ada#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.admin.acls | 1.1.0 | SPARK-1890 and SPARK-1891 | e3fe6571decfdc406ec6d505fd92f9f2b85a618c#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.admin.acls.groups | 2.0.0 | SPARK-4224 | ae79032dcf160796851ca29116cca146c4d86ada#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.modify.acls | 1.1.0 | SPARK-1890 and SPARK-1891 | e3fe6571decfdc406ec6d505fd92f9f2b85a618c#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.modify.acls.groups | 2.0.0 | SPARK-4224 | ae79032dcf160796851ca29116cca146c4d86ada#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.user.groups.mapping | 2.0.0 | SPARK-4224 | ae79032dcf160796851ca29116cca146c4d86ada#diff-afd88f677ec5ff8b5e96a5cbbe00cd98 |  
spark.history.ui.acls.enable | 1.0.1 | Spark 1489 | c8dd13221215275948b1a6913192d40e0c8cbadd#diff-b49b5b9c31ddb36a9061004b5b723058 |  
spark.history.ui.admin.acls | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.history.ui.admin.acls.groups | 2.1.1 | SPARK-19033 | 4ca1788805e4a0131ba8f0ccb7499ee0e0242837#diff-a7befb99e7bd7e3ab5c46c2568aa5b3e |  
spark.ui.xXssProtection | 2.3.0 | SPARK-22188 | 5a07aca4d464e96d75ea17bf6768e24b829872ec#diff-6bdad48cfc34314e89599655442ff210 |  
spark.ui.xContentTypeOptions.enabled | 2.3.0 | SPARK-22188 | 5a07aca4d464e96d75ea17bf6768e24b829872ec#diff-6bdad48cfc34314e89599655442ff210 |  
spark.ui.strictTransportSecurity | 2.3.0 | SPARK-22188 | 5a07aca4d464e96d75ea17bf6768e24b829872ec#diff-6bdad48cfc34314e89599655442ff210 |  
spark.security.credentials.${service}.enabled | 2.3.0 | SPARK-20434 | a18d637112b97d2caaca0a8324bdd99086664b24#diff-da6c1fd6d8b0c7538a3e77a09e06a083 |  
spark.kerberos.access.hadoopFileSystems | 3.0.0 | SPARK-26766 | d0443a74d185ec72b747fa39994fa9a40ce974cf#diff-6bdad48cfc34314e89599655442ff210 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #28044 from beliefer/supplement-version-to-security-doc.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:33:01 +09:00
beliefer 18b73a5b59 [SPARK-31269][DOC] Supplement version for configuration only appear in configuration doc
### What changes were proposed in this pull request?
The `configuration.md` exists some config not organized by `ConfigEntry`.
This PR supplements version for configuration only appear in configuration doc.
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.app.name | 0.9.0 | None | 994f080f8ae3372366e6004600ba791c8a372ff0#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.driver.resource.{resourceName}.amount | 3.0.0 | SPARK-27760 | d30284b5a51dd784f663eb4eea37087b35a54d00#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.driver.resource.{resourceName}.discoveryScript | 3.0.0 | SPARK-27488 | 74e5e41eebf9ed596b48e6db52a2a9c642e5cbc3#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.driver.resource.{resourceName}.vendor | 3.0.0 | SPARK-27362 | 1277f8fa92da85d9e39d9146e3099fcb75c71a8f#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.executor.resource.{resourceName}.amount | 3.0.0 | SPARK-27760 | d30284b5a51dd784f663eb4eea37087b35a54d00#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.executor.resource.{resourceType}.discoveryScript | 3.0.0 | SPARK-27024 | db2e3c43412e4a7fb4a46c58d73d9ab304a1e949#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.executor.resource.{resourceName}.vendor | 3.0.0 | SPARK-27362 | 1277f8fa92da85d9e39d9146e3099fcb75c71a8f#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.local.dir | 0.5.0 | None | 0e93891d3d7df849cff6442038c111ffd42a5243#diff-17fd275d280b667722664ed833c6402a |  
spark.logConf | 0.9.0 | None | d8bcc8e9a095c1b20dd7a17b6535800d39bff80e#diff-364713d7776956cb8b0a771e9b62f82d |  
spark.master | 0.9.0 | SPARK-544 | 2573add94cf920a88f74d80d8ea94218d812704d#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.driver.defaultJavaOptions | 3.0.0 | SPARK-23472 | f83000597f250868de9722d8285fed013abc5ecf#diff-a78ecfc6a89edfaf0b60a5eaa0381970 |  
spark.executor.defaultJavaOptions | 3.0.0 | SPARK-23472 | f83000597f250868de9722d8285fed013abc5ecf#diff-a78ecfc6a89edfaf0b60a5eaa0381970 |  
spark.executorEnv.[EnvironmentVariableName] | 0.9.0 | None | 642029e7f43322f84abe4f7f36bb0b1b95d8101d#diff-529fc5c06b9731c1fbda6f3db60b16aa |  
spark.python.profile | 1.2.0 | SPARK-3478 | 1aa549ba9839565274a12c52fa1075b424f138a6#diff-d6fe2792e44f6babc94aabfefc8b9bce |  
spark.python.profile.dump | 1.2.0 | SPARK-3478 | 1aa549ba9839565274a12c52fa1075b424f138a6#diff-d6fe2792e44f6babc94aabfefc8b9bce |  
spark.python.worker.memory | 1.1.0 | SPARK-2538 | 14174abd421318e71c16edd24224fd5094bdfed4#diff-d6fe2792e44f6babc94aabfefc8b9bce |  
spark.jars.packages | 1.5.0 | SPARK-9263 | 34335719a372c1951fdb4dd25b75b086faf1076f#diff-63a5d817d2d45ae24de577f6a1bd80f9 |  
spark.jars.excludes | 1.5.0 | SPARK-9263 | 34335719a372c1951fdb4dd25b75b086faf1076f#diff-63a5d817d2d45ae24de577f6a1bd80f9 |  
spark.jars.ivy | 1.3.0 | SPARK-5341 | 3b7acd22ab4a134c74746e3b9a803dbd34d43855#diff-63a5d817d2d45ae24de577f6a1bd80f9 |  
spark.jars.ivySettings | 2.2.0 | SPARK-17568 | 3bc2eff8880a3ba8d4318118715ea1a47048e3de#diff-4d2ab44195558d5a9d5f15b8803ef39d |  
spark.jars.repositories | 2.3.0 | SPARK-21403 | d8257b99ddae23f702f312640a5335ddb4554403#diff-4d2ab44195558d5a9d5f15b8803ef39d |
spark.shuffle.io.maxRetries | 1.2.0 | SPARK-4188 | c1ea5c542f3267c0b23a7775887e3a6ece793fe3#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.shuffle.io.numConnectionsPerPeer | 1.2.1 | SPARK-4740 | 441ec3451730c7ae3dbef8952e313071d6147ab6#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.shuffle.io.preferDirectBufs | 1.2.0 | SPARK-4188 | c1ea5c542f3267c0b23a7775887e3a6ece793fe3#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.shuffle.io.retryWait | 1.2.1 | None | 5e5d8f469a1bea9bbe606f772ccdcab7c184c651#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.shuffle.io.backLog | 1.1.1 | SPARK-2468 | 66b4c81db7e826c00f7fb449b8a8af810cf7dd9a#diff-bdee8e601924d41e93baa7287189e878 |  
spark.shuffle.service.index.cache.size | 2.3.0 | SPARK-21501 | 1662e93119d68498942386906de309d35f4a135f#diff-97d5edc927a83a678e013ae00343df94 |
spark.shuffle.maxChunksBeingTransferred | 2.3.0 | SPARK-21175 | 799e13161e89f1ea96cb1bc7b507a05af2e89cd0#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.sql.ui.retainedExecutions | 1.5.0 | SPARK-8861 and SPARK-8862 | ebc3aad272b91cf58e2e1b4aa92b49b8a947a045#diff-81764e4d52817f83bdd5336ef1226bd9 |  
spark.streaming.ui.retainedBatches | 1.0.0 | SPARK-1386 | f36dc3fed0a0671b0712d664db859da28c0a98e2#diff-56b8d67d07284cfab165d5363bd3500e |
spark.default.parallelism | 0.5.0 | None | e5c4cd8a5e188592f8786a265c0cd073c69ac886#diff-0544ebf7533fa70ff5103e0fe1f0b036 |  
spark.files.fetchTimeout | 1.0.0 | None | f6f9d02e85d17da2f742ed0062f1648a9293e73c#diff-d239aee594001f8391676e1047a0381e |  
spark.files.useFetchCache | 1.2.2 | SPARK-6313 | a2a94a154bdd00753b8d5e344d712664c7151050#diff-d239aee594001f8391676e1047a0381e |
spark.files.overwrite | 1.0.0 | None | 84670f2715392859624df290c1b52eb4ed4a9cb1#diff-d239aee594001f8391676e1047a0381e | Exists in branch-1.0, but the version of pom is 0.9.0-incubating-SNAPSHOT
spark.hadoop.cloneConf | 1.0.3 | SPARK-2546 | 6d8f1dd15afdc7432b5721c89f9b2b402460322b#diff-83eb37f7b0ebed3c14ccb7bff0d577c2 |  
spark.hadoop.validateOutputSpecs | 1.0.1 | SPARK-1677 | 8100cbdb7546e8438019443cfc00683017c81278#diff-f70e97c099b5eac05c75288cb215e080 |
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version | 2.2.0 | SPARK-20107 | edc87d76efea7b4d19d9d0c4ddba274a3ccb8752#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.rpc.io.backLog | 3.0.0 | SPARK-27868 | 09ed64d795d3199a94e175273fff6fcea6b52131#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.network.io.preferDirectBufs | 3.0.0 | SPARK-24920 | e103c4a5e72bab8862ff49d6d4c1e62e642fc412#diff-0ac65da2bc6b083fb861fe410c7688c2 |  
spark.port.maxRetries | 1.1.1 | SPARK-3565 | 32f2222e915f31422089139944a077e2cbd442f9#diff-d239aee594001f8391676e1047a0381e |  
spark.core.connection.ack.wait.timeout | 1.1.1 | SPARK-2677 | bd3ce2ffb8964abb4d59918ebb2c230fe4614aa2#diff-f748e95f2aa97ed715afa53ddeeac9de |  
spark.scheduler.listenerbus.eventqueue.shared.capacity | 3.0.0 | SPARK-28574 | c212c9d9ed7375cd1ea16c118733edd84037ec0d#diff-eb519ad78cc3cf0b95839cc37413b509 |  
spark.scheduler.listenerbus.eventqueue.appStatus.capacity | 3.0.0 | SPARK-28574 | c212c9d9ed7375cd1ea16c118733edd84037ec0d#diff-eb519ad78cc3cf0b95839cc37413b509 |  
spark.scheduler.listenerbus.eventqueue.executorManagement.capacity | 3.0.0 | SPARK-28574 | c212c9d9ed7375cd1ea16c118733edd84037ec0d#diff-eb519ad78cc3cf0b95839cc37413b509 |  
spark.scheduler.listenerbus.eventqueue.eventLog.capacity | 3.0.0 | SPARK-28574 | c212c9d9ed7375cd1ea16c118733edd84037ec0d#diff-eb519ad78cc3cf0b95839cc37413b509 |  
spark.scheduler.listenerbus.eventqueue.streams.capacity | 3.0.0 | SPARK-28574 | c212c9d9ed7375cd1ea16c118733edd84037ec0d#diff-eb519ad78cc3cf0b95839cc37413b509 |  
spark.task.resource.{resourceName}.amount | 3.0.0 | SPARK-27760 | d30284b5a51dd784f663eb4eea37087b35a54d00#diff-76e731333fb756df3bff5ddb3b731c46 |  
spark.stage.maxConsecutiveAttempts | 2.2.0 | SPARK-13369 | 7b5d873aef672aa0aee41e338bab7428101e1ad3#diff-6a9ff7fb74fd490a50462d45db2d5e11 |  
spark.{driver\|executor}.rpc.io.serverThreads | 1.6.0 | SPARK-10745 | 7c5b641808740ba5eed05ba8204cdbaf3fc579f5#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.{driver\|executor}.rpc.io.clientThreads | 1.6.0 | SPARK-10745 | 7c5b641808740ba5eed05ba8204cdbaf3fc579f5#diff-d2ce9b38bdc38ca9d7119f9c2cf79907 |  
spark.{driver\|executor}.rpc.netty.dispatcher.numThreads | 3.0.0 | SPARK-29398 | 2f0a38cb50e3e8b4b72219c7b2b8b15d51f6b931#diff-a68a21481fea5053848ca666dd3201d8 |  
spark.r.driver.command | 1.5.3 | SPARK-10971 | 9695f452e86a88bef3bcbd1f3c0b00ad9e9ac6e1#diff-025470e1b7094d7cf4a78ea353fb3981 |  
spark.r.shell.command | 2.1.0 | SPARK-17178 | fa6347938fc1c72ddc03a5f3cd2e929b5694f0a6#diff-a78ecfc6a89edfaf0b60a5eaa0381970 |  
spark.graphx.pregel.checkpointInterval | 2.2.0 | SPARK-5484 | f971ce5dd0788fe7f5d2ca820b9ea3db72033ddc#diff-e399679417ffa6eeedf26a7630baca16 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #28035 from beliefer/supplement-configuration-version.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:32:04 +09:00
beliefer bed21770af [SPARK-31215][SQL][DOC] Add version information to the static configuration of SQL
### What changes were proposed in this pull request?
Add version information to the static configuration of `SQL`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.warehouse.dir | 2.0.0 | SPARK-14994 | 054f991c4350af1350af7a4109ee77f4a34822f0#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.catalogImplementation | 2.0.0 | SPARK-14720 and SPARK-13643 | 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d#diff-6bdad48cfc34314e89599655442ff210 |  
spark.sql.globalTempDatabase | 2.1.0 | SPARK-17338 | 23ddff4b2b2744c3dc84d928e144c541ad5df376#diff-6bdad48cfc34314e89599655442ff210 |  
spark.sql.sources.schemaStringLengthThreshold | 1.3.1 | SPARK-6024 | 6200f0709c5c8440decae8bf700d7859f32ac9d5#diff-41ef65b9ef5b518f77e2a03559893f4d | 1.3
spark.sql.filesourceTableRelationCacheSize | 2.2.0 | SPARK-19265 | 9d9d67c7957f7cbbdbe889bdbc073568b2bfbb16#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.codegen.cache.maxEntries | 2.4.0 | SPARK-24727 | b2deef64f604ddd9502a31105ed47cb63470ec85#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.codegen.comments | 2.0.0 | SPARK-15680 | f0e8738c1ec0e4c5526aeada6f50cf76428f9afd#diff-8bcc5aea39c73d4bf38aef6f6951d42c |  
spark.sql.debug | 2.1.0 | SPARK-17899 | db8784feaa605adcbd37af4bc8b7146479b631f8#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.hive.thriftServer.singleSession | 1.6.0 | SPARK-11089 | 167ea61a6a604fd9c0b00122a94d1bc4b1de24ff#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.extensions | 2.2.0 | SPARK-18127 | f0de600797ff4883927d0c70732675fd8629e239#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.queryExecutionListeners | 2.3.0 | SPARK-19558 | bd4eb9ce57da7bacff69d9ed958c94f349b7e6fb#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.streamingQueryListeners | 2.4.0 | SPARK-24479 | 7703b46d2843db99e28110c4c7ccf60934412504#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.ui.retainedExecutions | 1.5.0 | SPARK-8861 and SPARK-8862 | ebc3aad272b91cf58e2e1b4aa92b49b8a947a045#diff-81764e4d52817f83bdd5336ef1226bd9 |  
spark.sql.broadcastExchange.maxThreadThreshold | 3.0.0 | SPARK-26601 | 126310ca68f2f248ea8b312c4637eccaba2fdc2b#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.subquery.maxThreadThreshold | 2.4.6 | SPARK-30556 | 2fc562cafd71ec8f438f37a28b65118906ab2ad2#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.event.truncate.length | 3.0.0 | SPARK-27045 | e60d8fce0b0cf2a6d766ea2fc5f994546550570a#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.legacy.sessionInitWithConfigDefaults | 3.0.0 | SPARK-27253 | 83f628b57da39ad9732d1393aebac373634a2eb9#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.defaultUrlStreamHandlerFactory.enabled | 3.0.0 | SPARK-25694 | 8469614c0513fbed87977d4e741649db3fdd8add#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.streaming.ui.enabled | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.ui.retainedProgressUpdates | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.ui.retainedQueries | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Exists UT

Closes #27981 from beliefer/add-version-to-sql-static-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:31:25 +09:00
zhengruifeng 1dce6c1fd4 [SPARK-31222][ML] Make ANOVATest Sparsity-Aware
### What changes were proposed in this pull request?
when input dataset is sparse, make `ANOVATest` only process non-zero value

### Why are the changes needed?
for performance

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27982 from zhengruifeng/anova_sparse.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-31 10:40:17 +08:00
Dongjoon Hyun cda2e30e77
Revert "[SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery"
This reverts commit f376d24ea1.
2020-03-30 19:14:14 -07:00
Luca Canali aa98ac52db
[SPARK-30775][DOC] Improve the description of executor metrics in the monitoring documentation
### What changes were proposed in this pull request?
This PR (SPARK-30775) aims to improve the description of the executor metrics in the monitoring documentation.

### Why are the changes needed?
Improve and clarify monitoring documentation by:
- adding reference to the Prometheus end point, as implemented in [SPARK-29064]
- extending the list and descripion of executor metrics, following up from [SPARK-27157]

### Does this PR introduce any user-facing change?
Documentation update.

### How was this patch tested?
n.a.

Closes #27526 from LucaCanali/docPrometheusMetricsFollowupSpark29064.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-30 18:00:54 -07:00
Đặng Minh Dũng 1d0fc9aa85
[SPARK-29574][K8S][FOLLOWUP] Fix bash comparison error in Docker entrypoint.sh
### What changes were proposed in this pull request?
A small change to fix an error in Docker `entrypoint.sh`

### Why are the changes needed?
When spark running on Kubernetes, I got the following logs:
```log
+ '[' -n ']'
+ '[' -z ']'
++ /bin/hadoop classpath
/opt/entrypoint.sh: line 62: /bin/hadoop: No such file or directory
+ export SPARK_DIST_CLASSPATH=
+ SPARK_DIST_CLASSPATH=
```
This is because you are missing some quotes on bash comparisons.

### Does this PR introduce any user-facing change?
No

## How was this patch tested?
CI

Closes #28075 from dungdm93/patch-1.

Authored-by: Đặng Minh Dũng <dungdm93@live.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-30 15:41:57 -07:00
manuzhang 0d997e5156 [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
### What changes were proposed in this pull request?
Close idle connections at shuffle server side when an `IdleStateEvent` is triggered after `spark.shuffle.io.connectionTimeout` or `spark.network.timeout` time. It's based on following investigations.

1. We found connections on our clusters building up continuously (> 10k for some nodes). Is that normal ? We don't think so.
2. We looked into the connections on one node and found there were a lot of half-open connections. (connections only existed on one node)
3. We also checked those connections were very old (> 21 hours). (FYI, https://superuser.com/questions/565991/how-to-determine-the-socket-connection-up-time-on-linux)
4. Looking at the code, TransportContext registers an IdleStateHandler which should fire an IdleStateEvent when timeout. We did a heap dump of the YarnShuffleService and checked the attributes of IdleStateHandler. It turned out firstAllIdleEvent of many IdleStateHandlers were already false so IdleStateEvent were already fired.
5. Finally, we realized the IdleStateEvent would not be handled since closeIdleConnections are hardcoded to false for YarnShuffleService.

### Why are the changes needed?
Idle connections to YarnShuffleService could never be closed, and will be accumulating and taking up memory and file descriptors.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27998 from manuzhang/spark-31219.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-03-30 12:44:46 -05:00
Maxim Gekk a1dbcd13a3 [SPARK-31296][SQL][TESTS] Benchmark date-time rebasing in Parquet datasource
### What changes were proposed in this pull request?
In the PR, I propose to add new benchmark `DateTimeRebaseBenchmark` which should measure the performance of rebasing of dates/timestamps from/to to the hybrid calendar (Julian+Gregorian) to/from Proleptic Gregorian calendar:
1. In write, it saves separately dates and timestamps before and after 1582 year w/ and w/o rebasing.
2. In read, it loads previously saved parquet files by vectorized reader and by regular reader.

Here is the summary of benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?
To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-30 16:46:31 +08:00
zhengruifeng 34d6b90449 [SPARK-31283][ML] Simplify ChiSq by adding a common method
### What changes were proposed in this pull request?
add a common method `computeChiSq` and reuse it in both `chiSquaredDenseFeatures` and `chiSquaredSparseFeatures`

### Why are the changes needed?
to simplify ChiSq

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #28045 from zhengruifeng/simplify_chisq.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-30 13:37:56 +08:00
Oleksii Kachaiev 22bb6b0fdd [SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax
### What changes were proposed in this pull request?
`DataFrameStatFunctions` now works correctly with fully qualified column name (Table.Column syntax) by properly resolving the name instead of relying on field names from schema, notably:
* `approxQuantile`
* `freqItems`
* `cov`
* `corr`

(other functions from `DataFrameStatFunctions` already work correctly).

See code examples below.

### Why are the changes needed?
With current implementation some stat functions are impossible to use when joining datasets with similar column names.

### Does this PR introduce any user-facing change?
Yes. Before the change, the following code would fail with `AnalysisException`.

```scala
scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2")
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val dfx = df2.crossJoin(df1)
dfx: org.apache.spark.sql.DataFrame = [num: int, num: int]

scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
res0: Array[Double] = Array(1.0)

scala> dfx.stat.corr("table1.num", "table2.num")
res1: Double = 1.0

scala> dfx.stat.cov("table1.num", "table2.num")
res2: Double = 11.0

scala> dfx.stat.freqItems(Array("table1.num", "table2.num"))
res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, table2.num_freqItems: array<int>]
```

### How was this patch tested?
Corresponding unit tests are added to `DataFrameStatSuite.scala` (marked as "SPARK-30532").

Closes #27916 from kachayev/fix-spark-30532.

Authored-by: Oleksii Kachaiev <kachayev@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-30 13:20:57 +08:00
Maxim Gekk d2ff5c5bfb [SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
### What changes were proposed in this pull request?
In the PR, I propose to update the doc for the `timeZone` option in JSON/CSV datasources and for the `tz` parameter of the `from_utc_timestamp()`/`to_utc_timestamp()` functions, and to restrict format of config's values to 2 forms:
1. Geographical regions, such as `America/Los_Angeles`.
2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`.

### Why are the changes needed?
Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`, and manual testing.

Closes #28051 from MaxGekk/doc-time-zone-option.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-30 12:20:11 +08:00
Kengo Seki 60dd1a690f
[SPARK-31293][DSTREAMS][KINESIS][DOC] Fix wrong examples and help messages for Kinesis integration
### What changes were proposed in this pull request?

This PR (SPARK-31293) fixes wrong command examples, parameter descriptions and help message format for Amazon Kinesis integration with Spark Streaming.

### Why are the changes needed?

To improve usability of those commands.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

I ran the fixed commands manually and confirmed they worked as expected.

Closes #28063 from sekikn/SPARK-31293.

Authored-by: Kengo Seki <sekikn@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-29 14:27:19 -07:00