Commit graph

2869 commits

Author SHA1 Message Date
Yuanjian Li 8750363c8d [MINOR][DOCS] Emphasize the Streaming tab is for DStream API
### What changes were proposed in this pull request?
Emphasize the Streaming tab is for DStream API.

### Why are the changes needed?
Some users reported that it's a little confusing of the streaming tab and structured streaming tab.

### Does this PR introduce _any_ user-facing change?
Document change.

### How was this patch tested?
N/A

Closes #28854 from xuanyuanking/minor-doc.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-19 12:17:40 +09:00
James Yu ac98a9a07f [MINOR][DOCS] Update running-on-kubernetes.md
### What changes were proposed in this pull request?

Fix executor container name typo. `executor` should be `spark-kubernetes-executor`.

### Why are the changes needed?

The Executor pod container name the users actually get from their Kubernetes clusters is different from that described in the documentation.

For example, below is what a user get from an executor pod.
```
Containers:
  spark-kubernetes-executor:
    Container ID:  docker://aaaabbbbccccddddeeeeffff
    Image:         <imagename>
    Image ID:      docker-pullable://0000.dkr.ecr.us-east-0.amazonaws.com/spark
    Port:          7079/TCP
    Host Port:     0/TCP
    Args:
      executor
    State:          Running
      Started:      Thu, 28 May 2020 05:54:04 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  16Gi
```

### Does this PR introduce _any_ user-facing change?

Document change.

### How was this patch tested?

N/A

Closes #28862 from yuj/patch-1.

Authored-by: James Yu <yuj@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-18 14:36:20 -07:00
DB Tsai 9b792518b2 [SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build
### What changes were proposed in this pull request?
If a Spark distribution has built-in hadoop runtime, Spark will not populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to Yarn. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`.

### Why are the changes needed?
Without this, Spark will populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even Spark distribution has built-in hadoop. This results jar conflict and many unexpected behaviors in runtime.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually test with two builds, with-hadoop and no-hadoop builds.

Closes #28788 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-06-18 06:08:40 +00:00
GuoPhilipse f0e6d0ec13 [SPARK-31710][SQL] Fail casting numeric to timestamp by default
## What changes were proposed in this pull request?
we fail casting from numeric to timestamp by default.

## Why are the changes needed?
casting from numeric to timestamp is not a  non-standard,meanwhile it may generate different result between spark and other systems,for example hive

## Does this PR introduce any user-facing change?
Yes,user cannot cast numeric to timestamp directly,user have to use the following function to achieve the same effect:TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS

## How was this patch tested?
unit test added

Closes #28593 from GuoPhilipse/31710-fix-compatibility.

Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-16 08:35:35 +00:00
Jungtaek Lim (HeartSaVioR) fe68e95a5a [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numRowsDroppedByWatermark"
### What changes were proposed in this pull request?

This PR renames the variable from "numLateInputs" to "numRowsDroppedByWatermark" so that it becomes self-explanation.

### Why are the changes needed?

This is originated from post-review, see https://github.com/apache/spark/pull/28607#discussion_r439853232

### Does this PR introduce _any_ user-facing change?

No, as SPARK-24634 is not introduced in any release yet.

### How was this patch tested?

Existing UTs.

Closes #28828 from HeartSaVioR/SPARK-24634-v3-followup.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-16 16:41:08 +09:00
Takeshi Yamamuro 3698a14204 [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
### What changes were proposed in this pull request?

This PR intends to move keywords `ANTI`, `SEMI`, and `MINUS` from reserved to non-reserved.

### Why are the changes needed?

To comply with the ANSI/SQL standard.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28807 from maropu/SPARK-26905-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-16 00:27:45 +09:00
yi.wu 54e702c0dd [SPARK-31970][CORE] Make MDC configuration step be consistent between setLocalProperty and log4j.properties
### What changes were proposed in this pull request?

This PR proposes to use "mdc.XXX" as the consistent key for both `sc.setLocalProperty` and `log4j.properties` when setting up configurations for MDC.
### Why are the changes needed?

It's weird that we use "mdc.XXX" as key to set MDC value via `sc.setLocalProperty` while we use "XXX" as key to set MDC pattern in log4j.properties. It could also bring extra burden to the user.

### Does this PR introduce _any_ user-facing change?

No, as MDC feature is added in version 3.1, which hasn't been released.

### How was this patch tested?

Tested manually.

Closes #28801 from Ngone51/consistent-mdc.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-14 14:26:11 -07:00
Jungtaek Lim (HeartSaVioR) 84815d0550 [SPARK-24634][SS] Add a new metric regarding number of inputs later than watermark plus allowed delay
### What changes were proposed in this pull request?

Please refer https://issues.apache.org/jira/browse/SPARK-24634 to see rationalization of the issue.

This patch adds a new metric to count the number of inputs arrived later than watermark plus allowed delay. To make changes simpler, this patch doesn't count the exact number of input rows which are later than watermark plus allowed delay. Instead, this patch counts the inputs which are dropped in the logic of operator. The difference of twos are shown in streaming aggregation: to optimize the calculation, streaming aggregation "pre-aggregates" the input rows, and later checks the lateness against "pre-aggregated" inputs, hence the number might be reduced.

The new metric will be provided via two places:

1. On Spark UI: check the metrics in stateful operator nodes in query execution details page in SQL tab
2. On Streaming Query Listener: check "numLateInputs" in "stateOperators" in QueryProcessEvent.

### Why are the changes needed?

Dropping late inputs means that end users might not get expected outputs. Even end users may indicate the fact and tolerate the result (as that's what allowed lateness is for), but they should be able to observe whether the current value of allowed lateness drops inputs or not so that they can adjust the value.

Also, whatever the chance they have multiple of stateful operators in a single query, if Spark drops late inputs "between" these operators, it becomes "correctness" issue. Spark should disallow such possibility, but given we already provided the flexibility, at least we should provide the way to observe the correctness issue and decide whether they should make correction of their query or not.

### Does this PR introduce _any_ user-facing change?

Yes. End users will be able to retrieve the information of late inputs via two ways:

1. SQL tab in Spark UI
2. Streaming Query Listener

### How was this patch tested?

New UTs added & existing UTs are modified to reflect the change.

And ran manual test reproducing SPARK-28094.

I've picked the specific case on "B outer C outer D" which is enough to represent the "intermediate late row" issue due to global watermark.

https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17

Spark logs warning message on the query which means SPARK-28074 is working correctly,

```
20/05/30 17:52:47 WARN UnsupportedOperationChecker: Detected pattern of possible 'correctness' issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are "late rows" in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details.;
Join LeftOuter, ((D_FK#28 = D_ID#87) AND (B_LAST_MOD#26-T30000ms = D_LAST_MOD#88-T30000ms))
:- Join LeftOuter, ((C_FK#27 = C_ID#58) AND (B_LAST_MOD#26-T30000ms = C_LAST_MOD#59-T30000ms))
:  :- EventTimeWatermark B_LAST_MOD#26: timestamp, 30 seconds
:  :  +- Project [v#23.B_ID AS B_ID#25, v#23.B_LAST_MOD AS B_LAST_MOD#26, v#23.C_FK AS C_FK#27, v#23.D_FK AS D_FK#28]
:  :     +- Project [from_json(StructField(B_ID,StringType,false), StructField(B_LAST_MOD,TimestampType,false), StructField(C_FK,StringType,true), StructField(D_FK,StringType,true), value#21, Some(UTC)) AS v#23]
:  :        +- Project [cast(value#8 as string) AS value#21]
:  :           +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider3a7fd18c, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable396d2958, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee61a, [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> B, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
:  +- EventTimeWatermark C_LAST_MOD#59: timestamp, 30 seconds
:     +- Project [v#56.C_ID AS C_ID#58, v#56.C_LAST_MOD AS C_LAST_MOD#59]
:        +- Project [from_json(StructField(C_ID,StringType,false), StructField(C_LAST_MOD,TimestampType,false), value#54, Some(UTC)) AS v#56]
:           +- Project [cast(value#41 as string) AS value#54]
:              +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider3f507373, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable7b6736a4, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee61b, [key#40, value#41, topic#42, partition#43, offset#44L, timestamp#45, timestampType#46], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> C, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#33, value#34, topic#35, partition#36, offset#37L, timestamp#38, timestampType#39]
+- EventTimeWatermark D_LAST_MOD#88: timestamp, 30 seconds
   +- Project [v#85.D_ID AS D_ID#87, v#85.D_LAST_MOD AS D_LAST_MOD#88]
      +- Project [from_json(StructField(D_ID,StringType,false), StructField(D_LAST_MOD,TimestampType,false), value#83, Some(UTC)) AS v#85]
         +- Project [cast(value#70 as string) AS value#83]
            +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider2b90e779, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable36f8cd29, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee620, [key#69, value#70, topic#71, partition#72, offset#73L, timestamp#74, timestampType#75], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> D, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#62, value#63, topic#64, partition#65, offset#66L, timestamp#67, timestampType#68]
```

and we can find the late inputs from the batch 4 as follows:

![Screen Shot 2020-05-30 at 18 02 53](https://user-images.githubusercontent.com/1317309/83324401-058fd200-a2a0-11ea-8bf6-89cf777e9326.png)

which represents intermediate inputs are being lost, ended up with correctness issue.

Closes #28607 from HeartSaVioR/SPARK-24634-v3.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-14 14:37:38 +09:00
Kent Yao 6a424b93e5 [SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions
### What changes were proposed in this pull request?
Currently, `date_format` and `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp`, `to_date`  have different exception handling behavior for formatting datetime values.

In this PR, we apply the exception handling behavior of `date_format` to `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date`.

In the phase of creating the datetime formatted or formating, exceptions will be raised.

e.g.

```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa');
20/05/28 15:25:38 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa')]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-aaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
```

```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA');
20/05/28 15:26:10 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA')]
java.lang.IllegalArgumentException: Illegal pattern character: A
```

```java
spark-sql> select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd');
20/05/28 15:23:23 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd')]
java.lang.ArrayIndexOutOfBoundsException: 11
	at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568)
```
In the phase of parsing, `DateTimeParseException | DateTimeException | ParseException` will be suppressed, but `SparkUpgradeException` will still be raised

e.g.

```java
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy	exception
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
20/05/28 15:31:15 ERROR SparkSQLDriver: Failed in [select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2020-01-27T20:06:11.847-0800' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
```

```java
spark-sql> set spark.sql.legacy.timeParserPolicy=corrected;
spark.sql.legacy.timeParserPolicy	corrected
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
2020-01-28 12:06:11.847
```

### Why are the changes needed?
Consistency

### Does this PR introduce _any_ user-facing change?

Yes, invalid datetime patterns will fail `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date` instead of resulting `NULL`

### How was this patch tested?

add more tests

Closes #28650 from yaooqinn/SPARK-31830.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 16:56:45 +00:00
Akshat Bordia 6befb2d8bd [SPARK-31486][CORE] spark.submit.waitAppCompletion flag to control spark-submit exit in Standalone Cluster Mode
### What changes were proposed in this pull request?
These changes implement an application wait mechanism which will allow spark-submit to wait until the application finishes in Standalone Spark Mode. This will delay the exit of spark-submit JVM until the job is completed. This implementation will keep monitoring the application until it is either finished, failed or killed. This will be controlled via a flag (spark.submit.waitForCompletion) which will be set to false by default.

### Why are the changes needed?
Currently, Livy API for Standalone Cluster Mode doesn't know when the job has finished. If this flag is enabled, this can be used by Livy API (/batches/{batchId}/state) to find out when the application has finished/failed. This flag is Similar to spark.yarn.submit.waitAppCompletion.

### Does this PR introduce any user-facing change?

Yes, this PR introduces a new flag but it will be disabled by default.

### How was this patch tested?
Couldn't implement unit tests since the pollAndReportStatus method has System.exit() calls. Please provide any suggestions.
Tested spark-submit locally for the following scenarios:
1. With the flag enabled, spark-submit exits once the job is finished.
2. With the flag enabled and job failed, spark-submit exits when the job fails.
3. With the flag disabled, spark-submit exists post submitting the job (existing behavior).
4. Existing behavior is unchanged when the flag is not added explicitly.

Closes #28258 from akshatb1/master.

Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com>
Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-09 09:29:37 -05:00
Jungtaek Lim (HeartSaVioR) 8305b77796 [SPARK-28199][SPARK-28199][SS][FOLLOWUP] Mention the change of into the SS migration guide
### What changes were proposed in this pull request?

SPARK-28199 (#24996) made the trigger related public API to be exposed only from static methods of Trigger class. This is backward incompatible change, so some users may experience compilation error after upgrading to Spark 3.0.0.

While we plan to mention the change into release note, it's good to mention the change to the migration guide doc as well, since the purpose of the doc is to collect the major changes/incompatibilities between versions and end users would refer the doc.

### Why are the changes needed?

SPARK-28199 is technically backward incompatible change and we should kindly guide the change.

### Does this PR introduce _any_ user-facing change?

Doc change.

### How was this patch tested?

N/A, as it's just a doc change.

Closes #28763 from HeartSaVioR/SPARK-28199-FOLLOWUP-doc.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 04:52:48 +00:00
Gabor Somogyi 04f66bfd4e [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
### What changes were proposed in this pull request?
`fileNameOnly` parameter is split to 2 pieces in [this](dbb8143501) commit. This PR re-unites it.

### Why are the changes needed?
Parameter description split in doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #28739 from gaborgsomogyi/datasettxtfix.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-06 16:49:48 +09:00
Kent Yao 9d5b5d0a58 [SPARK-31879][SQL][TEST-JAVA11] Make week-based pattern invalid for formatting too
# What changes were proposed in this pull request?

After all these attempts https://github.com/apache/spark/pull/28692 and https://github.com/apache/spark/pull/28719 an https://github.com/apache/spark/pull/28727.
they all have limitations as mentioned in their discussions.

Maybe the only way is to forbid them all

### Why are the changes needed?

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {code DayOfWeek} enum.
     *
     * return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day or a year or a week backward.

e.g.

For the date `2019-12-29(Sunday)`, in the Sunday Start system(e.g. en-US), it belongs to 2020 of week-based-year, in the Monday Start system(en-GB), it goes to 2019. the week-of-week-based-year(w) will be affected too

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, user can not use 'YwuW',  but 'e' for 'u' instead. This can at least turn this not to be a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 08:14:01 +00:00
Enrico Minack 4bbe3c2bb4 [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
### What changes were proposed in this pull request?
The Pyspark Migration Guide needs to mention a breaking change of the Pyspark ML API.

### Why are the changes needed?
In SPARK-29093, all setters have been removed from `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public pyspark ML API, hence this is a breaking change.

### Does this PR introduce _any_ user-facing change?
Only documentation.

### How was this patch tested?
Visually.

Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-03 18:06:13 -05:00
Kent Yao afe95bd9ad [SPARK-31892][SQL] Disable week-based date filed for parsing
### What changes were proposed in this pull request?

This PR disables week-based date filed for parsing

closes #28674
### Why are the changes needed?

1. It's an un-fixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormater and backward-compatibility for different JDKs.A lot of effort has been made to prove it at https://github.com/apache/spark/pull/28674

2. The existing behavior itself in 2.4 is confusing, e.g.

```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
  The 'u' here seems not to go to the Monday of the first week in week-based form or the first day of the year in non-week-based form but go to the Monday of the second week in week-based form.

And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```

  I think we don't need to introduce all the weird behavior from Java

3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using it.

4. Avoiding JDK bugs

https://issues.apache.org/jira/browse/SPARK-31880

### Does this PR introduce _any_ user-facing change?

Yes, the 'Y/W/w/u/F/E' pattern cannot be used datetime parsing functions.

### How was this patch tested?

more tests added

Closes #28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 06:49:58 +00:00
Eren Avsarogullari 979593d708 [SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation
### What changes were proposed in this pull request?
SQL Rest API exposes query execution details and metrics as Public API. Its documentation will be useful for the end-users.

### Why are the changes needed?
SQL Rest API does not exist under Spark Rest API.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually build and check

Closes #28354 from erenavsarogullari/SPARK-31566.

Lead-authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com>
Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Eren Avsarogullari <erenavsarogullari@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-02 12:46:12 -07:00
lipzhu d79a8a88b1 [SPARK-31834][SQL] Improve error message for incompatible data types
### What changes were proposed in this pull request?
We should use dataType.catalogString to unified the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```

After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```

### How was this patch tested?
UT.

Closes #28654 from lipzhu/SPARK-31834.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-02 21:07:10 +09:00
Kent Yao 547c5bf552 [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
### What changes were proposed in this pull request?

As mentioned in https://github.com/apache/spark/pull/28673 and suggested via cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075

In this PR, we disable datetime pattern in the form of `y..y` and `Y..Y` whose lengths are greater than 10 to avoid sort of JDK bug as described below

he new datetime formatter introduces silent data change like,

```sql
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
```

For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when using the `NumberPrinterParser` to format it

```java
switch (signStyle) {
  case EXCEEDS_PAD:
    if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
      buf.append(decimalStyle.getPositiveSign());
    }
    break;

           ....
```
the `minWidth` == `len(y..y)`
the `EXCEED_POINTS` is

```java
/**
         * Array of 10 to the power of n.
         */
        static final long[] EXCEED_POINTS = new long[] {
            0L,
            10L,
            100L,
            1000L,
            10000L,
            100000L,
            1000000L,
            10000000L,
            100000000L,
            1000000000L,
            10000000000L,
        };
```

So when the `len(y..y)` is greater than 10, ` ArrayIndexOutOfBoundsException` will be raised.

 And at the caller side, for `from_unixtime`, the exception will be suppressed and silent data change occurs. for `date_format`, the `ArrayIndexOutOfBoundsException` will continue.

### Why are the changes needed?
fix silent data change

### Does this PR introduce _any_ user-facing change?

Yes, SparkUpgradeException will take place of `null` result when the pattern contains 10 or more continuous  'y' or 'Y'

### How was this patch tested?

new tests

Closes #28684 from yaooqinn/SPARK-31867-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-31 12:34:39 +00:00
Huaxin Gao 1b780f364b [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference
### What changes were proposed in this pull request?
Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference

### Why are the changes needed?
To make SQL reference complete

### Does this PR introduce _any_ user-facing change?
<img width="1100" alt="Screen Shot 2020-05-29 at 6 46 38 PM" src="https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png">

<img width="1099" alt="Screen Shot 2020-05-29 at 6 43 30 PM" src="https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png">

Only the the above pages are changed. The following two pages are the same as before.

<img width="1100" alt="Screen Shot 2020-05-28 at 10 05 27 PM" src="https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png">

<img width="1099" alt="Screen Shot 2020-05-28 at 10 05 08 PM" src="https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png">

### How was this patch tested?
Manually build and check

Closes #28672 from huaxingao/coalesce_hint.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-30 14:51:45 -05:00
Wenchen Fan 1528fbced8 [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
### What changes were proposed in this pull request?

If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message.

### Why are the changes needed?

to keep backward compatibility with Spark 2.4

### Does this PR introduce _any_ user-facing change?

Yes

Spark 2.4
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```

Spark 3.0 with Java 11
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```

Spark 3.0 with Java 8
```
// before this PR
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                            1|
+---------------------------------------------+
// after this PR
scala> sql("select date_format('1990-1-1', 'LLL')").show
org.apache.spark.SparkUpgradeException
```

### How was this patch tested?

manual test with java 8 and 11

Closes #28646 from cloud-fan/format.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 18:53:19 +00:00
Xingcan Cui 8ba2b47737 [SPARK-31792][SS][DOCS] Introduce the structured streaming UI in the Web UI doc
### What changes were proposed in this pull request?
This PR adds the structured streaming UI introduction to the Web UI doc.

![image](https://user-images.githubusercontent.com/1452518/82642209-92b99380-9bdb-11ea-9a0d-cbb26040b0ef.png)

### Why are the changes needed?
The structured streaming web UI introduced before was missing from the Web UI documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N.A.

Closes #28609 from xccui/ss-ui-doc.

Authored-by: Xingcan Cui <xccui@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-26 14:27:42 +09:00
Kent Yao 695cb617d4 [SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
### What changes were proposed in this pull request?

Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text Style while we turn to use `java.time.DateTimeFormatterBuilder` since 3.0.0, which output the leading single letter of the value, e.g. `December` would be `D`. In Spark 2.4 they mean Full-Text Style.

In this PR, we explicitly disable Narrow-Text Style for these pattern characters.

### Why are the changes needed?

Without this change, there will be a silent data change.

### Does this PR introduce _any_ user-facing change?

Yes, queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u` will fail if the pattern length is 5 and other patterns, e,g. 'k', 'm' also can accept a certain number of letters.

1. datetime patterns that are not supported by the new parser but the legacy will get SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". 2 options are given to end-users, one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns

2, datetime patterns that are not supported by both the new parser and the legacy, e.g.  "QQQQQ", "qqqqq",  will get IllegalArgumentException which is captured by Spark internally and results NULL to end-users.

### How was this patch tested?

add unit tests

Closes #28592 from yaooqinn/SPARK-31771.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:07:41 +00:00
Huaxin Gao ad9532a09c [SPARK-31612][SQL][DOCS][FOLLOW-UP] Fix a few issues in SQL ref
### What changes were proposed in this pull request?
Fix a few issues in SQL Reference

### Why are the changes needed?
To make SQL Reference look better

### Does this PR introduce _any_ user-facing change?
Yes.
before:
<img width="189" alt="Screen Shot 2020-05-21 at 11 41 34 PM" src="https://user-images.githubusercontent.com/13592258/82639052-d0f38a80-9bbc-11ea-81a4-22def4ca5cc0.png">

after:

<img width="195" alt="Screen Shot 2020-05-21 at 11 41 17 PM" src="https://user-images.githubusercontent.com/13592258/82639063-d5b83e80-9bbc-11ea-84d1-8361e6bee949.png">

before:
<img width="763" alt="Screen Shot 2020-05-21 at 11 45 22 PM" src="https://user-images.githubusercontent.com/13592258/82639252-3e9fb680-9bbd-11ea-863c-e6a6c2f83a06.png">

after:

<img width="724" alt="Screen Shot 2020-05-21 at 11 45 02 PM" src="https://user-images.githubusercontent.com/13592258/82639265-42cbd400-9bbd-11ea-8df2-fc5c255b84d3.png">

before:
<img width="437" alt="Screen Shot 2020-05-21 at 11 41 57 PM" src="https://user-images.githubusercontent.com/13592258/82639072-db158900-9bbc-11ea-9963-731881cda4fd.png">

after

<img width="347" alt="Screen Shot 2020-05-21 at 11 42 26 PM" src="https://user-images.githubusercontent.com/13592258/82639082-dfda3d00-9bbc-11ea-9bd2-f922cc91f175.png">

### How was this patch tested?
Manually build and check

Closes #28608 from huaxingao/doc_fix.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-23 08:43:16 +09:00
GuoPhilipse 892b600ce3 [SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark
### What changes were proposed in this pull request?
add docs for sql migration-guide

### Why are the changes needed?
let user know more about the cast scenarios in which Hive and Spark generate different results

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
no need to test

Closes #28605 from GuoPhilipse/spark-docs.

Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-22 22:01:38 +09:00
Izek Greenfield eaf7a2a4ed [SPARK-8981][CORE][TEST-HADOOP3.2][TEST-JAVA11] Add MDC support in Executor
### What changes were proposed in this pull request?
Added MDC support in all thread pools.
ThreaddUtils create new pools that pass over MDC.

### Why are the changes needed?
In many cases, it is very hard to understand from which actions the logs in the executor come from.
when you are doing multi-thread work in the driver and send actions in parallel.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No test added because no new functionality added it is thread pull change and all current tests pass.

Closes #26624 from igreenfield/master.

Authored-by: Izek Greenfield <igreenfield@axiomsl.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-20 07:41:00 +00:00
Max Gekk b3686a7622 [SPARK-31738][SQL][DOCS] Describe 'L' and 'M' month pattern letters
### What changes were proposed in this pull request?
1. Describe standard 'M' and stand-alone 'L' text forms
2. Add examples for all supported number of month letters

<img width="1047" alt="Screenshot 2020-05-18 at 08 57 31" src="https://user-images.githubusercontent.com/1580697/82178856-b16f1000-98e5-11ea-87c0-456ef94dcd43.png">

### Why are the changes needed?
To improve docs and show how to use month patterns.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By building docs and checking by eyes.

Closes #28558 from MaxGekk/describe-L-M-date-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-18 12:07:01 +00:00
Jungtaek Lim (HeartSaVioR) d2bec5e265 [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

This patch effectively reverts SPARK-30098 via below changes:

* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
  * Removed tests which depend on the config
  * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)

### Why are the changes needed?

SPARK-30098 brought confusion and frustration on using create table DDL query, and we agreed about the bad effect on the change.

Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.

### Does this PR introduce _any_ user-facing change?

No, compared to Spark 2.4.x. End users tried to experiment with Spark 3.0.0 previews will see the change that the behavior is going back to Spark 2.4.x, but I believe we won't guarantee compatibility in preview releases.

### How was this patch tested?

Existing UTs.

Closes #28517 from HeartSaVioR/revert-SPARK-30098.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:27:23 +00:00
Huaxin Gao 194ac3be8b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
### What changes were proposed in this pull request?
Add docs and examples for ANOVASelector and FValueSelector

### Why are the changes needed?
Complete the implementation of ANOVASelector and FValueSelector

### Does this PR introduce _any_ user-facing change?
Yes

<img width="850" alt="Screen Shot 2020-05-13 at 5 17 44 PM" src="https://user-images.githubusercontent.com/13592258/81878703-b4f94480-953d-11ea-9166-da3c64852b90.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 05 15 PM" src="https://user-images.githubusercontent.com/13592258/81878600-6055c980-953d-11ea-8b24-09c31647139b.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 06 06 PM" src="https://user-images.githubusercontent.com/13592258/81878603-621f8d00-953d-11ea-9447-39913ccc067d.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 06 21 PM" src="https://user-images.githubusercontent.com/13592258/81878606-65b31400-953d-11ea-9d76-51859266d1a8.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 10 PM" src="https://user-images.githubusercontent.com/13592258/81878611-69df3180-953d-11ea-8618-23a2a6cfd730.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 33 PM" src="https://user-images.githubusercontent.com/13592258/81878620-6cda2200-953d-11ea-9c46-da763328364e.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 07 47 PM" src="https://user-images.githubusercontent.com/13592258/81878625-6f3c7c00-953d-11ea-9d11-2281b33a0bd8.png">

<img width="851" alt="Screen Shot 2020-05-13 at 5 19 35 PM" src="https://user-images.githubusercontent.com/13592258/81878882-13bebe00-953e-11ea-9776-288bac97d93f.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 08 42 PM" src="https://user-images.githubusercontent.com/13592258/81878637-76638a00-953d-11ea-94b0-dc9bc85ae2b7.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 09 01 PM" src="https://user-images.githubusercontent.com/13592258/81878640-79f71100-953d-11ea-9a66-b27f9482fbd3.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 09 50 PM" src="https://user-images.githubusercontent.com/13592258/81878644-7cf20180-953d-11ea-9142-9658c8e90986.png">

<img width="851" alt="Screen Shot 2020-05-13 at 5 10 06 PM" src="https://user-images.githubusercontent.com/13592258/81878653-81b6b580-953d-11ea-9dc2-8015095cf569.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 10 59 PM" src="https://user-images.githubusercontent.com/13592258/81878658-854a3c80-953d-11ea-8dc9-217aa749fd00.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 11 27 PM" src="https://user-images.githubusercontent.com/13592258/81878659-87ac9680-953d-11ea-8c6b-74ab76748e4a.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 14 54 PM" src="https://user-images.githubusercontent.com/13592258/81878664-8b401d80-953d-11ea-9ee1-05f6677e263c.png">

<img width="850" alt="Screen Shot 2020-05-13 at 5 15 17 PM" src="https://user-images.githubusercontent.com/13592258/81878669-8da27780-953d-11ea-8216-77eb8bb7e091.png">

### How was this patch tested?
Manually build and check

Closes #28524 from huaxingao/examples.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-15 09:59:14 -05:00
Dongjoon Hyun 7ce3f76af6
[SPARK-31696][DOCS][FOLLOWUP] Update version in documentation
# What changes were proposed in this pull request?

This PR is a follow-up to fix a version of configuration document.

### Why are the changes needed?

The original PR is backported to branch-3.0.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manual.

Closes #28530 from dongjoon-hyun/SPARK-31696-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-14 10:25:22 -07:00
Dongjoon Hyun c8f3bd861d
[SPARK-31696][K8S] Support driver service annotation in K8S
### What changes were proposed in this pull request?

This PR aims to add `spark.kubernetes.driver.service.annotation` like `spark.kubernetes.driver.service.annotation`.

### Why are the changes needed?

Annotations are used in many ways. One example is that Prometheus monitoring system search metric endpoint via annotation.
- https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations

### Does this PR introduce _any_ user-facing change?

Yes. The documentation is added.

### How was this patch tested?

Pass Jenkins with the updated unit tests.

Closes #28518 from dongjoon-hyun/SPARK-31696.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 13:59:42 -07:00
HyukjinKwon e1315cd656
[SPARK-31701][R][SQL] Bump up the minimum Arrow version as 0.15.1 in SparkR
### What changes were proposed in this pull request?

This PR proposes to set the minimum Arrow version as 0.15.1 to be consistent with PySpark side at.

### Why are the changes needed?

It will reduce the maintenance overhead to match the Arrow versions, and minimize the supported range. SparkR Arrow optimization is experimental yet.

### Does this PR introduce _any_ user-facing change?

No, it's the change in unreleased branches only.

### How was this patch tested?

0.15.x was already tested at SPARK-29378, and we're testing the latest version of SparkR currently in AppVeyor. I already manually tested too.

Closes #28520 from HyukjinKwon/SPARK-31701.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-13 10:03:12 -07:00
Antonin Delpeuch 59d90997a5 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
### What changes were proposed in this pull request?

This changes the docs to make it clearer that order preservation is not guaranteed when saving a RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)).

I added two sentences about this in the RDD Programming Guide.

The issue was discussed on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

### Why are the changes needed?

Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain a RDD in the same order.

This is unfortunately not the case at the moment and there is no agreed upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this.

### Does this PR introduce _any_ user-facing change?

Yes, two new sentences in the documentation.

### How was this patch tested?

By checking that the documentation looks good.

Closes #28465 from wetneb/SPARK-5300-docs.

Authored-by: Antonin Delpeuch <antonin@delpeuch.eu>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-12 08:27:43 -05:00
Dongjoon Hyun b80309bdb4
[SPARK-31674][CORE][DOCS] Make Prometheus metric endpoints experimental
### What changes were proposed in this pull request?

This PR aims to new Prometheus-format metric endpoints experimental in Apache Spark 3.0.0.

### Why are the changes needed?

Although the new metrics are disabled by default, we had better make it experimental explicitly in Apache Spark 3.0.0 since the output format is still not fixed. We can finalize it in Apache Spark 3.1.0.

### Does this PR introduce _any_ user-facing change?

Only doc-change is visible to the users.

### How was this patch tested?

Manually check the code since this is a documentation and class annotation change.

Closes #28495 from dongjoon-hyun/SPARK-31674.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-10 22:32:26 -07:00
Huaxin Gao a75dc80a76 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference
### What changes were proposed in this pull request?
Remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
Please see #28414

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually build and check

Closes #28451 from huaxingao/html_cleanup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 12:57:25 -05:00
Huaxin Gao 08335b651a [SPARK-31659][ML][DOCS] Add VarianceThresholdSelector examples and doc
### What changes were proposed in this pull request?
Add VarianceThresholdSelector examples and doc

### Why are the changes needed?
VarianceThresholdSelector is a new feature selector in 3.1.0. We need to add examples and doc

### Does this PR introduce _any_ user-facing change?
Yes.
add Scala, Python and Java examples for VarianceThresholdSelector. Also add doc

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 01 AM" src="https://user-images.githubusercontent.com/13592258/81321791-e3f84d80-9047-11ea-837b-e39c193bd437.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 20 44 AM" src="https://user-images.githubusercontent.com/13592258/81321806-e8246b00-9047-11ea-8f35-206e330a92ab.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 27 AM" src="https://user-images.githubusercontent.com/13592258/81321822-ea86c500-9047-11ea-8743-99adec7f502b.png">

<img width="860" alt="Screen Shot 2020-05-07 at 9 21 43 AM" src="https://user-images.githubusercontent.com/13592258/81321826-ec508880-9047-11ea-9e7a-22ee5e13f495.png">

### How was this patch tested?
Manually checked

Closes #28478 from huaxingao/variance_doc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-08 10:57:35 +08:00
wang-zhun f3891e377f [SPARK-31235][YARN] Separates different categories of applications
### What changes were proposed in this pull request?
This PR adds `spark.yarn.applicationType` to identify the application type

### Why are the changes needed?
The current application defaults to the SPARK type.
In fact, different types of applications have different characteristics and are suitable for different scenarios.For example: SPAKR-SQL, SPARK-STREAMING.
I recommend distinguishing them by the parameter `spark.yarn.applicationType` so that we can more easily manage and maintain different types of applications.

### How was this patch tested?
1.add UT
2.Tested by verifying Yarn-UI `ApplicationType` in the following cases:
- client and cluster mode

Need additional explanation:
limit cannot exceed 20 characters, can be empty or space
The reasons are as follows:
```
// org.apache.hadoop.yarn.server.resourcemanager.submitApplication.
 if (submissionContext.getApplicationType() == null) {
      submissionContext
        .setApplicationType(YarnConfiguration.DEFAULT_APPLICATION_TYPE);
} else {
      // APPLICATION_TYPE_LENGTH = 20
      if (submissionContext.getApplicationType().length() > YarnConfiguration.APPLICATION_TYPE_LENGTH) {
        submissionContext.setApplicationType(submissionContext
          .getApplicationType().substring(0,
            YarnConfiguration.APPLICATION_TYPE_LENGTH));
      }
    }
```

Closes #28009 from wang-zhun/SPARK-31235.

Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-05-05 08:40:57 -05:00
Dilip Biswal 5052d9557d [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table
### What changes were proposed in this pull request?
This PR is to clean up the markdown file in remaining pages in sql reference. The first one was done by gatorsmile in  [28415](https://github.com/apache/spark/pull/28415)

- Replace HTML table by MD table
- **sql-ref-ansi-compliance.md**
<img width="967" alt="Screen Shot 2020-05-01 at 4 36 35 PM" src="https://user-images.githubusercontent.com/14225158/80848981-1cbca080-8bca-11ea-8a5d-63174b31c800.png">
- **sql-ref-datatypes.md (Scala)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 37 30 PM" src="https://user-images.githubusercontent.com/14225158/80849057-6a390d80-8bca-11ea-8866-ab08bab31432.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 39 18 PM" src="https://user-images.githubusercontent.com/14225158/80849061-6c9b6780-8bca-11ea-834c-eb93d3ab47ae.png">
- **sql-ref-datatypes.md (Java)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 41 24 PM" src="https://user-images.githubusercontent.com/14225158/80849138-b3895d00-8bca-11ea-9d3b-555acad2086c.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 41 39 PM" src="https://user-images.githubusercontent.com/14225158/80849140-b6844d80-8bca-11ea-9ca9-1812b6a76c02.png">
- **sql-ref-datatypes.md (Python)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 43 36 PM" src="https://user-images.githubusercontent.com/14225158/80849202-0400ba80-8bcb-11ea-96a5-7caecbf9dbbf.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 43 54 PM" src="https://user-images.githubusercontent.com/14225158/80849205-06fbab00-8bcb-11ea-8f00-6df52b151684.png">
- **sql-ref-datatypes.md (R)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 45 16 PM" src="https://user-images.githubusercontent.com/14225158/80849288-5fcb4380-8bcb-11ea-8277-8589b5bb31bc.png">
<img width="967" alt="Screen Shot 2020-05-01 at 4 45 36 PM" src="https://user-images.githubusercontent.com/14225158/80849294-62c63400-8bcb-11ea-9438-b4f1193bc757.png">
- **sql-ref-datatypes.md (SQL)**
<img width="967" alt="Screen Shot 2020-05-01 at 4 48 02 PM" src="https://user-images.githubusercontent.com/14225158/80849336-986b1d00-8bcb-11ea-9736-5fb40496b681.png">
- **sql-ref-syntax-qry-select-tvf.md**
<img width="967" alt="Screen Shot 2020-05-01 at 4 49 32 PM" src="https://user-images.githubusercontent.com/14225158/80849399-d10af680-8bcb-11ea-8dc2-e3e750e21a59.png">

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually using jekyll serve

Closes #28433 from dilipbiswal/sql-doc-table-cleanup.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-05 15:21:14 +09:00
Max Gekk 372ccba063
[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 17:27:02 -07:00
Kazuaki Ishizaki 35fcc8d5c5 [MINOR][DOCS] Fix typo in documents
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in `project/MimaExcludes.scala`

### Why are the changes needed?
Better readability of documents

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No test needed

Closes #28447 from kiszk/typo_20200504.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 16:53:50 +09:00
Huaxin Gao 75da05038b [MINOR][SQL][DOCS] Remove two leading spaces from sql tables
### What changes were proposed in this pull request?
Remove two leading spaces from sql tables.

### Why are the changes needed?

Follow the format of other references such as https://docs.snowflake.com/en/sql-reference/constructs/join.html, https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_10002.htm, https://www.postgresql.org/docs/10/sql-select.html.

### Does this PR introduce any user-facing change?

before
```
SELECT * FROM  test;
  +-+
  ...
  +-+
```
after
```
SELECT * FROM  test;
+-+
...
+-+
```

### How was this patch tested?
Manually build and check

Closes #28348 from huaxingao/sql-format.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-05-01 10:11:43 -07:00
Xingbo Jiang b7cde42b04 [SPARK-31619][CORE] Rename config "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout"
### What changes were proposed in this pull request?
The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect if "spark.dynamicAllocation.shuffleTracking.enabled" is true, so we should re-namespace that configuration so that it's nested under the "shuffleTracking" one.

### How was this patch tested?
Covered by current existing test cases.

Closes #28426 from jiangxb1987/confName.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 11:46:17 +09:00
Huaxin Gao 2410a45703 [SPARK-31612][SQL][DOCS] SQL Reference clean up
### What changes were proposed in this pull request?
SQL Reference cleanup

### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce _any_ user-facing change?
updated sql-ref-syntax-qry.html

before
<img width="1100" alt="Screen Shot 2020-04-29 at 11 08 25 PM" src="https://user-images.githubusercontent.com/13592258/80677799-70b27280-8a6e-11ea-8e3f-a768f29d0377.png">

after
<img width="1100" alt="Screen Shot 2020-04-29 at 11 05 55 PM" src="https://user-images.githubusercontent.com/13592258/80677803-74de9000-8a6e-11ea-880c-aa05c53254a6.png">

### How was this patch tested?
Manually build and check

Closes #28417 from huaxingao/cleanup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-01 06:30:35 +09:00
Xiao Li b5ecc41c73 [SPARK-28806][DOCS][FOLLOW-UP] Remove unneeded HTML from the MD file
### What changes were proposed in this pull request?
This PR is to clean up the markdown file in SHOW COLUMNS page.

- remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
- use the ``` sql for highlighting the SQL syntax.

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
**Before**

![Screen Shot 2020-04-29 at 5 20 11 PM](https://user-images.githubusercontent.com/11567269/80661963-fa4d4a80-8a44-11ea-9dea-c43cda6de010.png)

**After**

![Screen Shot 2020-04-29 at 6 03 50 PM](https://user-images.githubusercontent.com/11567269/80661940-f15c7900-8a44-11ea-9943-a83e8d8618fb.png)

Closes #28414 from gatorsmile/cleanupShowColumns.

Lead-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-30 09:34:56 -07:00
Yuanjian Li 7195a18bf2 [SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests
### What changes were proposed in this pull request?

- Rephrase the API doc for `Column.as`
- Simplify the UTs

### Why are the changes needed?
Address comments in https://github.com/apache/spark/pull/28326

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT added.

Closes #28390 from xuanyuanking/SPARK-27340-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 06:24:00 +00:00
gatorsmile f56c6630fb [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table
### What changes were proposed in this pull request?

This PR is to clean up the markdown file in datetime-pattern page.

- Replace HTML table by MD table

### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
**Before**
![Screen Shot 2020-04-29 at 7 59 10 PM](https://user-images.githubusercontent.com/11567269/80668093-c9294600-8a55-11ea-9dca-d558203298f8.png)

**After**

![Screen Shot 2020-04-29 at 8 13 38 PM](https://user-images.githubusercontent.com/11567269/80668146-f1b14000-8a55-11ea-8d47-8dc8a0378271.png)

Closes #28415 from gatorsmile/cleanupUDFPage.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 05:47:42 +00:00
DB Tsai ecfee82fda [SPARK-31582][YARN] Being able to not populate Hadoop classpath
### What changes were proposed in this pull request?
We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.

### Why are the changes needed?
Spark Yarn client populates extra Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster.

However, for `with-hadoop` Spark build that embeds Hadoop runtime, it can cause jar conflicts because Spark distribution can contain different version of Hadoop jars.

One case we have is when a user uses an Apache Spark distribution with its-own embedded hadoop, and submits a job to Cloudera or Hortonworks Yarn clusters, because of two different incompatible Hadoop jars in the classpath, it runs into errors.

By not populating the Hadoop classpath from the clusters can address this issue.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
An UT is added, but very hard to add a new integration test since this requires using different incompatible versions of Hadoop.

We also manually tested this PR, and we are able to submit a Spark job using Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating CDH classpath.

Closes #28376 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-04-29 21:10:40 +00:00
Terry Kim 36803031e8 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views
### What changes were proposed in this pull request?

This PR addresses two things:
- `SHOW TBLPROPERTIES` should supports view (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return empty result (2.4 behavior instead of throwing `AnalysisException`.

### Why are the changes needed?

It's a bug.

### Does this PR introduce any user-facing change?

Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key                              |value        |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2            |
|view.query.out.col.0             |c1           |
|view.query.out.numCols           |1            |
|p2                               |v2           |
|view.catalogAndNamespace.part.0  |spark_catalog|
|p1                               |v1           |
|view.catalogAndNamespace.part.1  |default      |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

### How was this patch tested?

Added tests.

Closes #28375 from imback82/show_tblproperties_followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:06:45 +00:00
Kent Yao 295d866969 [SPARK-31596][SQL][DOCS] Generate SQL Configurations from hive module to configuration doc
### What changes were proposed in this pull request?

This PR adds `-Phive` profile to the pre-build phase to build the hive module to dev classpath.
Then reflect the HiveUtils object to dump all configurations in the class.

### Why are the changes needed?

supply SQL configurations from hive module to doc

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

passing Jenkins
 add verified locally

![image](https://user-images.githubusercontent.com/8326978/80492333-6fae1200-8996-11ea-99fd-595ee18c67e5.png)

Closes #28394 from yaooqinn/SPARK-31596.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-29 15:34:45 +09:00
Huaxin Gao d34cb59fb3 [SPARK-31556][SQL][DOCS] Document LIKE clause in SQL Reference
### What changes were proposed in this pull request?
Document LIKE clause in SQL Reference

### Why are the changes needed?
To make SQL Reference complete

### Does this PR introduce any user-facing change?
Yes

<img width="1050" alt="Screen Shot 2020-04-25 at 5 49 57 PM" src="https://user-images.githubusercontent.com/13592258/80294346-5babab80-871d-11ea-8ac9-51bbab0aca88.png">

<img width="1050" alt="Screen Shot 2020-04-25 at 5 50 24 PM" src="https://user-images.githubusercontent.com/13592258/80294347-5ea69c00-871d-11ea-8c51-7a90ee20f7da.png">

<img width="1050" alt="Screen Shot 2020-04-25 at 5 50 42 PM" src="https://user-images.githubusercontent.com/13592258/80294351-61a18c80-871d-11ea-9e75-e3345d2f52f5.png">

### How was this patch tested?
Manually build and check

Closes #28332 from huaxingao/where_clause.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-29 09:17:23 +09:00
Huaxin Gao dcc09022f1 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
### What changes were proposed in this pull request?
Add a paragraph for scalar function in sql getting started

### Why are the changes needed?
To make 3.0 doc complete.

### Does this PR introduce any user-facing change?
before:
<img width="870" alt="Screen Shot 2020-04-21 at 10 11 12 PM" src="https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png">

after:
<img width="865" alt="Screen Shot 2020-04-22 at 11 49 59 PM" src="https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png">

<img width="1033" alt="Screen Shot 2020-04-23 at 6 22 53 PM" src="https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png">

### How was this patch tested?

Closes #28290 from huaxingao/scalar.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-28 11:17:45 -05:00