### What changes were proposed in this pull request?
This PR removes the description that `||` and `&&` can be used as logical operators from the migration guide.
### Why are the changes needed?
At the `Compatibility with Apache Hive` section in the migration guide, it describes that `||` and `&&` can be used as logical operators.
But, in fact, they cannot be used as described.
AFAIK, Hive also doesn't support `&&` and `||` as logical operators.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that `&&` and `||` cannot be used as logical operators with both Hive's interactive shell and `spark-sql`.
I also built the modified document and confirmed that the modified document doesn't break layout.
Closes#32023 from sarutak/modify-hive-compatibility-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add more flexable parameters for stage end point
endpoint /application/{app-id}/stages. It can be:
/application/{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]$quantiles=[comma separated quantiles string]&taskStatus=[RUNNING|SUCCESS|FAILED|PENDING]
where
```
query parameter details=true is to show the detailed task information within each stage. The default value is details=false;
query parameter status can select those stages with the specified status. When status parameter is not specified, a list of all stages are generated.
query parameter withSummaries=true is to show both task summary information in percentile distribution and executor summary information in percentile distribution. The default value is withSummaries=false.
query parameter quantiles support user defined quantiles, default quantiles is `0.0,0.25,0.5,0.75,1.0`
query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter will be set when details=true (i.e. this parameter will be ignored when details=false).
```
### Why are the changes needed?
More flexable restful API
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
UT
Closes#31204 from AngersZhuuuu/SPARK-26399-NEW.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Modify the `SubtractTimestamps` expression to return values of `DayTimeIntervalType` when `spark.sql.legacy.interval.enabled` is set to `false` (which is the default).
### Why are the changes needed?
To conform to the ANSI SQL standard which requires ANSI intervals as the result of timestamps subtraction, see
<img width="656" alt="Screenshot 2021-03-29 at 19 09 34" src="https://user-images.githubusercontent.com/1580697/112866455-7e2f0d00-90c2-11eb-96e6-3feb7eea7e09.png">
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```
and some tests from `SQLQueryTestSuite`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```
Closes#32016 from MaxGekk/subtract-timestamps-to-intervals.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add a new SQL function `try_cast`.
`try_cast` is identical to `AnsiCast` (or `Cast` when `spark.sql.ansi.enabled` is true), except it returns NULL instead of raising an error.
This expression has one major difference from `cast` with `spark.sql.ansi.enabled` as true: when the source value can't be stored in the target integral(Byte/Short/Int/Long) type, `try_cast` returns null instead of returning the low order bytes of the source value.
Note that the result of `try_cast` is not affected by the configuration `spark.sql.ansi.enabled`.
This is learned from Google BigQuery and Snowflake:
https://docs.snowflake.com/en/sql-reference/functions/try_cast.htmlhttps://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting
### Why are the changes needed?
This is an useful for the following scenarios:
1. When ANSI mode is on, users can choose `try_cast` an alternative way to run SQL without errors for certain operations.
2. When ANSI mode is off, users can use `try_cast` to get a more reasonable result for casting a value to an integral type: when an overflow error happens, `try_cast` returns null while `cast` returns the low order bytes of the source value.
### Does this PR introduce _any_ user-facing change?
Yes, adding a new function `try_cast`
### How was this patch tested?
Unit tests.
Closes#31982 from gengliangwang/tryCast.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
Fix code not close issue in monitoring.md
### Why are the changes needed?
Fix doc issue
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not need
Closes#32008 from AngersZhuuuu/SPARK-34911.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.interval.enabled` which will control when Spark SQL should use `CalendarIntervalType` instead of ANSI intervals.
2. Modify the `SubtractDates` expression to return values of `DayTimeIntervalType` when `spark.sql.legacy.interval.enabled` is set to `false` (which is the default).
### Why are the changes needed?
To conform to the ANSI SQL standard which requires ANSI intervals as the result of dates subtraction, see
<img width="656" alt="Screenshot 2021-03-29 at 19 09 34" src="https://user-images.githubusercontent.com/1580697/112866455-7e2f0d00-90c2-11eb-96e6-3feb7eea7e09.png">
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```
and some tests from `SQLQueryTestSuite`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```
Closes#31996 from MaxGekk/subtract-dates-to-intervals.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add a new config, `spark.shuffle.service.name`, which allows for Spark applications to look for a YARN shuffle service which is defined at a name other than the default `spark_shuffle`.
Add a new config, `spark.yarn.shuffle.service.metrics.namespace`, which allows for configuring the namespace used when emitting metrics from the shuffle service into the NodeManager's `metrics2` system.
Add a new mechanism by which to override shuffle service configurations independently of the configurations in the NodeManager. When a resource `spark-shuffle-site.xml` is present on the classpath of the shuffle service, the configs present within it will be used to override the configs coming from `yarn-site.xml` (via the NodeManager).
### Why are the changes needed?
There are two use cases which can benefit from these changes.
One use case is to run multiple instances of the shuffle service side-by-side in the same NodeManager. This can be helpful, for example, when running a YARN cluster with a mixed workload of applications running multiple Spark versions, since a given version of the shuffle service is not always compatible with other versions of Spark (e.g. see SPARK-27780). With this PR, it is possible to run two shuffle services like `spark_shuffle` and `spark_shuffle_3.2.0`, one of which is "legacy" and one of which is for new applications. This is possible because YARN versions since 2.9.0 support the ability to run shuffle services within an isolated classloader (see YARN-4577), meaning multiple Spark versions can coexist.
Besides this, the separation of shuffle service configs into `spark-shuffle-site.xml` can be useful for administrators who want to change and/or deploy Spark shuffle service configurations independently of the configurations for the NodeManager (e.g., perhaps they are owned by two different teams).
### Does this PR introduce _any_ user-facing change?
Yes. There are two new configurations related to the external shuffle service, and a new mechanism which can optionally be used to configure the shuffle service. `docs/running-on-yarn.md` has been updated to provide user instructions; please see this guide for more details.
### How was this patch tested?
In addition to the new unit tests added, I have deployed this to a live YARN cluster and successfully deployed two Spark shuffle services simultaneously, one running a modified version of Spark 2.3.0 (which supports some of the newer shuffle protocols) and one running Spark 3.1.1. Spark applications of both versions are able to communicate with their respective shuffle services without issue.
Closes#31936 from xkrogen/xkrogen-SPARK-34828-shufflecompat-config-from-classpath.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
As discussed in
https://github.com/apache/spark/pull/30145#discussion_r514728642https://github.com/apache/spark/pull/30145#discussion_r514734648
We need to rewrite current Grouping Analytics grammar to support as flexible as Postgres SQL to support subsequent development.
In postgres sql, it support
```
select a, b, c, count(1) from t group by cube (a, b, c);
select a, b, c, count(1) from t group by cube(a, b, c);
select a, b, c, count(1) from t group by cube (a, b, c, (a, b), (a, b, c));
select a, b, c, count(1) from t group by rollup(a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c, (a, b), (a, b, c));
```
In this pr, we have done three things as below, and we will split it to different pr:
- Refactor CUBE/ROLLUP (regarding them as ANTLR tokens in a parser)
- Refactor GROUPING SETS (the logical node -> a new expr)
- Support new syntax for CUBE/ROLLUP (e.g., GROUP BY CUBE ((a, b), (a, c)))
### Why are the changes needed?
Rewrite current Grouping Analytics grammar to support as flexible as Postgres SQL to support subsequent development.
### Does this PR introduce _any_ user-facing change?
User can write Grouping Analytics grammar as flexible as Postgres SQL to support subsequent development.
### How was this patch tested?
Added UT
Closes#30212 from AngersZhuuuu/refact-grouping-analytics.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, when a connection for TransportClient is marked as idled and closed, we suggest users adjust `spark.network.timeout` for all transport modules. As a lot of timeout configs will fallback to the `spark.network.timeout`, this could be a piece of overkill advice, we should give a more targeted one with `spark.${moduleName}.io.connectionTimeout`
### Why are the changes needed?
better advise for overloaded network traffic cases
### Does this PR introduce _any_ user-facing change?
yes, when a connection is zombied and closed by spark internally, users can use a more targeted config to tune their jobs
### How was this patch tested?
Just log and doc. Passing Jenkins and GA
Closes#31990 from yaooqinn/SPARK-34894.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
When we want to get stage's detail info with task information, it will return all tasks, the content is huge and always we just want to know some failed tasks/running tasks with whole stage info to judge is a task has some problem. This pr support
user to use
```
/application/[appid]/stages/[stage-id]?details=true&taskStatus=xxx
/application/[appid]/stages/[stage-id]/[stage-attempted-id]?details=true&taskStatus=xxx
```
to filter task details by task status
### Why are the changes needed?
More flexiable Restful API
### Does this PR introduce _any_ user-facing change?
User can use
```
/application/[appid]/stages/[stage-id]?details=true&taskStatus=xxx
/application/[appid]/stages/[stage-id]/[stage-attempted-id]?details=true&taskStatus=xxx
```
to filter task details by task status
### How was this patch tested?
Added
Closes#31165 from AngersZhuuuu/SPARK-34092.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Deprecating `spark.launcher.childConectionTimeout` in favor of `spark.launcher.childConnectionTimeout`
### Why are the changes needed?
srowen suggested it https://github.com/apache/spark/pull/30323#discussion_r521449342
### How was this patch tested?
No testing. Not even compiled
Closes#30679 from jsoref/spelling-connection.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Allow casting complex types as string type in ANSI mode.
### Why are the changes needed?
Currently, complex types are not allowed to cast as string type. This breaks the DataFrame.show() API. E.g
```
scala> sql(“select array(1, 2, 2)“).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve ‘CAST(`array(1, 2, 2)` AS STRING)’ due to data type mismatch:
cannot cast array<int> to string with ANSI mode on.
```
We should allow the conversion as the extension of the ANSI SQL standard, so that the DataFrame.show() still work in ANSI mode.
### Does this PR introduce _any_ user-facing change?
Yes, casting complex types as string type is now allowed in ANSI mode.
### How was this patch tested?
Unit tests.
Closes#31954 from gengliangwang/fixExplicitCast.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
For a specific stage, it is useful to show the task metrics in percentile distribution. This information can help users know whether or not there is a skew/bottleneck among tasks in a given stage. We list an example in taskMetricsDistributions.json
Similarly, it is useful to show the executor metrics in percentile distribution for a specific stage. This information can show whether or not there is a skewed load on some executors. We list an example in executorMetricsDistributions.json
We define `withSummaries` and `quantiles` query parameter in the REST API for a specific stage as:
applications/<application_id>/<application_attempt/stages/<stage_id>/<stage_attempt>?withSummaries=[true|false]& quantiles=0.05,0.25,0.5,0.75,0.95
1. withSummaries: default is false, define whether to show current stage's taskMetricsDistribution and executorMetricsDistribution
2. quantiles: default is `0.0,0.25,0.5,0.75,1.0` only effect when `withSummaries=true`, it define the quantiles we use when calculating metrics distributions.
When withSummaries=true, both task metrics in percentile distribution and executor metrics in percentile distribution are included in the REST API output. The default value of withSummaries is false, i.e. no metrics percentile distribution will be included in the REST API output.
### Why are the changes needed?
For a specific stage, it is useful to show the task metrics in percentile distribution. This information can help users know whether or not there is a skew/bottleneck among tasks in a given stage. We list an example in taskMetricsDistributions.json
### Does this PR introduce _any_ user-facing change?
User can use below restful API to get task metrics distribution and executor metrics distribution for indivial stage
```
applications/<application_id>/<application_attempt/stages/<stage_id>/<stage_attempt>?withSummaries=[true|false]
```
### How was this patch tested?
Added UT
Closes#31611 from AngersZhuuuu/SPARK-34488.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds a config `spark.yarn.kerberos.renewal.excludeHadoopFileSystems` which lists the filesystems to be excluded from delegation token renewal at YARN.
### Why are the changes needed?
MapReduce jobs can instruct YARN to skip renewal of tokens obtained from certain hosts by specifying the hosts with configuration mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,..,<hostN>.
But seems Spark lacks of similar option. So the job submission fails if YARN fails to renew DelegationToken for any of the remote HDFS cluster. The failure in DT renewal can happen due to many reason like Remote HDFS does not trust Kerberos identity of YARN etc. We have a customer facing such issue.
### Does this PR introduce _any_ user-facing change?
No, if the config is not set. Yes, as users can use this config to instruct YARN not to renew delegation token from certain filesystems.
### How was this patch tested?
It is hard to do unit test for this. We did verify it work from the customer using this fix in the production environment.
Closes#31761 from viirya/SPARK-34295.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
the given example uses a non-standard syntax for CREATE TABLE, by defining the partitioning column with the other columns, instead of in PARTITION BY.
This works is this case, because the partitioning column happens to be the last column defined, but it will break if instead 'name' would be used for partitioning.
I suggest therefore to change the example to use a standard syntax, like in
https://spark.apache.org/docs/3.1.1/sql-ref-syntax-ddl-create-table-hiveformat.html
### Why are the changes needed?
To show the better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, this fixes the user-facing docs.
### How was this patch tested?
CI should test it out.
Closes#31900 from robert4os/patch-1.
Authored-by: robert4os <robert4os@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Current link for `Azure Blob Storage and Azure Datalake Gen 2` leads to AWS information. Replacing the link to point to the right page.
### Why are the changes needed?
For users to access to the correct link.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the link correctly.
### How was this patch tested?
N/A
Closes#31938 from lenadroid/patch-1.
Authored-by: Lena <alehall@microsoft.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Update the Avro version to 1.10.2
### Why are the changes needed?
To stay up to date with upstream and catch compatibility issues with zstd
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#31866 from iemejia/SPARK-27733-upgrade-avro-1.10.2.
Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR aims to enable `spark.hadoopRDD.ignoreEmptySplits` by default for Apache Spark 3.2.0.
### Why are the changes needed?
Although this is a safe improvement, this hasn't been enabled by default to avoid the explicit behavior change. This PR aims to switch the default explicitly in Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior change is documented.
### How was this patch tested?
Pass the existing CIs.
Closes#31909 from dongjoon-hyun/SPARK-34809.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Document `mode` as a supported Imputer strategy in Pyspark docs.
### Why are the changes needed?
Support was added in 3.1, and documented in Scala, but some Python docs were missed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#31883 from srowen/ImputerModeDocs.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to support remote driver/executor template files.
### Why are the changes needed?
Currently, `KubernetesUtils.loadPodFromTemplate` supports only local files.
With this PR, we can do the following.
```bash
bin/spark-submit \
...
-c spark.kubernetes.driver.podTemplateFile=s3a://dongjoon/driver.yml \
-c spark.kubernetes.executor.podTemplateFile=s3a://dongjoon/executor.yml \
...
```
### Does this PR introduce _any_ user-facing change?
Yes, this is an improvement.
### How was this patch tested?
Manual testing.
Closes#31877 from dongjoon-hyun/SPARK-34783-2.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
For the following cases, ABS should throw exceptions since the results are out of the range of the result data types in ANSI mode.
```
SELECT abs(${Int.MinValue});
SELECT abs(${Long.MinValue});
```
### Why are the changes needed?
Better ANSI compliance
### Does this PR introduce _any_ user-facing change?
Yes, Abs throws an exception if input is out of range in ANSI mode
### How was this patch tested?
Unit test
Closes#31836 from gengliangwang/ansiAbs.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `ADD ARCHIVE` and `LIST ARCHIVES` commands to SQL and updates relevant documents.
SPARK-33530 added `addArchive` and `listArchives` to `SparkContext` but it's not supported yet to add/list archives with SQL.
### Why are the changes needed?
To complement features.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new test and confirmed the generated HTML from the updated documents.
Closes#31721 from sarutak/sql-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use resolved attributes instead of data-frame fields for replacing values.
### Why are the changes needed?
dataframe.na.replace() does not work for column having a dot in the name
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Added unit tests for the same
Closes#31769 from amandeep-sharma/master.
Authored-by: Amandeep Sharma <happyaman91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Fix the table of valid type coercion combinations. Binary type should be allowed casting to String type and disallowed casting to Numeric types.
2. Summary all the `CAST`s that can cause runtime exceptions.
### Why are the changes needed?
Fix a mistake in the docs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run `jekyll serve` and preview:
![image](https://user-images.githubusercontent.com/1097932/110334374-8fab5a80-7fd7-11eb-86e7-c519cfa41b99.png)
Closes#31781 from gengliangwang/reviseAnsiDoc2.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1 add a sapce between words
2 unify the initials' case
### Why are the changes needed?
correct spelling issues for better user experience
### Does this PR introduce _any_ user-facing change?
yes.
### How was this patch tested?
manually
Closes#31748 from hopefulnick/doc_rectify.
Authored-by: nickhliu <nickhliu@tencent.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In ANSI mode, casting String to Boolean should throw an exception on parse error, instead of returning null
### Why are the changes needed?
For better ANSI compliance
### Does this PR introduce _any_ user-facing change?
Yes, in ANSI mode there will be an exception on parse failure of casting String value to Boolean type.
### How was this patch tested?
Unit tests.
Closes#31734 from gengliangwang/ansiCastToBoolean.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Hive support type constructed value as partition spec value, spark should support too.
### Why are the changes needed?
Support TypeConstructed partition spec value keep same with hive
### Does this PR introduce _any_ user-facing change?
Yes, user can use TypeConstruct value as partition spec value such as
```
CREATE TABLE t1(name STRING) PARTITIONED BY (part DATE)
INSERT INTO t1 PARTITION(part = date'2019-01-02') VALUES('a')
CREATE TABLE t2(name STRING) PARTITIONED BY (part TIMESTAMP)
INSERT INTO t2 PARTITION(part = timestamp'2019-01-02 11:11:11') VALUES('a')
CREATE TABLE t4(name STRING) PARTITIONED BY (part BINARY)
INSERT INTO t4 PARTITION(part = X'537061726B2053514C') VALUES('a')
```
### How was this patch tested?
Added UT
Closes#30421 from AngersZhuuuu/SPARK-33474.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
spark.sql.adaptive.coalescePartitions.initialPartitionNum 200 -> (none)
spark.sql.adaptive.skewJoin.skewedPartitionFactor is 10 -> 5
### Why are the changes needed?
the wrong doc misguide people
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
passing doc
Closes#31717 from yaooqinn/minordoc0.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/31618 to document the available codecs for event log compression.
### Why are the changes needed?
Documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual.
Closes#31695 from dongjoon-hyun/SPARK-34503-DOC.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Cleanup all Zinc standalone server code, and realated coniguration.
### Why are the changes needed?
![image](https://user-images.githubusercontent.com/1736354/109154790-c1d3e580-77a9-11eb-8cde-835deed6e10e.png)
- Zinc is the incremental compiler to speed up builds of compilation.
- The scala-maven-plugin is the mave plugin, which is used by Spark, one of the function is to integrate the Zinc to enable the incremental compiler.
- Since Spark v3.0.0 ([SPARK-28759](https://issues.apache.org/jira/browse/SPARK-28759)), the scala-maven-plugin is upgraded to v4.X, that means Zinc v0.3.13 standalone server is useless anymore.
However, we still download, install, start the standalone Zinc server. we should remove all zinc standalone server code, and all related configuration.
See more in [SPARK-34539](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-34539) or the doc [Zinc standalone server is useless after scala-maven-plugin 4.x](https://docs.google.com/document/d/1u4kCHDx7KjVlHGerfmbcKSB0cZo6AD4cBdHSse-SBsM).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run any mvn build:
./build/mvn -DskipTests clean package -pl core
You could see the increamental compilation is still working, the stage of "scala-maven-plugin:4.3.0:compile (scala-compile-first)" with incremental compilation info, like:
```
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /root/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] Compiling 303 Scala sources and 27 Java sources to /root/spark/core/target/scala-2.12/test-classes ...
```
Closes#31647 from Yikun/cleanup-zinc.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
SPARK-33084 added the ability to use ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior although I found a few cases where it doesn't
1) The default value of the transitive parameter is false, both in case of parameter not being specified in coordinate or parameter value being invalid. The Hive behavior is that transitive is [true if not specified](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L169)) in the coordinate and [false for invalid values](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L124)). Also, regardless of Hive, I think a default of true for the transitive parameter also matches [ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).
2) The parameter value for transitive parameter is regarded as case-sensitive [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct, Hive [treats the parameter value case-insensitively](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L122)).
I propose that we be compatible with Hive for these behaviors
### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior
### Does this PR introduce _any_ user-facing change?
The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously an ivy coordinate without `transitive` parameter specified did not resolve transitive dependency, now it does.
2. Previously an `transitive` parameter value was treated case-sensitively. e.g. `transitive=TRUE` would be treated as false as it did not match exactly `true`. Now it will be treated case-insensitively.
### How was this patch tested?
Modified existing unit tests to test new behavior
Add new unit test to cover usage of `exclude` with unspecified `transitive`
Closes#31623 from shardulm94/spark-34506.
Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.
Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
Closes#31535 from PhillHenry/ParamRandomBuilder.
Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The endpoints of Prometheus metrics are properly marked and documented as an experimental (SPARK-31674). The class `PrometheusServlet` itself is not the part of an API so this PR proposes to remove it.
### Why are the changes needed?
To avoid marking a non-API as an API.
### Does this PR introduce _any_ user-facing change?
No, the class is already `private[spark]`.
### How was this patch tested?
Existing tests should cover.
Closes#31640 from HyukjinKwon/SPARK-34531.
Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix typo and highlight that `ADD PARTITIONS` is the default.
### Why are the changes needed?
Fix a typo which can mislead users.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
n/a
Closes#31633 from MaxGekk/repair-table-drop-partitions-followup.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update `Hadoop` dependency in K8S doc example.
### Why are the changes needed?
Apache Spark 3.2.0 is using Apache Hadoop 3.2.2 by default.
### Does this PR introduce _any_ user-facing change?
No. This is a doc-only change.
### How was this patch tested?
N/A
Closes#31628 from dongjoon-hyun/minor-doc.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Apache Spark 3.0 introduced `spark.eventLog.compression.codec` configuration.
For Apache Spark 3.2, this PR aims to set `zstd` as the default value for `spark.eventLog.compression.codec` configuration.
This only affects creating a new log file.
### Why are the changes needed?
The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.
**1. Save storage resources (and money)**
In general, ZSTD is much smaller than LZ4.
For example, in case of TPCDS (Scale 200) log, ZSTD generates about 3 times smaller log files than LZ4.
| CODEC | SIZE (bytes) |
|---------|-------------|
| LZ4 | 184001434|
| ZSTD | 64522396|
And, the plain file is 17.6 times bigger.
```
-rw-r--r-- 1 dongjoon staff 1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r-- 1 dongjoon staff 64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
```
**2. Better Usability**
We cannot decompress Spark-generated LZ4 event log files via CLI while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient to some users who want to uncompress and access them.
```
$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
```
```
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes
```
**3. Speed**
The following results are collected by running [lzbench](https://github.com/inikep/lzbench) on the above Spark event log. Note that
- This is not a direct comparison of Spark compression/decompression codec.
- `lzbench` is an in-memory benchmark. So, it doesn't show the benefit of the reduced network traffic due to the small size of ZSTD.
Here,
- To get ZSTD 1.4.8-1 result, `lzbench` `master` branch is used because Spark is using ZSTD 1.4.8.
- To get LZ4 1.7.5 result, `lzbench` `v1.7` branch is used because Spark is using LZ4 1.7.1.
```
Compressor name Compress. Decompress. Compr. size Ratio Filename
memcpy 7393 MB/s 7166 MB/s 1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1 1344 MB/s 3351 MB/s 56665767 4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5 1385 MB/s 4782 MB/s 127662168 11.24 spark-a1843ead29834f46b1125a03eca32679
```
### Does this PR introduce _any_ user-facing change?
- No for the apps which doesn't use `spark.eventLog.compress` because `spark.eventLog.compress` is disabled by default.
- No for the apps using `spark.eventLog.compression.codec` explicitly because this is a change of the default value.
- Yes for the apps using `spark.eventLog.compress` without setting `spark.eventLog.compression.codec`. In this case, previously `spark.io.compression.codec` value was used whose default is `lz4`.
So this JIRA issue, SPARK-34503, is labeled with `releasenotes`.
### How was this patch tested?
Pass the updated UT.
Closes#31618 from dongjoon-hyun/SPARK-34503.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to extend the `MSCK REPAIR TABLE` command, and support new options `{ADD|DROP|SYNC} PARTITIONS`. In particular:
1. Extend the logical node `RepairTable`, and add two new flags `enableAddPartitions` and `enableDropPartitions`.
2. Add similar flags to the v1 execution node `AlterTableRecoverPartitionsCommand`
3. Add new method `dropPartitions()` to `AlterTableRecoverPartitionsCommand` which drops partitions from the catalog if their locations in the file system don't exist.
4. Updated public docs about the `MSCK REPAIR TABLE` command:
<img width="1037" alt="Screenshot 2021-02-16 at 13 46 39" src="https://user-images.githubusercontent.com/1580697/108052607-7446d280-705d-11eb-8e25-7398254787a4.png">
Closes#31097
### Why are the changes needed?
- The changes allow to recover tables with removed partitions. The example below portraits the problem:
```sql
spark-sql> create table tbl2 (col int, part int) partitioned by (part);
spark-sql> insert into tbl2 partition (part=1) select 1;
spark-sql> insert into tbl2 partition (part=0) select 0;
spark-sql> show table extended like 'tbl2' partition (part = 0);
default tbl2 false Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
...
```
Remove the partition (part = 0) from the filesystem:
```
$ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```
Even after recovering, we cannot query the table:
```sql
spark-sql> msck repair table tbl2;
spark-sql> select * from tbl2;
21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```
- To have feature parity with Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, we can query recovered table:
```sql
spark-sql> msck repair table tbl2 sync partitions;
spark-sql> select * from tbl2;
1 1
spark-sql> show partitions tbl2;
part=1
```
### How was this patch tested?
- By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *PlanResolutionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParallelSuite"
```
- Added unified v1 and v2 tests for `MSCK REPAIR TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
```
Closes#31499 from MaxGekk/repair-table-drop-partitions.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR replaces all the occurrences of symbol literals (`'name`) with string interpolation (`$"name"`) in examples and documents.
### Why are the changes needed?
Symbol literals are used to represent columns in Spark SQL but the Scala community seems to remove `Symbol` completely.
As we discussed in #31569, first we should replacing symbol literals with `$"name"` in user facing examples and documents.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Build docs.
Closes#31615 from sarutak/replace-symbol-literals-in-doc-and-examples.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add `table_identifier` in sql-migration-guide for SHOW CREATE TABLE.
### Why are the changes needed?
To make document more readable.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test suites.
Closes#31608 from Karl-WangSK/sqldoc.
Lead-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Co-authored-by: ShiKai Wang <wskqing@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Update public docs of SQL commands about altering cached tables/views. For instance:
<img width="869" alt="Screenshot 2021-02-08 at 15 11 48" src="https://user-images.githubusercontent.com/1580697/107217940-fd3b8980-6a1f-11eb-98b9-9b2e3fe7f4ef.png">
### Why are the changes needed?
To inform users about commands behavior in altering cached tables or views.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the command below and manually checking the docs:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
Closes#31524 from MaxGekk/doc-cmd-caching.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fix an issue that `java.sql.RowId` is mapped to `LongType` and prefer `StringType`.
In the current implementation, JDBC RowID type is mapped to `LongType` except for `OracleDialect`, but there is no guarantee to be able to convert RowID to long.
`java.sql.RowId` declares `toString` and the specification of `java.sql.RowId` says
> _all methods on the RowId interface must be fully implemented if the JDBC driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
So, we should prefer StringType to LongType.
### Why are the changes needed?
This seems to be a potential bug.
### Does this PR introduce _any_ user-facing change?
Yes. RowID is mapped to StringType rather than LongType.
### How was this patch tested?
New test and the existing test case `SPARK-32992: map Oracle's ROWID type to StringType` in `OracleIntegrationSuite` passes.
Closes#31491 from sarutak/rowid-type.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This change is to document the newly added streaming table APIs in Structured Streaming Programming Guide.
### Why are the changes needed?
This will help our users when they try to use the new APIs.
### Does this PR introduce _any_ user-facing change?
Yes. Users will see the changes in the programming guide.
### How was this patch tested?
Built the HTML page and verified.
Attached is a screenshot of the section added:
![Table APIs Section - Scala](https://user-images.githubusercontent.com/44179472/108581923-1ff86700-736b-11eb-8fcd-efa04ac936de.png)
Closes#31590 from bozhang2820/table-api-doc.
Lead-authored-by: Bo Zhang <bo.zhang@databricks.com>
Co-authored-by: Bo Zhang <bozhang2820@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Explicitly highlight that the table rename command cannot move a table between databases.
### Why are the changes needed?
To inform users about actual behavior of the table rename command.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```sql
spark-sql> CREATE DATABASE db1;
spark-sql> CREATE DATABASE db2;
spark-sql> CREATE TABLE db1.tbl1 (c0 INT);
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db2.tbl1;
Error in query: RENAME TABLE source and destination databases do not match: 'db1' != 'db2';
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db1.tbl2;
spark-sql> SHOW TABLES IN db1 LIKE '*';
db1 tbl2 false
```
Closes#31586 from MaxGekk/doc-rename-table.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
BasicWriteStatsTracker to probe for a custom Xattr if the size of
the generated file is 0 bytes; if found and parseable use that as
the declared length of the output.
The matching Hadoop patch in HADOOP-17414:
* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
the data in the marker file.
As a result, spark job tracking will correctly report the amount of data uploaded
and yet to materialize.
### Why are the changes needed?
Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro` , adding a zero byte marker file at
the end and a json file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.
But: the write tracker statictics don't show progress as they measure the length of the
created file, find the marker file and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing that if
retrieved, the real progress can be reported.
There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.
If the FS being probed doesn't support the XAttr API, the header is missing
or the value not a positive long then the size of 0 is returned.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms
All of these return Some(0) as file length.
The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.
External downstream testing has done the full hadoop+spark end
to end operation, with manual review of logs to verify that the
data was successfully collected from the attribute.
Closes#30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead
Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode
and Avro options added by #31529:
- datetimeRebaseMode
<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">
### Why are the changes needed?
To inform users about supported DS options and SQL configs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
Closes#31564 from MaxGekk/doc-rebase-options.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Put the SQL config `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` to the list of deprecated configs `deprecatedSQLConfigs`
2. Update docs for the Avro datasource
<img width="982" alt="Screenshot 2021-02-17 at 21 04 26" src="https://user-images.githubusercontent.com/1580697/108249890-abed7180-7166-11eb-8cb7-0c246d2a34fc.png">
### Why are the changes needed?
The config exists for enough time. We can deprecate it, and recommend users to use `.format("avro")` instead.
### Does this PR introduce _any_ user-facing change?
Should not except of the warning with the recommendation to use the `avro` format.
### How was this patch tested?
1. By generating docs via:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
2. Manually checking the warning:
```
scala> spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", false)
21/02/17 21:20:18 WARN SQLConf: The SQL config 'spark.sql.legacy.replaceDatabricksSparkAvro.enabled' has been deprecated in Spark v3.2 and may be removed in the future. Use `.format("avro")` in `DataFrameWriter` or `DataFrameReader` instead.
```
Closes#31578 from MaxGekk/deprecate-replaceDatabricksSparkAvro.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.
Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".
After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files
This process for version update is tested for details please check the testing section.
### Why are the changes needed?
Using different Jekyll versions can generate different output documents.
This PR standardize the process.
### Does this PR introduce _any_ user-facing change?
No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case there should be no difference at all as the same Jekyll version is specified in the Gemfile.
### How was this patch tested?
#### Testing document generation
Doc generation step was triggered via the docker release:
```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```
The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```
#### Testing Jekyll (or other gem) update
First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle
# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"
gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```
Testing Jekyll version before the update:
```
$ bundle exec jekyll --version
jekyll 4.1.0
```
Imitating Jekyll update coming from git by reverting my local changes:
```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"
gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ git checkout Gemfile.lock
Updated 1 path from the index
```
Run the install:
```
$ bundle install
...
```
Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```
Closes#31559 from attilapiros/pin-jekyll-version.
Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Per discussion in https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057, we should add documentation for added new features of full outer and left semi joins into SS programming guide.
* Reworded the section for "Outer Joins with Watermarking", to make it work for full outer join. Updated the code snippet to show up full outer and left semi join.
* Added one section for "Semi Joins with Watermarking", similar to "Outer Joins with Watermarking".
* Updated "Support matrix for joins in streaming queries" to reflect latest fact for full outer and left semi join.
### Why are the changes needed?
Good for users and developers to follow guide to try out these two new features.
### Does this PR introduce _any_ user-facing change?
Yes. They will see the corresponding updated guide.
### How was this patch tested?
No, just documentation change. Previewed the markdown file in browser.
Also attached here for the change to the "Support matrix for joins in streaming queries" table.
<img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png">
Closes#31572 from c21/ss-doc.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR changes the type mapping for `money` and `money[]` types for PostgreSQL.
Currently, those types are tried to convert to `DoubleType` and `ArrayType` of `double` respectively.
But the JDBC driver seems not to be able to handle those types properly.
https://github.com/pgjdbc/pgjdbc/issues/100https://github.com/pgjdbc/pgjdbc/issues/1405
Due to these issue, we can get the error like as follows.
money type.
```
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```
money[] type.
```
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : $2,000.00
[info] at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
[info] at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:310)
[info] at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:171)
[info] at org.postgresql.jdbc.PgArray.getArray(PgArray.java:111)
```
For money type, a known workaround is to treat it as string so this PR do it.
For money[], however, there is no reasonable workaround so this PR remove the support.
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
Yes. As of this PR merged, money type is mapped to `StringType` rather than `DoubleType` and the support for money[] is stopped.
For money type, if the value is less than one thousand, `$100.00` for instance, it works without this change so I also updated the migration guide because it's a behavior change for such small values.
On the other hand, money[] seems not to work with any value but mentioned in the migration guide just in case.
### How was this patch tested?
New test.
Closes#31442 from sarutak/fix-for-money-type.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>