### What changes were proposed in this pull request?
This PR intends to rename `spark.sql.legacy.addDirectory.recursive` to `spark.sql.legacy.addDirectory.recursive.enabled`.
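For illustration, a minimal sketch of setting the renamed flag (the conf key is the one introduced here; the value is just an example):
```scala
spark.conf.set("spark.sql.legacy.addDirectory.recursive.enabled", "true")
```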
### Why are the changes needed?
For consistent option names.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes #27372 from maropu/SPARK-30234-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Update `testthat` to >= 2.0.0
- Replace `testthat:::run_tests` with `testthat:::test_package_dir`
- Add trivial assertions to tests without any expectations, to avoid skipping.
- Update related docs.
### Why are the changes needed?
`testthat` version has been frozen by [SPARK-22817](https://issues.apache.org/jira/browse/SPARK-22817) / https://github.com/apache/spark/pull/20003, but 1.0.2 is pretty old, and we shouldn't keep things in this state forever.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
- Existing CI pipeline:
- Windows build on AppVeyor, R 3.6.2, testthat 2.3.1
- Linux build on Jenkins, R 3.1.x, testthat 1.0.2
- Additional builds with testthat 2.3.1 using [sparkr-build-sandbox](https://github.com/zero323/sparkr-build-sandbox) on c7ed64af9e697b3619779857dd820832176b3be3
R 3.4.4 (image digest ec9032f8cf98)
```
docker pull zero323/sparkr-build-sandbox:3.4.4
docker run zero323/sparkr-build-sandbox:3.4.4 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
```
3.5.3 (image digest 0b1759ee4d1d)
```
docker pull zero323/sparkr-build-sandbox:3.5.3
docker run zero323/sparkr-build-sandbox:3.5.3 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
```
and 3.6.2 (image digest 6594c8ceb72f)
```
docker pull zero323/sparkr-build-sandbox:3.6.2
docker run zero323/sparkr-build-sandbox:3.6.2 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc
```
Corresponding [asciicasts](https://asciinema.org/) are available as DOI 10.5281/zenodo.3629431
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3629431.svg)](https://doi.org/10.5281/zenodo.3629431)
(a bit too large to burden asciinema.org with, but they can be played locally via `asciinema play`).
----------------------------
Continued from #27328.
Closes #27359 from zero323/SPARK-23435.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch addresses remaining functionality on event log compaction: integrate compaction into FsHistoryProvider.
This patch is the next task after SPARK-30479 (#27164); please refer to the description of PR #27085 for the overall rationale of this patch.
### Why are the changes needed?
One of the major goals of SPARK-28594 is to prevent event logs from becoming too huge, and SPARK-29779 achieves that goal. We had another approach earlier, but the old approach required models in both KVStore and live entities to guarantee compatibility, while they're not designed to do so.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes #27208 from HeartSaVioR/SPARK-30481.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/24098 that refactors the code to build the string once, per the [review comment](https://github.com/apache/spark/pull/24098#discussion_r369845234).
### Why are the changes needed?
Previously, we chose the minimal-change approach.
In this PR, we choose a more robust approach than the previous post-step string processing.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
The test case is extended with more cases.
Closes #27353 from dongjoon-hyun/SPARK-27166-2.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Stack input vectors into blocks (like ALS/MLP);
2. add a new param `blockSize`;
3. add a new class `InstanceBlock`;
4. standardize the input outside of the optimization procedure.
### Why are the changes needed?
1. Reduce the RAM needed to persist the training dataset (saves ~40% in tests);
2. use Level-2 BLAS routines (12%–28% faster, without native BLAS).
### Does this PR introduce any user-facing change?
a new param `blockSize`
### How was this patch tested?
Existing and updated test suites.
Closes #27360 from zhengruifeng/blockify_svc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to transform the `Like` expression into a `TernaryExpression` and add a third parameter, `escape`. So, the `like` function will have feature parity with the `LIKE ... ESCAPE` syntax supported by 187f3c1773.
### Why are the changes needed?
The `like` function can be called with 2 or 3 parameters, and is functionally equivalent to the `LIKE` and `LIKE ... ESCAPE` SQL expressions.
### Does this PR introduce any user-facing change?
Yes, before `like` fails with the exception:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7
```
After:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
true
```
### How was this patch tested?
- Add new example for the `like` function which is checked by `SQLQuerySuite`
- Run `RegexpExpressionsSuite` and `ExpressionParserSuite`.
Closes #27355 from MaxGekk/like-3-args.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Allow for using longs as seed for xxHash.
### Why are the changes needed?
Codegen fails when passing a seed to xxHash that is > 2^31.
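A minimal sketch of the failing shape, assuming the Catalyst `XxHash64` expression takes a `Long` seed after this change (internal API names used here for illustration only):
```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Literal, XxHash64}

// A seed larger than Int.MaxValue: previously the generated code failed to
// compile because the seed could not be represented as an int literal.
val bigSeed = 1L << 32
val hashed = new Column(XxHash64(Seq(Literal("spark")), bigSeed))
```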
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests pass. Should more be added?
Closes #27354 from patrickcording/fix_xxhash_seed_bug.
Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch converts CR/LF into LF in 3 source files; most files already use only LF. It also adds rules to enforce LF as the EOL for all Java, Scala, XML, Python, and R files.
### Why are the changes needed?
The majority of source code files use LF, and only three files use CR/LF. While using an IDE lets us ignore the difference, a file modified with an editor that doesn't handle line endings automatically can still produce unnecessary diffs.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
```
grep -IUrl --color "^M" . | grep "\.java\|\.scala\|\.xml\|\.py\|\.R" | grep -v "/target/" | grep -v "/build/" | grep -v "/dist/" | grep -v "dependency-reduced-pom.xml" | grep -v ".pyc"
```
(Please note you'll need to type CTRL+V -> CTRL+M in a bash shell to get `^M`, because it represents CR/LF, not a combination of `^` and `M`.)
Before the patch, the result is:
```
./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java
./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala
```
and after the patch, the result is None.
Git shows a WARNING message if the EOL of any source file of the given types is modified to CR/LF, like below:
```
warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala.
The file will have its original line endings in your working directory.
```
Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Document CLUSTER BY clause of SELECT statement in SQL Reference Guide.
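As a quick illustration of the documented clause (a hypothetical table `t(age INT, name STRING)` and a SparkSession `spark` are assumed):
```scala
// CLUSTER BY repartitions by the given expressions and sorts within each partition.
spark.sql("SELECT age, name FROM t CLUSTER BY age").show()
```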
### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand
the usage. This PR is aimed at addressing this issue.
### Does this PR introduce any user-facing change?
Yes.
**Before:**
There was no documentation for this.
**After:**
<img width="972" alt="Screen Shot 2020-01-20 at 2 59 05 PM" src="https://user-images.githubusercontent.com/14225158/72762704-7528de80-3b95-11ea-9d34-8fa0ab63d4c0.png">
<img width="972" alt="Screen Shot 2020-01-20 at 2 59 19 PM" src="https://user-images.githubusercontent.com/14225158/72762710-78bc6580-3b95-11ea-8279-2848d3b9e619.png">
### How was this patch tested?
Tested using `jekyll build --serve`.
Closes #27297 from dilipbiswal/sql-ref-select-clusterby.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Document DISTRIBUTE BY clause of SELECT statement in SQL Reference Guide.
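As a quick illustration (the same hypothetical table `t(age INT, name STRING)` as assumed above):
```scala
// DISTRIBUTE BY repartitions by the given expressions but, unlike CLUSTER BY,
// does not sort within the resulting partitions.
spark.sql("SELECT age, name FROM t DISTRIBUTE BY age").show()
```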
### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand
the usage. This PR is aimed at addressing this issue.
### Does this PR introduce any user-facing change?
Yes.
**Before:**
There was no documentation for this.
**After:**
<img width="972" alt="Screen Shot 2020-01-20 at 3 08 24 PM" src="https://user-images.githubusercontent.com/14225158/72763045-c08fbc80-3b96-11ea-8fb6-023cba5eb96a.png">
<img width="972" alt="Screen Shot 2020-01-20 at 3 08 34 PM" src="https://user-images.githubusercontent.com/14225158/72763047-c38aad00-3b96-11ea-80d8-cd3d2d4257c8.png">
### How was this patch tested?
Tested using `jekyll build --serve`.
Closes #27298 from dilipbiswal/sql-ref-select-distributeby.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Prevent unnecessary copies of data during conversion from Arrow to Pandas.
### Why are the changes needed?
During conversion of PyArrow data to pandas, columns are checked for timestamp types and then modified to correct for the local timezone. If the data contains no timestamp types, unnecessary copies of the data can be made. This is most prevalent when checking the columns of a pandas DataFrame, where each series is assigned back to the DataFrame regardless of whether it had timestamps. See https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html and ARROW-7596 for discussion.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Add identifier and catalog information in DataSourceV2Relation so it would be possible to do richer checks in checkAnalysis step.
### Why are the changes needed?
In Data Source V2, table implementations are all customized, so we may not be able to get the resolved identifier from the tables themselves. Therefore we encode the table and catalog information in `DataSourceV2Relation` so that no external changes are needed to make this information available.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests in the following suites:
- CatalogManagerSuite.scala
- CatalogV2UtilSuite.scala
- SupportsCatalogOptionsSuite.scala
- PlanResolutionSuite.scala
Closes #26957 from yuchenhuo/SPARK-30314.
Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
This PR is to remove the query index from the golden files of `SQLQueryTestSuite`.
### Why are the changes needed?
Because `SQLQueryTestSuite`'s golden files record a query index for each query, removing any query statement (except the last one) generates many unneeded differences. This makes code review harder, and the number of changed lines is misleading.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #27361 from gatorsmile/removeIndexNum.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This reverts commit 1d20d13149.
Closes #27351 from gatorsmile/revertSPARK25496.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Disabling test for cleaning closure of recursive function.
### Why are the changes needed?
As of 9514b822a7 this test is no longer valid, and recursive calls, even simple ones:
```r
f <- function(x) {
  if (x > 0) {
    f(x - 1)
  } else {
    x
  }
}
```
lead to
```
Error: node stack overflow
```
This issue is silenced when tested with `testthat` 1.x (reason unknown), but causes failures when using `testthat` 2.x (the issue can be reproduced outside the test context).
The problem is known and tracked by [SPARK-30629](https://issues.apache.org/jira/browse/SPARK-30629).
Therefore, keeping this test active doesn't make sense, as it will lead to continuous test failures once `testthat` is updated (https://github.com/apache/spark/pull/27359 / SPARK-23435).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
CC falaki
Closes #27363 from zero323/SPARK-29777-FOLLOWUP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
For better JDK11 support, this PR aims to upgrade **Jersey** and **javassist** to `2.30` and `3.25.0-GA` respectively.
### Why are the changes needed?
**Jersey**: This will bring the following `Jersey` updates.
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.30.html
- https://github.com/eclipse-ee4j/jersey/issues/4245 (Java 11 java.desktop module dependency)
**javassist**: This is a transitive dependency, upgraded from 3.20.0-CR2 to 3.25.0-GA.
- `javassist` officially supports JDK11 from [3.24.0-GA release note](https://github.com/jboss-javassist/javassist/blob/master/Readme.html#L308).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with both JDK8 and JDK11.
Closes #27357 from dongjoon-hyun/SPARK-30639.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch proposes to prune unnecessary nested fields from a Generate that has no Project on top of it.
### Why are the changes needed?
In the Optimizer, we can prune nested columns from `Project(projectList, Generate)`. However, unnecessary columns could still be read in `Generate` if there is no `Project` on top of it. We should prune those too; a sketch of such a query shape follows.
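A hedged sketch of a query shape this targets (the input path, schema, and names are assumed for illustration):
```scala
import org.apache.spark.sql.functions.{col, explode}

// Assumed schema: items: array<struct<a: int, b: string>>. Only field `a` of the
// exploded rows is consumed, so Generate should not need to read field `b` at all.
val df = spark.read.parquet("/tmp/events") // hypothetical input
df.select(explode(col("items"))).select(col("col.a"))
```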
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes #26978 from viirya/SPARK-29721.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove `numTrees` in GBT in 3.0.0.
### Why are the changes needed?
Currently, GBT has
```
/**
 * Number of trees in ensemble
 */
@Since("2.0.0")
val getNumTrees: Int = trees.length
```
and
```
/** Number of trees in ensemble */
val numTrees: Int = trees.length
```
I think we should remove one of them. We deprecated it in 2.4.5 via https://github.com/apache/spark/pull/27352.
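A hedged migration sketch (assuming, per the snippet above, that `getNumTrees` is the surviving accessor):
```scala
import org.apache.spark.ml.classification.GBTClassificationModel

// After the removal, read the ensemble size via getNumTrees instead of numTrees.
def ensembleSize(model: GBTClassificationModel): Int = model.getNumTrees
```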
### Does this PR introduce any user-facing change?
Yes, remove `numTrees` in GBT in 3.0.0.
### How was this patch tested?
existing tests
Closes #27330 from huaxingao/spark-numTrees.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a SPARK_APPLICATION_ID environment variable when Spark configures the driver pod.
### Why are the changes needed?
Currently, the driver doesn't have this in its environment, and it's not convenient to retrieve the Spark application ID.
The use case: we want to look up the Spark application ID, create an application folder, and redirect driver logs to that folder.
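A minimal sketch of the intended consumption, assuming the variable is injected as described (the folder layout is hypothetical):
```scala
// Inside the driver: build a per-application log directory from the env var.
val appId = sys.env.getOrElse("SPARK_APPLICATION_ID", "unknown-app")
val logDir = s"/var/log/spark/$appId"
```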
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Unit tested. I also built a new distribution and container image to kick off a job in Kubernetes, and I do see SPARK_APPLICATION_ID added there.
Closes #27347 from Jeffwan/SPARK-30626.
Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Disable all the V2 file sources in Spark 3.0 by default.
### Why are the changes needed?
There are still some missing parts in the file source V2 framework:
1. It doesn't support reporting file scan metrics such as "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. This requires another patch in the data source V2 framework. Tracked by [SPARK-30362](https://issues.apache.org/jira/browse/SPARK-30362)
2. It doesn't support partition pruning with subqueries (including dynamic partition pruning) for now. Tracked by [SPARK-30628](https://issues.apache.org/jira/browse/SPARK-30628)
As we are approaching the code freeze on Jan 31st, this PR proposes to disable all the V2 file sources in Spark 3.0 by default.
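A sketch of the user-facing knob, assuming the existing `spark.sql.sources.useV1SourceList` conf is the mechanism whose default this PR extends (removing a format from the list would re-enable V2 for it):
```scala
// Formats listed here fall back to the V1 file sources.
spark.conf.set("spark.sql.sources.useV1SourceList", "csv,json,orc,parquet,text")
```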
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #27348 from gengliangwang/disableFileSourceV2.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Minor documentation fix
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Manually; consider adding tests?
Closes #27295 from deepyaman/patch-2.
Authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/spark/pull/25666, adding the description and example for the Scala function API `filter`.
### Why are the changes needed?
It is hard to tell which parameter is the index column.
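A hedged usage sketch of the variant in question (assumes a SparkSession named `spark` with `spark.implicits._` imported; per the description, the second lambda parameter is the element index):
```scala
import org.apache.spark.sql.functions.{col, filter}

val df = Seq(Seq("a", "b", "c", "d")).toDF("xs")
// Keep only elements at even positions; `i` is the 0-based index column.
df.select(filter(col("xs"), (x, i) => i % 2 === 0).as("even_indexed"))
```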
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #27336 from gatorsmile/spark28962.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR is to rename `spark.sql.subquery.reuse` to `spark.sql.execution.subquery.reuse.enabled`.
### Why are the changes needed?
Make it consistent and clear.
### Does this PR introduce any user-facing change?
N/A. This is a [new conf added in Spark 3.0](https://github.com/apache/spark/pull/23998)
### How was this patch tested?
N/A
Closes #27346 from gatorsmile/spark27083.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Currently, in the following scenario, bucket join is not utilized:
```scala
val df = (0 until 20).map(i => (i, i)).toDF("i", "j").as("df")
df.write.format("parquet").bucketBy(8, "i").saveAsTable("t")
sql("CREATE VIEW v AS SELECT * FROM t")
sql("SELECT * FROM t a JOIN v b ON a.i = b.i").explain
```
```
== Physical Plan ==
*(4) SortMergeJoin [i#13], [i#15], Inner
:- *(1) Sort [i#13 ASC NULLS FIRST], false, 0
: +- *(1) Project [i#13, j#14]
: +- *(1) Filter isnotnull(i#13)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
+- *(3) Sort [i#15 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i#15, 8), true, [id=#64] <----- Exchange node introduced
+- *(2) Project [i#13 AS i#15, j#14 AS j#16]
+- *(2) Filter isnotnull(i#13)
+- *(2) ColumnarToRow
+- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:..., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
```
Notice that an `Exchange` node is present. This is because `Project` introduces aliases, and `outputPartitioning` and `requiredChildDistribution` do not consider aliases when `EnsureRequirements` checks for bucket join. This PR addresses this scenario by taking the aliases into account.
### Why are the changes needed?
This allows bucket join to be utilized in the above example.
### Does this PR introduce any user-facing change?
Yes, now with the fix, the `explain` output is as follows:
```
== Physical Plan ==
*(3) SortMergeJoin [i#13], [i#15], Inner
:- *(1) Sort [i#13 ASC NULLS FIRST], false, 0
: +- *(1) Project [i#13, j#14]
: +- *(1) Filter isnotnull(i#13)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:.., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
+- *(2) Sort [i#15 ASC NULLS FIRST], false, 0
+- *(2) Project [i#13 AS i#15, j#14 AS j#16]
+- *(2) Filter isnotnull(i#13)
+- *(2) ColumnarToRow
+- FileScan parquet default.t[i#13,j#14] Batched: true, DataFilters: [isnotnull(i#13)], Format: Parquet, Location: InMemoryFileIndex[file:.., PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int>, SelectedBucketsCount: 8 out of 8
```
Note that the `Exchange` is no longer present.
### How was this patch tested?
Closes #26943 from imback82/bucket_alias.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to move the `RESERVED_PROPERTIES` constants from `SupportsNamespaces` and `TableCatalog` to `CatalogV2Util`, which keeps `RESERVED_PROPERTIES` safe for internal usage only.
### Why are the changes needed?
`RESERVED_PROPERTIES` should not be changed by subclasses.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Existing UTs.
Closes #27318 from yaooqinn/SPARK-30603.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix a bug in https://github.com/apache/spark/pull/26589 to make this feature work.
### Why are the changes needed?
This feature doesn't actually work without the fix.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
new test
Closes #27341 from cloud-fan/cache.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update the scalafmt plugin to 1.0.3 and use its new `onlyChangedFiles` feature rather than `--diff`.
### Why are the changes needed?
Older versions of the plugin either didn't work with Scala 2.13, or got rid of the `--diff` argument and didn't allow formatting only changed files.
### Does this PR introduce any user-facing change?
The `dev/scalafmt` script no longer passes through arbitrary args; instead, the single arg selects the Scala version. The issue here is that the plugin name literally contains the Scala version, and there doesn't appear to be a shorter way to refer to it. If srowen or someone else with better Maven-fu has an idea, I'm all ears.
### How was this patch tested?
Manually, e.g. edited a file and ran `dev/scalafmt` or `dev/scalafmt 2.13`.
Closes #27279 from koeninger/SPARK-30570.
Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Document CREATE TABLE statement in SQL Reference Guide.
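A quick illustration of the statement being documented (the table name and columns are made up; a SparkSession `spark` is assumed):
```scala
spark.sql("CREATE TABLE students (id INT, name STRING) USING parquet")
```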
### Why are the changes needed?
Adding documentation for SQL reference.
### Does this PR introduce any user-facing change?
yes
Before:
There was no documentation for this.
### How was this patch tested?
Used jekyll build and serve to verify.
Closes #26759 from PavithraRamachandran/create_doc.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Expressions are very likely to be serialized and sent to executors, we should avoid unnecessary serialization overhead as much as we can.
This PR fixes `AggregateExpression`.
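A generic illustration of the pattern (not the actual Spark diff): derived state is marked `@transient lazy val` so it is recomputed after deserialization instead of being shipped:
```scala
// Sketch: `resultName` is derived from `children`, so there is no need to
// serialize it to executors; it is rebuilt lazily on first access there.
case class AggLike(children: Seq[String]) {
  @transient lazy val resultName: String = children.mkString("agg(", ", ", ")")
}
```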
### Why are the changes needed?
small improvement
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes #27342 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to remove sorting in the asserts of checking output of:
- expression examples,
- SQL tests in `SQLQueryTestSuite`.
### Why are the changes needed?
* Sorted `actual` and `expected` make assert output unusable. Instead of `"[true]" did not equal "[false]"`, it looks like `"[ertu]" did not equal "[aefls]"`.
* Output of expression examples should be always the same except nondeterministic expressions listed in the `ignoreSet` of the `check outputs of expression examples` test.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `SQLQuerySuite` via `./build/sbt "sql/test:testOnly org.apache.spark.sql.SQLQuerySuite"`.
Closes #27324 from MaxGekk/remove-sorting-in-examples-tests.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move the `defaultNamespace` method from the interface `SupportsNamespace` to `CatalogPlugin`.
### Why are the changes needed?
While implementing JDBC V2, I realized that the default namespace is a very important piece of information. Even if you don't want to implement namespace-manipulation functionality like CREATE/DROP/ALTER namespace, you still need to report the default namespace.
The default namespace is not about functionality but a matter of correctness. If you don't know the default namespace of a catalog, it's wrong to assume it's `[]`; you may get a table-not-found exception if you do so.
I think it's more reasonable to put the `defaultNamespace` method in the base class `CatalogPlugin`. It returns `[]` by default, so it won't bother implementations that really don't have a namespace concept.
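A hedged sketch of the resulting shape (a simplified stand-in, not the actual `CatalogPlugin` source):
```scala
// The base catalog trait reports its default namespace; the empty default means
// implementations without a namespace concept need not override anything.
trait CatalogPluginLike {
  def name(): String
  def defaultNamespace(): Array[String] = Array.empty
}
```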
### Does this PR introduce any user-facing change?
yes, but for an unreleased API.
### How was this patch tested?
existing tests
Closes #27319 from cloud-fan/ns.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is to remove the conf.
### Why are the changes needed?
This rule can be excluded using `spark.sql.optimizer.excludedRules` without an extra conf.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
N/A
Closes #27334 from gatorsmile/spark27871.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix a few super nit problems
### Why are the changes needed?
To make doc look better
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Tested using `jekyll build --serve`.
Closes #27332 from huaxingao/spark-30575-followup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.
### Why are the changes needed?
Currently, RF with `numTrees=1` will directly build a tree using the original dataset,
while with `numTrees>1` it will use bootstrap samples to build trees.
This design exists so that a `DecisionTreeModel` can be trained via the RandomForest implementation; however, it is somewhat strange.
In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.
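A hedged usage sketch (assuming the new param follows ML's usual setter convention):
```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// With bootstrap disabled, even a multi-tree forest trains on the original
// dataset rather than on bootstrap samples.
val rf = new RandomForestClassifier().setNumTrees(10).setBootstrap(false)
```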
### Does this PR introduce any user-facing change?
Yes, a new param is added.
### How was this patch tested?
existing testsuites
Closes #27254 from zhengruifeng/add_bootstrap.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR adds:
- `pyspark.sql.functions.overlay` function to PySpark
- `overlay` function to SparkR
### Why are the changes needed?
Feature parity. At the moment R and Python users can access this function only using SQL or `expr` / `selectExpr`.
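For reference, the semantics being exposed, shown through the SQL form that all language APIs already share (output per the function's documented behavior):
```scala
// Replaces one character of 'Spark SQL' starting at 1-based position 6.
spark.sql("SELECT overlay('Spark SQL', '_', 6)").show() // Spark_SQL
```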
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New unit tests.
Closes #27325 from zero323/SPARK-30607.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This reverts commit b5cb9abdd5.
### Why are the changes needed?
The merged commit (#27243) was too risky for several reasons:
1. It doesn't fix a bug
2. It makes the resolution of the table that's going to be altered a child. We had avoided this on purpose, as having an arbitrary rule change the child of AlterTable seemed risky. This alone is a big -1 from me on this change.
3. While the code may look cleaner, this approach makes certain things harder, e.g. differentiating between the Hive-based ALTER TABLE CHANGE COLUMN and ALTER COLUMN syntaxes. Resolving and normalizing columns for ALTER COLUMN also becomes a bit harder, as we now have to check every single AlterTable command instead of just a single ALTER TABLE ALTER COLUMN statement.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests
This closes #27315.
Closes #27327 from brkyvz/revAlter.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Skip resolving the merge expressions if the target is a DSv2 table with ACCEPT_ANY_SCHEMA capability.
### Why are the changes needed?
Some DSv2 sources may want to customize the merge resolution logic. For example, a table that can accept any schema (`TableCapability.ACCEPT_ANY_SCHEMA`) may want to allow certain merge queries that are blocked (that is, that throw an `AnalysisException`) by the default resolution logic. So there should be a way to completely bypass the merge resolution logic in the Analyzer.
### Does this PR introduce any user-facing change?
No, since merge itself is an unreleased feature
### How was this patch tested?
added unit test to specifically test the skipping.
Closes #27326 from tdas/SPARK-30609.
Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
### What changes were proposed in this pull request?
In `org.apache.spark.sql.execution.SubqueryExec#relationFuture`, make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the sub-execution thread in `org.apache.spark.sql.execution.SubqueryExec#executionContext`.
### Why are the changes needed?
Local properties set via the SparkContext are not available as `TaskContext` properties when executing jobs on thread pools whose idle threads are reused.
Explanation:
When `SubqueryExec` runs, the `relationFuture` is evaluated on a separate thread. Those threads inherit the `localProperties` from the SparkContext, as they are its child threads.
The threads are created in the `executionContext` (thread pools). Each thread pool keeps idle threads alive for a default `keepAliveSeconds` of 60 seconds.
In scenarios where the thread pool has idle threads that are reused for a subsequent new query, the thread-local properties are not inherited from the SparkContext (thread properties are inherited only on thread creation) and hence end up stale or unset. This causes task-set properties to be missing when properties are transferred by the child thread via `sparkContext.runJob/submitJob`. A generic sketch of the idea behind the fix follows.
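A generic illustration (not Spark's exact code): snapshot the submitting thread's properties eagerly, then re-install the copy inside the pooled worker, since inheritable thread-locals propagate only at thread creation:
```scala
import java.util.Properties
import scala.concurrent.{ExecutionContext, Future}

val localProps = new InheritableThreadLocal[Properties] {
  override def initialValue(): Properties = new Properties()
}

def submitWithProps[T](body: => T)(implicit ec: ExecutionContext): Future[T] = {
  // Capture on the submitting thread, before handing off to the pool.
  val snapshot = localProps.get().clone().asInstanceOf[Properties]
  Future {
    localProps.set(snapshot) // re-install on the (possibly reused) worker thread
    body
  }
}
```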
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT
Closes #27267 from ajithme/subquerylocalprop.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When you save a Spark UI SQL query page to disk and then open the HTML file in your browser, the query plan is rendered a second time. This change avoids rendering the plan visualization when it already exists.
This is master:
![grafik](https://user-images.githubusercontent.com/44700269/72543429-fcb8d980-3885-11ea-82aa-c0b3638847e5.png)
And with the fix:
![grafik](https://user-images.githubusercontent.com/44700269/72543641-57523580-3886-11ea-8cdf-5fb0cdffa983.png)
### Why are the changes needed?
The duplicate query plan is unexpected and redundant.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested. Testing this in a reproducible way requires a running browser or HTML rendering engine that executes the JavaScript.
Closes #27238 from EnricoMi/branch-sql-ui-duplicate-plan.
Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Document SELECT statement in SQL Reference Guide. In this PR includes the main
entry page for SELECT. I will open follow-up PRs for different clauses.
### Why are the changes needed?
Currently Spark lacks documentation on the supported SQL constructs, causing
confusion among users who sometimes have to look at the code to understand
the usage. This PR is aimed at addressing this issue.
### Does this PR introduce any user-facing change?
Yes.
**Before:**
There was no documentation for this.
**After:**
<img width="972" alt="Screen Shot 2020-01-19 at 11 20 41 PM" src="https://user-images.githubusercontent.com/14225158/72706257-6c42f900-3b12-11ea-821a-171ff035443f.png">
<img width="972" alt="Screen Shot 2020-01-19 at 11 21 55 PM" src="https://user-images.githubusercontent.com/14225158/72706313-91d00280-3b12-11ea-90e4-be7174b4593d.png">
<img width="972" alt="Screen Shot 2020-01-19 at 11 22 16 PM" src="https://user-images.githubusercontent.com/14225158/72706323-97c5e380-3b12-11ea-99e5-e7aaa3b4df68.png">
### How was this patch tested?
Tested using `jekyll build --serve`.
Closes #27216 from dilipbiswal/sql_ref_select_hook.
Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to add an additional constructor to the `Like` expression. The constructor is used when the `like` function is applied with 2 parameters.
### Why are the changes needed?
`FunctionRegistry` cannot find a matching constructor if the `like` function is applied with 2 parameters.
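A minimal analog of the fix (simplified stand-in types, not Spark's actual `Like` expression): a reflection-based registry looks up a constructor matching the argument count, so a 2-arg auxiliary constructor supplying the default escape character makes the 2-parameter call resolvable:
```scala
// `String` stands in for Catalyst's Expression type in this sketch.
case class LikeSketch(input: String, pattern: String, escapeChar: Char) {
  // The auxiliary constructor the PR adds, with '\' as the default escape.
  def this(input: String, pattern: String) = this(input, pattern, '\\')
}

val twoArg = new LikeSketch("Spark", "_park") // now resolvable with 2 arguments
```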
### Does this PR introduce any user-facing change?
Yes, before:
```sql
spark-sql> SELECT like('Spark', '_park');
Invalid arguments for function like; line 1 pos 7
org.apache.spark.sql.AnalysisException: Invalid arguments for function like; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1412)
```
After:
```sql
spark-sql> SELECT like('Spark', '_park');
true
```
### How was this patch tested?
By running `check outputs of expression examples` from `SQLQuerySuite`.
Closes #27323 from MaxGekk/fix-like-func.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add the host-local block size to the total bytes reported in the log.
### Why are the changes needed?
The total size in the log is wrong, as the host-local block size is missing.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Manually checking the log
Closes #27320 from Udbhav30/bug.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>