This reverts commit a0e63b61e7.
### What changes were proposed in this pull request?
This reverts the patch at #26978 based on gatorsmile's suggestion.
### Why are the changes needed?
Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27504 from viirya/revert-SPARK-29721.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Simplify the changes for adding metrics description for WholeStageCodegen in https://github.com/apache/spark/pull/27405
### Why are the changes needed?
In https://github.com/apache/spark/pull/27405, the UI changes can be made without using the function `adjustPositionOfOperationName` to adjust the position of operation name and mark as an operation-name class.
I suggest we make simpler changes so that it would be easier for future development.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test with the queries provided in https://github.com/apache/spark/pull/27405
```
sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show
sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output")
sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output")
```
![image](https://user-images.githubusercontent.com/1097932/74073629-e3f09f00-49bf-11ea-90dc-1edb5ca29e5e.png)
Closes#27490 from gengliangwang/wholeCodegenUI.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize ``` in ALS, so user can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes#27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add the new tab `SQL` in the `Data Types` page.
### Why are the changes needed?
New type added in SPARK-29587.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Locally test by Jekyll.
![image](https://user-images.githubusercontent.com/4833765/73908593-2e511d80-48e5-11ea-85a7-6ee451e6b727.png)
Closes#27447 from xuanyuanking/SPARK-29587-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up for #25029, in this PR we throw an AnalysisException when name conflict is detected in nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the expected behavior.
### Why are the changes needed?
The original change might risky to end-users, it changes behavior silently.
### Does this PR introduce any user-facing change?
Yes, change the config `spark.sql.legacy.ctePrecedence.enabled` as optional.
### How was this patch tested?
New UT.
Closes#27454 from xuanyuanking/SPARK-28228-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revert
#27360#27396#27374#27389
### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.
### Does this PR introduce any user-facing change?
remove newly added param blockSize
### How was this patch tested?
reverted testsuites
Closes#27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The current ALTER COLUMN syntax allows to change multiple properties at a time:
```
ALTER TABLE table=multipartIdentifier
(ALTER | CHANGE) COLUMN? column=multipartIdentifier
(TYPE dataType)?
(COMMENT comment=STRING)?
colPosition?
```
The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan)
This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time.
Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior.
### Why are the changes needed?
To follow SQL standard (and other recent SQL systems) behavior.
### Does this PR introduce any user-facing change?
Yes, now the user can update the column properties only one at a time.
For example,
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment'
```
should be broken into
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
```
### How was this patch tested?
Updated existing tests.
Closes#27444 from imback82/alter_column_one_at_a_time.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API
- Change types of `convertTz` parameters from `TimeZone` to `ZoneId`. This allows to avoid unnecessary conversions `TimeZone` -> `ZoneId` and performance regressions as a consequence.
### Why are the changes needed?
- Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example:
```scala
scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show
+-------------------+
| local UTC|
+-------------------+
|2019-11-03 03:00:00|
+-------------------+
```
but the result must be 2019-11-03 04:00:00:
<img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png">
- Simplifies the code, and make it more maintainable
- Switches `convertTz` on Proleptic Gregorian calendar used by Java 8 time classes by default. That makes the function consistent to other date-time functions.
### Does this PR introduce any user-facing change?
Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`.
### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
- Added `convert time zones on a daylight saving day` to DateFunctionsSuite
Closes#27474 from MaxGekk/port-convertTz-on-Java8-api.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes some typos in `python/pyspark/sql/types.py` file.
### Why are the changes needed?
To deliver correct wording in documentation and codes.
### Does this PR introduce any user-facing change?
Yes, it fixes some typos in user-facing API documentation.
### How was this patch tested?
Locally tested the linter.
Closes#27475 from sharifahmad2061/master.
Lead-authored-by: sharif ahmad <sharifahmad2061@gmail.com>
Co-authored-by: Sharif ahmad <sharifahmad2061@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26656.
We don't support window aggregate function with filter predicate yet and we should fail explicitly.
Observable metrics has the same issue. This PR fixes it as well.
### Why are the changes needed?
If we simply ignore filter predicate when we don't support it, the result is wrong.
### Does this PR introduce any user-facing change?
yea, fix the query result.
### How was this patch tested?
new tests
Closes#27476 from cloud-fan/filter.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally.
This PR also add a config to not skip AQE.
### Why are the changes needed?
It's not a warning at all if we intentionally skip AQE.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
run `AdaptiveQueryExecSuite` locally and verify that there is no warning logs.
Closes#27452 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Add new config `spark.network.maxRemoteBlockSizeFetchToMem` fallback to the old config `spark.maxRemoteBlockSizeFetchToMem`.
### Why are the changes needed?
For naming consistency.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27463 from xuanyuanking/SPARK-26700-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add examples and parameter description for these Scala functions:
* transform
* exists
* forall
* aggregate
* zip_with
* transform_keys
* transform_values
* map_filter
* map_zip_with
### Why are the changes needed?
Better documentation for UX.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27449 from Ngone51/doc-funcs.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel.
### Why are the changes needed?
For better performance.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27471 from Ngone51/dev_calculate_in_parallel.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When I running the `window_part2.sql` tests find it lack insert sql. Therefore, the output is empty.
I checked the postgresql and reference https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql
Although `window_part1.sql` and `window_part3.sql` exists the insert sql, I think should also add it into `window_part2.sql`.
Because only one case reference the table `empsalary` and it throws `AnalysisException`.
```
-- !query
select last(salary) over(order by salary range between 1000 preceding and 1000 following),
lag(salary) over(order by salary range between 1000 preceding and 1000 following),
salary from empsalary
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1);
```
So we should do four work:
1. comment out the only one case and create a new ticket.
2. Add `INSERT INTO empsalary`.
Note: window_part4.sql not use the table `empsalary`.
### Why are the changes needed?
Supplementary test data.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New test case
Closes#27439 from beliefer/add-insert-to-window.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.
This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```
### Why are the changes needed?
This is a bug because you cannot qualify column names in queries.
### Does this PR introduce any user-facing change?
Yes, now users can qualify column names for v2 tables.
### How was this patch tested?
Added new tests.
Closes#27391 from imback82/qualified_col.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Eagerly filter out zombie `TaskSetManager` before offering resources to reduce any overhead as possible.
And this PR also avoid doing `recomputeLocality` and `addPendingTask` when `TaskSetManager` is zombie.
### Why are the changes needed?
Zombie `TaskSetManager` could still exist in Pool's `schedulableQueue` when it has running tasks. Offering resources on a zombie `TaskSetManager` could bring unnecessary overhead and is meaningless.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27455 from Ngone51/exclude-zombie-tsm.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to make hardcoded `python3` to a variable `PYTHON_EXECUTABLE` in ' lint-python' script.
### Why are the changes needed?
To make changes easier. See 561e9b9688 as an example.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually by running `dev/lint-python`.
Closes#27470 from HyukjinKwon/minor-PYTHON_EXECUTABLE.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, it doesn't seem to be possible to have Spark Driver set the serviceAccountName for executor pods it launches.
### Why are the changes needed?
it will allow settings serviceAccountName for executors pods.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It was covered by unit test.
Closes#27034 from ayudovin/srevice-account-name-for-executor-pods.
Authored-by: yudovin <artsiom.yudovin@profitero.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
update `DataFrameAggregateSuite` to make it pass with AQE
### Why are the changes needed?
We don't need to turn off AQE in `DataFrameAggregateSuite`
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
run `DataFrameAggregateSuite` locally with AQE on.
Closes#27451 from cloud-fan/aqe-test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Follow up work for SPARK-29864, reference the config `spark.sql.legacy.fromDayTimeString.enabled` in error message.
### Why are the changes needed?
For better usability.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27464 from xuanyuanking/SPARK-29864-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR use a specific version of docker image instead of `latest`. As of today, when I run K8s integration test locally, this test case fails always.
Also, in this PR, I shows two consecutive failures with a dummy change.
- https://github.com/apache/spark/pull/27465#issuecomment-582326614
- https://github.com/apache/spark/pull/27465#issuecomment-582329114
```
- Launcher client dependencies *** FAILED ***
```
After that, I added the patch and K8s Integration test passed.
- https://github.com/apache/spark/pull/27465#issuecomment-582361696
### Why are the changes needed?
[SPARK-28465](https://github.com/apache/spark/pull/25222) switched from `v4.0.0-stable-4.0-master-centos-7-x86_64` to `latest` to catch up the API change. However, the API change seems to occur again. We had better use a specific version to prevent accidental failures.
```scala
- .withImage("ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64")
+ .withImage("ceph/daemon:latest")
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass `Launcher client dependencies` test in Jenkins K8s Integration Suite.
Or, run K8s Integration test locally.
Closes#27465 from dongjoon-hyun/SPARK-K8S-IT.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add migration note for removing `org.apache.spark.ml.image.ImageSchema.readImages`
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes#27467 from WeichenXu123/SC-26286.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647 and does not throw `ArithmeticException`.
The root cause is that, the below code does not work for some corner cases.
94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)
For example:
![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)
In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.
### Why are the changes needed?
Result corrupt.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added Unit test.
Closes#27151 from turboFei/SPARK-26218-follow-up-int-overflow.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0181, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`.
To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.
### Why are the changes needed?
To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`.
Closes#27441 from MaxGekk/support-simpledateformat.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1
### Why are the changes needed?
New fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit and integration tests cover creation of K8S objects. Adjusted them to work with the new fabric8 version
Closes#27443 from onursatici/os/bump-fabric8.
Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
If `updateBlockInfo` returns false, which means the `BlockManager` will re-register and report all blocks later. So, we may report two times for the same block, which causes `AppStatusListener` to count used memory for two times, too. As a result, the used memory can exceed the total memory.
So, this PR changes it to not post `SparkListenerBlockUpdated` when `updateBlockInfo` returns false. And, always clean up used memory whenever `AppStatusListener` receives `SparkListenerBlockManagerAdded`.
### Why are the changes needed?
This PR tries to fix negative memory usage in UI (https://user-images.githubusercontent.com/3488126/72131225-95e37e00-33b6-11ea-8708-6e5ed328d1ca.png, see #27144 ). Though, I'm not very sure this is the root cause for #27144 since known information is limited here.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new tests by xuanyuanking
Closes#27306 from Ngone51/fix-possible-negative-memory.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are currently the R test failures after upgrading `testthat` to 2.0.0, and R version 3.5.2 as of SPARK-23435. This PR targets to fix the tests and make the tests pass. See the explanations and causes below:
```
test_context.R:49: failure: Check masked functions
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 6 - 4 == 2
test_context.R:53: failure: Check masked functions
sort(maskedCompletely, na.last = TRUE) not equal to sort(namesOfMaskedCompletely, na.last = TRUE).
5/6 mismatches
x[2]: "endsWith"
y[2]: "filter"
x[3]: "filter"
y[3]: "not"
x[4]: "not"
y[4]: "sample"
x[5]: "sample"
y[5]: NA
x[6]: "startsWith"
y[6]: NA
```
From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.
```
test_includePackage.R:31: error: include inside function
package or namespace load failed for ���plyr���:
package ���plyr��� was installed by an R version with different internals; it needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.
```
From my cursory look, previously installed `plyr` remains and it's not compatible with the new R version. I fixed accordingly and Jenkins will test it out.
```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
Your system is mis-configured: ���/etc/localtime��� is not a symlink
```
Seems a env problem. I suppressed the warnings for now.
```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
It is strongly recommended to set envionment variable TZ to ���America/Los_Angeles��� (or equivalent)
```
Seems a env problem. I suppressed the warnings for now.
```
test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function ���startsWith��� for signature ���"character"���
1: expect_true(startsWith("Hello World", "Hello")) at /home/jenkins/workspace/SparkPullRequestBuilder2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable)
{
methods <- .findInheritedMethods(classes, fdef, mtable)
if (length(methods) == 1L)
return(methods[[1L]])
else if (length(methods) == 0L) {
cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", collapse = ", ")
stop(gettextf("unable to find an inherited method for function %s for signature %s",
sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
}
else stop("Internal error in finding inherited methods; didn't return a unique method",
domain = NA)
})(list("character"), new("nonstandardGenericFunction", .Data = function (x, prefix)
{
standardGeneric("startsWith")
}, generic = structure("startsWith", package = "SparkR"), package = "SparkR", group = list(),
valueClass = character(0), signature = c("x", "prefix"), default = NULL, skeleton = (function (x,
prefix)
stop("invalid call in method dispatch to 'startsWith' (no default method)", domain = NA))(x,
prefix)), <environment>)
6: stop(gettextf("unable to find an inherited method for function %s for signature %s",
sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
```
From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.
Also, this PR causes a CRAN check failure as below:
```
* creating vignettes ... ERROR
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
package ���htmltools��� was installed by an R version with different internals; it needs to be reinstalled for use with this R version
```
This PR disables it for now.
### Why are the changes needed?
To unblock other PRs.
### Does this PR introduce any user-facing change?
No. Test only and dev only.
### How was this patch tested?
No. I am going to use Jenkins to test.
Closes#27460 from HyukjinKwon/r-test-failure.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given:
- `spark.sql.execution.pandas.arrowSafeTypeConversion`
- `spark.sql.execution.pandas.respectSessionTimeZone`
- `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName`
- other configurations like `spark.sql.execution.arrow.*`.
### Why are the changes needed?
To make configuration names consistent.
### Does this PR introduce any user-facing change?
No because this configuration was not released yet.
### How was this patch tested?
Existing tests should cover.
Closes#27450 from HyukjinKwon/SPARK-27870-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This reverts commit b89c3de1a4.
### Why are the changes needed?
`FIRST_VALUE` is used only for window expression. Please see the discussion on https://github.com/apache/spark/pull/25082 .
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Pass the Jenkins.
Closes#27458 from dongjoon-hyun/SPARK-28310.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up for #22787. In #22787 we disallowed empty strings for json parser except for string and binary types. This follow-up adds a legacy config for restoring previous behavior of allowing empty string.
### Why are the changes needed?
Adding a legacy config to make migration easy for Spark users.
### Does this PR introduce any user-facing change?
Yes. If set this legacy config to true, the users can restore previous behavior prior to Spark 3.0.0.
### How was this patch tested?
Unit test.
Closes#27456 from viirya/SPARK-25040-followup.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
All legacy SQL configs are marked as internal configs. In particular, the following configs are updated as internals:
- spark.sql.legacy.sizeOfNull
- spark.sql.legacy.replaceDatabricksSparkAvro.enabled
- spark.sql.legacy.typeCoercion.datetimeToString.enabled
- spark.sql.legacy.looseUpcast
- spark.sql.legacy.arrayExistsFollowsThreeValuedLogic
### Why are the changes needed?
In general case, users shouldn't change legacy configs, so, they can be marked as internals.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Should be tested by jenkins build and run tests.
Closes#27448 from MaxGekk/legacy-internal-sql-conf.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the SQL migration guide, and clarify behavior change of typed `TIMESTAMP` and `DATE` literals for input strings without time zone information - local timestamp and date strings.
### Why are the changes needed?
To inform users that the typed literals may change their behavior in Spark 3.0 because of different sources of the default time zone - JVM system time zone in Spark 2.4 and earlier, and `spark.sql.session.timeZone` in Spark 3.0.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#27435 from MaxGekk/timestamp-lit-migration-guide.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is to fix a potential bug in AQE where an `ExecSubqueryExpression` could be mistakenly replaced with another `ExecSubqueryExpression` with the same `ListQuery` but a different `child` expression.
This is because a ListQuery's id can only identify the ListQuery itself, not the parent expression `InSubquery`, but right now the `subqueryMap` in `InsertAdaptiveSparkPlan` uses the `ListQuery`'s id as key and the corresponding `InSubqueryExec` for the `ListQuery`'s parent expression as value. So the fix uses the corresponding `SubqueryExec` for the `ListQuery` itself as the map's value.
### Why are the changes needed?
This logical bug could potentially cause a wrong query plan, which could throw an exception related to unresolved columns.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Passed existing UTs.
Closes#27446 from maryannxue/spark-30717.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR removes the `jdk.tools:jdk.tools` transitive dependency from `hadoop-yarn-api`.
- This is only used in `hadoop-annotation` project in some `*Doclet.java`.
### Why are the changes needed?
Although this is not used in Apache Spark, this can cause a resolve failure in JDK11 environment.
<img width="530" alt="jdk tools" src="https://user-images.githubusercontent.com/9700541/73697745-2f3f4080-4694-11ea-95a7-228638e31cf7.png">
### Does this PR introduce any user-facing change?
No. This is a dev-only change.
From developers, this will remove the `Cannot resolve` error in IDE environment.
### How was this patch tested?
- Pass the Jenkins in JDK8
- Manually, import the project with JDK11.
Closes#27445 from dongjoon-hyun/SPARK-30718.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
HiveTableScanExec does not prune partitions again after SessionCatalog.listPartitionsByFilter called.
### Why are the changes needed?
In HiveTableScanExec, it will push down to hive metastore for partition pruning if spark.sql.hive.metastorePartitionPruning is true, and then it will prune the returned partitions again using partition filters, because some predicates, eg. "b like 'xyz'", are not supported in hive metastore. But now this problem is already fixed in HiveExternalCatalog.listPartitionsByFilter, the HiveExternalCatalog.listPartitionsByFilter can return exactly what we want now. So it is not necessary any more to double prune in HiveTableScanExec.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Existing unit tests.
Closes#27232 from fuwhu/SPARK-30525.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Put the configs below needed by Structured Streaming UI into StaticSQLConf:
- spark.sql.streaming.ui.enabled
- spark.sql.streaming.ui.retainedProgressUpdates
- spark.sql.streaming.ui.retainedQueries
### Why are the changes needed?
Make all SS UI configs consistent with other similar configs in usage and naming.
### Does this PR introduce any user-facing change?
Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`.
### How was this patch tested?
Existing UT.
Closes#27425 from xuanyuanking/SPARK-29543-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
### What changes were proposed in this pull request?
Adds NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method for SessionCatalog.
### Why are the changes needed?
This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added test to DataSourceV2SessionCatalogSuite
Closes#27423 from brkyvz/isViewF.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove duplicate `name` tags from `read.df` and `read.stream`.
### Why are the changes needed?
These tags are already present in
1adf3520e3/R/pkg/R/SQLContext.R (L546)
and
1adf3520e3/R/pkg/R/SQLContext.R (L678)
for `read.df` and `read.stream` respectively.
As only one `name` tag per block is allowed, this causes build warnings with recent `roxygen2` versions:
```
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:559] name May only use one name per block
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:690] name May only use one name per block
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27437 from zero323/roxygen-warnings-names.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5.
### Why are the changes needed?
`jekyll-redirect-from` released 0.16.0 a few days ago and that requires Ruby 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
```
$ cd dev/create-release/spark-rm/
$ docker build -t spark:test .
...
ERROR: Error installing jekyll-redirect-from:
jekyll-redirect-from requires Ruby version >= 2.4.0.
...
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually do the above command to build `spark-rm` Docker image.
```
...
Successfully installed jekyll-redirect-from-0.15.0
Parsing documentation for jekyll-redirect-from-0.15.0
Installing ri documentation for jekyll-redirect-from-0.15.0
Done installing documentation for jekyll-redirect-from after 0 seconds
1 gem installed
Successfully installed rouge-3.15.0
Parsing documentation for rouge-3.15.0
Installing ri documentation for rouge-3.15.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container e0ec7c77b69f
---> 32dec37291c6
```
Closes#27434 from dongjoon-hyun/SPARK-30704.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
We have upgraded the built-in Hive from 1.2 to 2.3. This may need to set `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` according to the version of your Hive metastore. Example:
```
--conf spark.sql.hive.metastore.version=1.2.1 --conf spark.sql.hive.metastore.jars=/root/hive-1.2.1-lib/*
```
Otherwise:
```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req';
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:841)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:431)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:52)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:226)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3487)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3485)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:226)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
... 47 elided
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req'
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1282)
at org.apache.spark.sql.hive.client.HiveClientImpl.getRawTableOption(HiveClientImpl.scala:422)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$tableExists$1(HiveClientImpl.scala:436)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:322)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
at org.apache.spark.sql.hive.client.HiveClientImpl.tableExists(HiveClientImpl.scala:436)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$tableExists$1(HiveExternalCatalog.scala:841)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
... 63 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
at com.sun.proxy.$Proxy38.getTable(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336)
at com.sun.proxy.$Proxy38.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274)
... 74 more
```
### Why are the changes needed?
Improve documentation.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
```SKIP_API=1 jekyll build```:
![image](https://user-images.githubusercontent.com/5399861/73531432-67a50b80-4455-11ea-9401-5cad12fd3d14.png)
Closes#27161 from wangyum/SPARK-27686.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
var `negThetaSum` is always used together with `pi`, so we can add them at first
### Why are the changes needed?
only need to add one var `piMinusThetaSum`, instead of `pi` and `negThetaSum`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27427 from zhengruifeng/nb_predict.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR aims to increase the timeout of `StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy` from 30s (default) to 60s.
In this PR, before increasing the timeout,
1. I verified that this is not a JDK11 environmental issue by repeating 3 times first.
2. I reproduced the accuracy failure by reducing the timeout in Jenkins (https://github.com/apache/spark/pull/27424#issuecomment-580981262)
Then, the final commit passed the Jenkins.
### Why are the changes needed?
This seems to happen when Jenkins environment has congestion and the jobs are slowdown. The streaming job seems to be unable to repeat the designed iteration `numIteration=25` in 30 seconds. Since the error is decreasing at each iteration, the failure occurs.
By reducing the timeout, we can reproduce the similar issue locally like Jenkins.
```python
- eventually(condition, catch_assertions=True)
+ eventually(condition, timeout=10.0, catch_assertions=True)
```
```
$ python/run-tests --testname 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy' --python-executables=python
...
======================================================================
FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
eventually(condition, timeout=10.0, catch_assertions=True)
File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 86, in eventually
raise lastValue
Reproduce the error
File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 77, in eventually
lastValue = condition()
File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.25749106949322637 != 0.1 within 1 places (0.15749106949322636 difference)
----------------------------------------------------------------------
Ran 1 test in 14.814s
FAILED (failures=1)
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins (and manual check by reducing the timeout).
Since this is a flakiness issue depending on the Jenkins job situation, it's difficult to reproduce there.
Closes#27424 from dongjoon-hyun/SPARK-TEST.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
I found checkstyle have a new release https://checkstyle.org/releasenotes.html#Release_8.29
Bumps checkstyle from 8.25 to 8.29.
### Why are the changes needed?
I have bump checkstyle from 8.25 to 8.29 on my fork branch and test to build.
It's OK
8.29 added some new features:
- New Check: AvoidNoArgumentSuperConstructorCall.
- New Check NoEnumTrailingComma.
- ENUM_DEF token support in RightCurlyCheck.
- FallThrough module does not support the spelling "fall-through" by default.
8.29 fix some bugs:
- Java 8 Grammar: annotations on varargs parameters.
- Sonar violation: Disable XML external entity (XXE) processing.
- Disable instantiation of modules with private ctor.
- Sonar violation: "ThreadLocal" variables should be cleaned up when no longer used.
- Indentation incorrect level for chained method with bracket on new line.
- InvalidJavadocPosition: false positive when comment is between javadoc and package.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No UT
Closes#27426 from beliefer/bump-checkstyle.
Authored-by: jiaan geng <jiaan.geng@jiaandeMacBook-Air.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This change is to allow custom resource scheduler (GPUs,FPGAs,etc) resource discovery to be more flexible. Users are asking for it to work with hadoop 2.x versions that do not support resource scheduling in YARN and/or also they may not run in an isolated environment.
This change creates a plugin api that users can write their own resource discovery class that allows a lot more flexibility. The user can chain plugins for different resource types. The user specified plugins execute in the order specified and will fall back to use the discovery script plugin if they don't return information for a particular resource.
I had to open up a few of the classes to be public and change them to not be case classes and make them developer api in order for the the plugin to get enough information it needs.
I also relaxed the yarn side so that if yarn isn't configured for resource scheduling we just warn and go on. This helps users that have yarn 3.1 but haven't configured the resource scheduling side on their cluster yet, or aren't running in isolated environment.
The user would configured this like:
--conf spark.resources.discovery.plugin="org.apache.spark.resource.ResourceDiscoveryFPGAPlugin, org.apache.spark.resource.ResourceDiscoveryGPUPlugin"
Note the executor side had to be wrapped with a classloader to make sure we include the user classpath for jars they specified on submission.
Note this is more flexible because the discovery script has limitations such as spawning it in a separate process. This means if you are trying to allocate resources in that process they might be released when the script returns. Other things are the class makes it more flexible to be able to integrate with existing systems and solutions for assigning resources.
### Why are the changes needed?
to more easily use spark resource scheduling with older versions of hadoop or in non-isolated enivronments.
### Does this PR introduce any user-facing change?
Yes a plugin api
### How was this patch tested?
Unit tests added and manual testing done on yarn and standalone modes.
Closes#27410 from tgravescs/hadoop27spark3.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>