### What changes were proposed in this pull request?
This is a small follow-up for https://github.com/apache/spark/pull/27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.
### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added test to `SparkPlanSuite`.
Closes#27530 from hvanhovell/SPARK-30780.
Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to revert the commit 8aebc80e0e.
### Why are the changes needed?
See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes#27531 from MaxGekk/revert-like-3-args.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds some more information and context to `spark.sql.defaultUrlStreamHandlerFactory.enabled`.
### Why are the changes needed?
It is a bit difficult to understand the documentation of `spark.sql.defaultUrlStreamHandlerFactory.enabled`.
### Does this PR introduce any user-facing change?
Nope, internal doc only fix.
### How was this patch tested?
Nope. I only tested linter.
Closes#27541 from HyukjinKwon/SPARK-29462-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the case of back-to-back calculation of `floorDiv` and `floorMod` with the same arguments, the result of `foorDiv` can be reused in calculation of `floorMod`. The `floorMod` method is defined as the following in Java standard library:
```java
public static int floorMod(int x, int y) {
int r = x - floorDiv(x, y) * y;
return r;
}
```
If `floorDiv(x, y)` has been already calculated, it can be reused in `x - floorDiv(x, y) * y`.
I propose to modify 2 places in `DateTimeUtils`:
1. `microsToInstant` which is widely used in many date-time functions. `Math.floorMod(us, MICROS_PER_SECOND)` is just replaced by its definition from Java Math library.
2. `truncDate`: `Math.floorMod(oldYear, divider) == 0` is replaced by `Math.floorDiv(oldYear, divider) * divider == oldYear` where `floorDiv(...) * divider` is pre-calculated.
### Why are the changes needed?
This reduces the number of arithmetic operations, and can slightly improve performance of date-time functions.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#27491 from MaxGekk/opt-microsToInstant.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add class document for PruneFileSourcePartitions and PruneHiveTablePartitions.
### Why are the changes needed?
To describe these two classes.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes#27535 from fuwhu/SPARK-15616-FOLLOW-UP.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Document updated for `CACHE TABLE` & `UNCACHE TABLE`
### Why are the changes needed?
Cache table creates a temp view while caching data using `CACHE TABLE name AS query`. `UNCACHE TABLE` does not remove this temp view.
These things were not mentioned in the existing doc for `CACHE TABLE` & `UNCACHE TABLE`.
### Does this PR introduce any user-facing change?
Document updated for `CACHE TABLE` & `UNCACHE TABLE` command.
### How was this patch tested?
Manually
Closes#27090 from iRakson/SPARK-27545.
Lead-authored-by: root1 <raksonrakesh@gmail.com>
Co-authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This brings https://github.com/apache/spark/pull/26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.
- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems `array() + array(1)` -> array of strings
Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.
Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:
> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )
From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.
### Why are the changes needed?
When empty array is created, it should be declared as array<null>.
### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.
### How was this patch tested?
Tested manually
Closes#27521 from HyukjinKwon/SPARK-29462.
Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR tries #26710 (comment) way to fix the test.
### Why are the changes needed?
To make the tests pass.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Jenkins will test first, and then `on spark-branch-3.0-test-sbt-hadoop-2.7-hive-2.3` will test it out.
Closes#27513 from HyukjinKwon/test-SPARK-30756.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 8efe367a4e)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix PySpark test failures for using Pandas >= 1.0.0.
### Why are the changes needed?
Pandas 1.0.0 has recently been released and has API changes that result in PySpark test failures, this PR fixes the broken tests.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested with Pandas 1.0.1 and PyArrow 0.16.0
Closes#27529 from BryanCutler/pandas-fix-tests-1.0-SPARK-30777.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Fix the scope of `Logging.initializeForcefully` so that it doesn't appear in subclasses' public methods. Right now, `sc.initializeForcefully(false, false)` is allowed to called.
- Don't show classes under `org.apache.spark.internal` package in API docs.
- Add missing `since` annotation.
- Fix the scope of `ArrowUtils` to remove it from the API docs.
### Why are the changes needed?
Avoid leaking APIs unintentionally in Spark 3.0.0.
### Does this PR introduce any user-facing change?
No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.
### How was this patch tested?
Manually generated the API docs and verified the above issues have been fixed.
Closes#27528 from zsxwing/audit-ss-apis.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured.
### Why are the changes needed?
For code safety.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#27516 from xuanyuanking/SPARK-30556-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix `cache` initialization in `StringRegexExpression` by changing `case Literal(value: String, StringType)` to `case p: Expression if p.foldable`
### Why are the changes needed?
Actually, the case doesn't work at all because of:
1. Literals value has type `UTF8String`
2. It doesn't work for foldable expressions like in the example:
```sql
SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
```
<img width="649" alt="Screen Shot 2020-02-08 at 22 45 50" src="https://user-images.githubusercontent.com/1580697/74091681-0d4a2180-4acb-11ea-8a0d-7e8c65f4214e.png">
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By the `check outputs of expression examples` test from `SQLQuerySuite`.
Closes#27502 from MaxGekk/str-regexp-foldable-pattern.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This reverts commit 8ce7962931. There's variable name conflicts with 8aebc80e0e (diff-39298b470865a4cbc67398a4ea11e767).
This can be cleanly ported back to branch-3.0.
### Why are the changes needed?
Performance investigation were not made enough and it's not clear if it really beneficial or now.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Jenkins tests.
Closes#27514 from HyukjinKwon/revert-cache-PR.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This is a follow-up for #24938 to tweak error message and migration doc.
### Why are the changes needed?
Making user know workaround if SHOW CREATE TABLE doesn't work for some Hive tables.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#27505 from viirya/SPARK-27946-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Enhance RuleExecutor strategy to take different actions when exceeding max iterations. And raise exception if analyzer exceed max iterations.
### Why are the changes needed?
Currently, both analyzer and optimizer just log warning message if rule execution exceed max iterations. They should have different behavior. Analyzer should raise exception to indicates the plan is not fixed after max iterations, while optimizer just log warning to keep the current plan. This is more feasible after SPARK-30138 was introduced.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add test in AnalysisSuite
Closes#26977 from Eric5553/EnhanceMaxIterations.
Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up in [#27452](https://github.com/apache/spark/pull/27452).
Add a unit test to verify whether the log warning is print when intentionally skip AQE.
### Why are the changes needed?
Add unit test
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
adding unit test
Closes#27515 from JkSelf/aqeLoggingWarningTest.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR updates the documentation on `TableCatalog.alterTable`s behavior on the order by which the requested changes are applied. It now explicitly mentions that the changes are applied in the order given.
### Why are the changes needed?
The current documentation on `TableCatalog.alterTable` doesn't mention which order the requested changes are applied. It will be useful to explicitly document this behavior so that the user can expect the behavior. For example, `REPLACE COLUMNS` needs to delete columns before adding new columns, and if the order is guaranteed by `alterTable`, it's much easier to work with the catalog API.
### Does this PR introduce any user-facing change?
Yes, document change.
### How was this patch tested?
Not added (doc changes).
Closes#27496 from imback82/catalog_table_alter_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add round-trip tests for CSV and JSON functions as https://github.com/apache/spark/pull/27317#discussion_r376745135 asked.
### Why are the changes needed?
improve test coverage
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add uts
Closes#27510 from yaooqinn/SPARK-30592-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This reverts commit a0e63b61e7.
### What changes were proposed in this pull request?
This reverts the patch at #26978 based on gatorsmile's suggestion.
### Why are the changes needed?
Original patch #26978 has not considered a corner case. We may need to put more time on ensuring we can cover all cases.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27504 from viirya/revert-SPARK-29721.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Simplify the changes for adding metrics description for WholeStageCodegen in https://github.com/apache/spark/pull/27405
### Why are the changes needed?
In https://github.com/apache/spark/pull/27405, the UI changes can be made without using the function `adjustPositionOfOperationName` to adjust the position of operation name and mark as an operation-name class.
I suggest we make simpler changes so that it would be easier for future development.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test with the queries provided in https://github.com/apache/spark/pull/27405
```
sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").show
sc.parallelize(1 to 10).toDF.sort("value").filter("value > 1").selectExpr("value * 2").write.format("json").mode("overwrite").save("/tmp/test_output")
sc.parallelize(1 to 10).toDF.write.format("json").mode("append").save("/tmp/test_output")
```
![image](https://user-images.githubusercontent.com/1097932/74073629-e3f09f00-49bf-11ea-90dc-1edb5ca29e5e.png)
Closes#27490 from gengliangwang/wholeCodegenUI.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize ``` in ALS, so user can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes#27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add the new tab `SQL` in the `Data Types` page.
### Why are the changes needed?
New type added in SPARK-29587.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Locally test by Jekyll.
![image](https://user-images.githubusercontent.com/4833765/73908593-2e511d80-48e5-11ea-85a7-6ee451e6b727.png)
Closes#27447 from xuanyuanking/SPARK-29587-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up for #25029, in this PR we throw an AnalysisException when name conflict is detected in nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the expected behavior.
### Why are the changes needed?
The original change might risky to end-users, it changes behavior silently.
### Does this PR introduce any user-facing change?
Yes, change the config `spark.sql.legacy.ctePrecedence.enabled` as optional.
### How was this patch tested?
New UT.
Closes#27454 from xuanyuanking/SPARK-28228-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revert
#27360#27396#27374#27389
### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.
### Does this PR introduce any user-facing change?
remove newly added param blockSize
### How was this patch tested?
reverted testsuites
Closes#27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The current ALTER COLUMN syntax allows to change multiple properties at a time:
```
ALTER TABLE table=multipartIdentifier
(ALTER | CHANGE) COLUMN? column=multipartIdentifier
(TYPE dataType)?
(COMMENT comment=STRING)?
colPosition?
```
The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan)
This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time.
Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior.
### Why are the changes needed?
To follow SQL standard (and other recent SQL systems) behavior.
### Does this PR introduce any user-facing change?
Yes, now the user can update the column properties only one at a time.
For example,
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment'
```
should be broken into
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
```
### How was this patch tested?
Updated existing tests.
Closes#27444 from imback82/alter_column_one_at_a_time.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API
- Change types of `convertTz` parameters from `TimeZone` to `ZoneId`. This allows to avoid unnecessary conversions `TimeZone` -> `ZoneId` and performance regressions as a consequence.
### Why are the changes needed?
- Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example:
```scala
scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show
+-------------------+
| local UTC|
+-------------------+
|2019-11-03 03:00:00|
+-------------------+
```
but the result must be 2019-11-03 04:00:00:
<img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png">
- Simplifies the code, and make it more maintainable
- Switches `convertTz` on Proleptic Gregorian calendar used by Java 8 time classes by default. That makes the function consistent to other date-time functions.
### Does this PR introduce any user-facing change?
Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`.
### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
- Added `convert time zones on a daylight saving day` to DateFunctionsSuite
Closes#27474 from MaxGekk/port-convertTz-on-Java8-api.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes some typos in `python/pyspark/sql/types.py` file.
### Why are the changes needed?
To deliver correct wording in documentation and codes.
### Does this PR introduce any user-facing change?
Yes, it fixes some typos in user-facing API documentation.
### How was this patch tested?
Locally tested the linter.
Closes#27475 from sharifahmad2061/master.
Lead-authored-by: sharif ahmad <sharifahmad2061@gmail.com>
Co-authored-by: Sharif ahmad <sharifahmad2061@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26656.
We don't support window aggregate function with filter predicate yet and we should fail explicitly.
Observable metrics has the same issue. This PR fixes it as well.
### Why are the changes needed?
If we simply ignore filter predicate when we don't support it, the result is wrong.
### Does this PR introduce any user-facing change?
yea, fix the query result.
### How was this patch tested?
new tests
Closes#27476 from cloud-fan/filter.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally.
This PR also add a config to not skip AQE.
### Why are the changes needed?
It's not a warning at all if we intentionally skip AQE.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
run `AdaptiveQueryExecSuite` locally and verify that there is no warning logs.
Closes#27452 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Add new config `spark.network.maxRemoteBlockSizeFetchToMem` fallback to the old config `spark.maxRemoteBlockSizeFetchToMem`.
### Why are the changes needed?
For naming consistency.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27463 from xuanyuanking/SPARK-26700-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add examples and parameter description for these Scala functions:
* transform
* exists
* forall
* aggregate
* zip_with
* transform_keys
* transform_values
* map_filter
* map_zip_with
### Why are the changes needed?
Better documentation for UX.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27449 from Ngone51/doc-funcs.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel.
### Why are the changes needed?
For better performance.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27471 from Ngone51/dev_calculate_in_parallel.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When I running the `window_part2.sql` tests find it lack insert sql. Therefore, the output is empty.
I checked the postgresql and reference https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql
Although `window_part1.sql` and `window_part3.sql` exists the insert sql, I think should also add it into `window_part2.sql`.
Because only one case reference the table `empsalary` and it throws `AnalysisException`.
```
-- !query
select last(salary) over(order by salary range between 1000 preceding and 1000 following),
lag(salary) over(order by salary range between 1000 preceding and 1000 following),
salary from empsalary
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1);
```
So we should do four work:
1. comment out the only one case and create a new ticket.
2. Add `INSERT INTO empsalary`.
Note: window_part4.sql not use the table `empsalary`.
### Why are the changes needed?
Supplementary test data.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New test case
Closes#27439 from beliefer/add-insert-to-window.
Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.
This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```
### Why are the changes needed?
This is a bug because you cannot qualify column names in queries.
### Does this PR introduce any user-facing change?
Yes, now users can qualify column names for v2 tables.
### How was this patch tested?
Added new tests.
Closes#27391 from imback82/qualified_col.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Eagerly filter out zombie `TaskSetManager` before offering resources to reduce any overhead as possible.
And this PR also avoid doing `recomputeLocality` and `addPendingTask` when `TaskSetManager` is zombie.
### Why are the changes needed?
Zombie `TaskSetManager` could still exist in Pool's `schedulableQueue` when it has running tasks. Offering resources on a zombie `TaskSetManager` could bring unnecessary overhead and is meaningless.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#27455 from Ngone51/exclude-zombie-tsm.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to make hardcoded `python3` to a variable `PYTHON_EXECUTABLE` in ' lint-python' script.
### Why are the changes needed?
To make changes easier. See 561e9b9688 as an example.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually by running `dev/lint-python`.
Closes#27470 from HyukjinKwon/minor-PYTHON_EXECUTABLE.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, it doesn't seem to be possible to have Spark Driver set the serviceAccountName for executor pods it launches.
### Why are the changes needed?
it will allow settings serviceAccountName for executors pods.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It was covered by unit test.
Closes#27034 from ayudovin/srevice-account-name-for-executor-pods.
Authored-by: yudovin <artsiom.yudovin@profitero.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
update `DataFrameAggregateSuite` to make it pass with AQE
### Why are the changes needed?
We don't need to turn off AQE in `DataFrameAggregateSuite`
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
run `DataFrameAggregateSuite` locally with AQE on.
Closes#27451 from cloud-fan/aqe-test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Follow up work for SPARK-29864, reference the config `spark.sql.legacy.fromDayTimeString.enabled` in error message.
### Why are the changes needed?
For better usability.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#27464 from xuanyuanking/SPARK-29864-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR use a specific version of docker image instead of `latest`. As of today, when I run K8s integration test locally, this test case fails always.
Also, in this PR, I shows two consecutive failures with a dummy change.
- https://github.com/apache/spark/pull/27465#issuecomment-582326614
- https://github.com/apache/spark/pull/27465#issuecomment-582329114
```
- Launcher client dependencies *** FAILED ***
```
After that, I added the patch and K8s Integration test passed.
- https://github.com/apache/spark/pull/27465#issuecomment-582361696
### Why are the changes needed?
[SPARK-28465](https://github.com/apache/spark/pull/25222) switched from `v4.0.0-stable-4.0-master-centos-7-x86_64` to `latest` to catch up the API change. However, the API change seems to occur again. We had better use a specific version to prevent accidental failures.
```scala
- .withImage("ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64")
+ .withImage("ceph/daemon:latest")
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass `Launcher client dependencies` test in Jenkins K8s Integration Suite.
Or, run K8s Integration test locally.
Closes#27465 from dongjoon-hyun/SPARK-K8S-IT.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add migration note for removing `org.apache.spark.ml.image.ImageSchema.readImages`
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes#27467 from WeichenXu123/SC-26286.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647 and does not throw `ArithmeticException`.
The root cause is that, the below code does not work for some corner cases.
94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)
For example:
![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)
In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.
### Why are the changes needed?
Result corrupt.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added Unit test.
Closes#27151 from turboFei/SPARK-26218-follow-up-int-overflow.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0181, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`.
To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.
### Why are the changes needed?
To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`.
Closes#27441 from MaxGekk/support-simpledateformat.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1
### Why are the changes needed?
New fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit and integration tests cover creation of K8S objects. Adjusted them to work with the new fabric8 version
Closes#27443 from onursatici/os/bump-fabric8.
Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>