Commit graph

29910 commits

Author SHA1 Message Date
Angerszhuuuu 1db031f158 [SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited
### What changes were proposed in this pull request?
Add note in migration guide about  DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited

### Why are the changes needed?
Add note

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 17:47:47 +03:00
Liang-Chi Hsieh bdac19184a [SPARK-35230][SQL] Move custom metric classes to proper package
### What changes were proposed in this pull request?

This patch moves DS v2 custom metric classes to `org.apache.spark.sql.connector.metric` package. Moving `CustomAvgMetric` and `CustomSumMetric` to above package and make them as public java abstract class too.

### Why are the changes needed?

`CustomAvgMetric` and `CustomSumMetric`  should be public APIs for developers to extend. As there are a few metric classes, we should put them together in one package.

### Does this PR introduce _any_ user-facing change?

No, dev only and they are not released yet.

### How was this patch tested?

Unit tests.

Closes #32348 from viirya/move-custom-metric-classes.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-26 07:19:36 -07:00
Cheng Pan 84026d794a [SPARK-35223] Add IssueNavigationLink
### What changes were proposed in this pull request?

Add `IssueNavigationLink` to make IDEA support hyperlink on JIRA Ticket and GitHub PR on Git plugin.

![image](https://user-images.githubusercontent.com/26535726/115997353-5ecdc600-a615-11eb-99eb-6acbf15d8626.png)

### Why are the changes needed?

Make it more friendly for developers which using IDEA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32337 from pan3793/SPARK-35223.

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-26 21:51:21 +08:00
beliefer c0a3c0cbbe [SPARK-35088][SQL] Accept ANSI intervals by the Sequence expression
### What changes were proposed in this pull request?
This PR makes `Sequence` expression supports ANSI intervals as step expression.
If the start and stop expression is `TimestampType,` then the step expression could select year-month or day-time interval.
If the start and stop expression is `DateType,` then the step expression must be year-month.

### Why are the changes needed?
Extends the function of `Sequence` expression.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users could use ANSI intervals as step expression for `Sequence` expression.

### How was this patch tested?
New tests.

Closes #32311 from beliefer/SPARK-35088.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 10:05:57 +03:00
Adam Binford 74afc68e21 [SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations
### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in https://github.com/apache/spark/pull/29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-25 23:39:56 -07:00
Max Gekk d572a85989 [SPARK-35224][SQL][TESTS] Fix buffer overflow in MutableProjectionSuite
### What changes were proposed in this pull request?
In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix unsafe buffer size calculation to be able to place all input fields without buffer overflow + meta-data.

### Why are the changes needed?
To make the test suite `MutableProjectionSuite` more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32339 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 09:35:24 +03:00
Venkata krishnan Sowrirajan 38ef4771d4 [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 for push-based shuffle.
Summary of changes:

- Introduce `MergeStatus` which tracks the partition level metadata for a merged shuffle partition in the Spark driver
- Unify `MergeStatus` and `MapStatus` under a single trait to allow code reusing inside `MapOutputTracker`
- Extend `MapOutputTracker` to support registering / unregistering `MergeStatus`, calculate preferred locations for a shuffle taking into consideration of merged shuffle partitions, and serving reducer requests for block fetching locations with merged shuffle partitions.

The added APIs in `MapOutputTracker` will be used by `DAGScheduler` in SPARK-32920 and by `ShuffleBlockFetcherIterator` in SPARK-32922

### Why are the changes needed?
Refer to SPARK-30602

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Lead-authored-by: Min Shen mshenlinkedin.com
Co-authored-by: Chandni Singh chsinghlinkedin.com
Co-authored-by: Venkata Sowrirajan vsowrirajanlinkedin.com

Closes #30480 from Victsm/SPARK-32921.

Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-04-26 00:17:26 -05:00
kyoty 2d6467d6d1 [SPARK-35087][UI] Some columns in table Aggregated Metrics by Executor of stage-detail page shows incorrectly.
### What changes were proposed in this pull request?

 columns like 'Shuffle Read Size / Records', 'Output Size/ Records' etc  in table ` Aggregated Metrics by Executor` of stage-detail page should be sorted as numerical-order instead of lexicographical-order.

### Why are the changes needed?
buf fix,the sorting style should be consistent between different columns.

The correspondence between the table and the index is shown below(it is defined in stagespage-template.html):
| index | column name                            |
| ----- | -------------------------------------- |
| 0     | Executor ID                            |
| 1     | Logs                                   |
| 2     | Address                                |
| 3     | Task Time                              |
| 4     | Total Tasks                            |
| 5     | Failed Tasks                           |
| 6     | Killed Tasks                           |
| 7     | Succeeded Tasks                        |
| 8     | Excluded                               |
| 9     | Input Size / Records                   |
| 10    | Output Size / Records                  |
| 11    | Shuffle Read Size / Records            |
| 12    | Shuffle Write Size / Records           |
| 13    | Spill (Memory)                         |
| 14    | Spill (Disk)                           |
| 15    | Peak JVM Memory OnHeap / OffHeap       |
| 16    | Peak Execution Memory OnHeap / OffHeap |
| 17    | Peak Storage Memory OnHeap / OffHeap   |
| 18    | Peak Pool Memory Direct / Mapped       |

I constructed some data to simulate the sorting results of the index columns from 9 to 18.
As shown below,it can be seen that the sorting results of columns 9-12 are wrong:

![simulate-result](https://user-images.githubusercontent.com/52202080/115120775-c9fa1580-9fe1-11eb-8514-71f29db3a5eb.png)

The reason is that the real data corresponding to columns 9-12 (note that it is not the data displayed on the page) are **all strings similar to`94685/131`(bytes/records),while the real data corresponding to columns 13-18 are all numbers,**
so the sorting corresponding to columns 13-18 loos well, but the results of columns 9-12 are incorrect because the strings are sorted according to lexicographical order.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only JS was modified, and the manual test result works well.

**before modified:**
![looks-illegal](https://user-images.githubusercontent.com/52202080/115120812-06c60c80-9fe2-11eb-9ada-fa520fe43c4e.png)

**after modified:**
![sort-result-corrent](https://user-images.githubusercontent.com/52202080/114865187-7c847980-9e24-11eb-9fbc-39ee224726d6.png)

Closes #32190 from kyoty/aggregated-metrics-by-executor-sorted-incorrectly.

Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-26 12:13:22 +09:00
Angerszhuuuu 6f782efb04 [SPARK-35220][SQL] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited
### What changes were proposed in this pull request?
DayTimeIntervalType/YearMonthIntervalString show different between Hive SerDe and row format delimited.
Create this pr to add a test and  have disscuss.

For this problem I think we have two direction:

1. leave it as current and add a item t explain this  in migration guide docs.
2. Since we should not change hive serde's behavior, so we can cast spark row format delimited's behavior to use cast  DayTimeIntervalType/YearMonthIntervalType as HIVE_STYLE

### Why are the changes needed?
Add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added ut

Closes #32335 from AngersZhuuuu/SPARK-35220.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-26 11:26:32 +09:00
Kent Yao 5b1353f690 [SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum
### What changes were proposed in this pull request?

```sql
spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1;
spark.sql.adaptive.coalescePartitions.initialPartitionNum	1
Time taken: 2.18 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	1
Time taken: 0.03 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions	200
Time taken: 0.024 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks=2;
21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	2
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions	1
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql>
```

`mapred.reduce.tasks` is mapping to `spark.sql.shuffle.partitions` at write-side, but `spark.sql.adaptive.coalescePartitions.initialPartitionNum` might take precede of `spark.sql.shuffle.partitions`

### Why are the changes needed?

roundtrip for `mapred.reduce.tasks`

### Does this PR introduce _any_ user-facing change?

yes, `mapred.reduce.tasks` will always report `spark.sql.shuffle.partitions` whether `spark.sql.adaptive.coalescePartitions.initialPartitionNum` exists or not.

### How was this patch tested?

a new test

Closes #32265 from yaooqinn/SPARK-35168.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-04-25 20:27:12 +08:00
Dongjoon Hyun b108e7fff3 [SPARK-33913][SS] Upgrade Kafka to 2.8.0
### What changes were proposed in this pull request?

This PR aims to upgrade Kafka client to 2.8.0.
Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.

### Why are the changes needed?

This will bring the latest client-side improvement and bug fixes like the following examples.

- KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
- KAFKA-10134 High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
- KAFKA-12193 Re-resolve IPs when a client is disconnected
- KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a known config
- KAFKA-9263 The new hw is added to incorrect log when  ReplicaAlterLogDirsThread is replacing log
- KAFKA-10607 Ensure the error counts contains the NONE
- KAFKA-10458 Need a way to update quota for TokenBucket registered with Sensor
- KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition for topic

**RELEASE NOTE**
- https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
- https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the existing tests because this is a dependency change.

Closes #32325 from dongjoon-hyun/SPARK-33913.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-25 16:20:22 +09:00
Ruifeng Zheng 1f150b9392 [SPARK-35024][ML] Refactor LinearSVC - support virtual centering
### What changes were proposed in this pull request?
1, remove existing agg, and use a new agg supporting virtual centering
2, add related testsuites

### Why are the changes needed?
centering vectors should accelerate convergence, and generate solution more close to R

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated testsuites and added testsuites

Closes #32124 from zhengruifeng/svc_agg_refactor.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-04-25 13:16:46 +08:00
weixiuli bcac733bf1 [SPARK-35200][CORE] Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove some unnecessary code
### What changes were proposed in this pull request?
Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager, and remove some unnecessary code.

### Why are the changes needed?

The number of the pending speculative tasks is recomputed in the ExecutorAllocationManager to calculate the maximum number of executors required.  While , it only needs to be computed once to improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #32306 from weixiuli/SPARK-35200.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-24 14:32:51 -07:00
Maya Anderson 166cc6204c [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite
### What changes were proposed in this pull request?

A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).

### Why are the changes needed?

To provide a test coverage for Parquet encryption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] [SBT / Hadoop 3.2 / Java8 (the default)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137785/testReport)
- [ ] ~SBT / Hadoop 3.2 / Java11 by adding [test-java11] to the PR title.~ (Jenkins Java11 build is broken due to missing JDK11 installation)
- [x] [SBT / Hadoop 2.7 / Java8 by adding [test-hadoop2.7] to the PR title.](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137836/testReport)
- [x] Maven / Hadoop 3.2 / Java8 by adding [test-maven] to the PR title.
- [x] Maven / Hadoop 2.7 / Java8 by adding [test-maven][test-hadoop2.7] to the PR title.

Closes #32146 from andersonm-ibm/pme_testing.

Authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-24 14:28:00 -07:00
Kousuke Saruta 44c13871d1 [SPARK-35210][BUILD] Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET issue
### What changes were proposed in this pull request?

This PR proposes to upgrade Jetty to 9.4.40.

### Why are the changes needed?

SPARK-34988 (#32091) upgraded Jetty to 9.4.39 for CVE-2021-28165.
But after the upgrade, Jetty 9.4.40 was released to fix the ERR_CONNECTION_RESET issue (https://github.com/eclipse/jetty.project/issues/6152).
This issue seems to affect Jetty 9.4.39 when POST method is used with SSL.
For Spark, job submission using REST and ThriftServer with HTTPS protocol can be affected.

### Does this PR introduce _any_ user-facing change?

No. No released version uses Jetty 9.3.39.

### How was this patch tested?

CI.

Closes #32318 from sarutak/upgrade-jetty-9.4.40.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-24 19:22:32 +09:00
Liang-Chi Hsieh b2a2b5d820 [SPARK-34297][SQL][SS] Add metrics for data loss and offset out range for KafkaMicroBatchStream
### What changes were proposed in this pull request?

This patch proposes to add a couple of metrics in scan node for Kafka batch streaming query.

### Why are the changes needed?

When testing SS, I found it is hard to track data loss of SS reading from Kafka. The micro batch scan node has only one metric, number of output rows. Users have no idea how many offsets to fetch are out of Kafka, how many times data loss happens. These metrics are important for users to know the quality of SS query running.

### Does this PR introduce _any_ user-facing change?

Yes, adding two metrics to micro batch scan node for Kafka batch streaming.

### How was this patch tested?

Currently I tested on internal cluster with Kafka:

<img width="1193" alt="Screen Shot 2021-04-22 at 7 16 29 PM" src="https://user-images.githubusercontent.com/68855/115808460-61bf8100-a39f-11eb-99a9-65d22c3f5fb0.png">

I was trying to add unit test. But as our batch streaming query disallows to specify ending offsets. If I only specify an out-of-range starting offset, when we get offset range in `getRanges`,  any negative size range will be filtered out. So it cannot actually test the case of fetched non-existing offset.

Closes #31398 from viirya/micro-batch-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-23 13:56:53 -07:00
Wenchen Fan a9345a0591 [SPARK-35204][SQL] CatalystTypeConverters of date/timestamp should accept both the old and new Java time classes
### What changes were proposed in this pull request?

`CatalystTypeConverters` is useful when the type of the input data classes are not known statically (otherwise we can use `ExpressionEncoder`). However, the current `CatalystTypeConverters` requires you to know the datetime data class statically, which makes it hard to use.

This PR improves the `CatalystTypeConverters` for date/timestamp, to support the old and new Java time classes at the same time.

### Why are the changes needed?

Make `CatalystTypeConverters` easier to use.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

new test

Closes #32312 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-23 19:21:43 +03:00
Angerszhuuuu e503b9c462 [SPARK-35201][SQL] Format empty grouping set exception in CUBE/ROLLUP
### What changes were proposed in this pull request?
Format empty grouping set exception in CUBE/ROLLUP

### Why are the changes needed?
Format empty grouping set exception in CUBE/ROLLUP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32307 from AngersZhuuuu/SPARK-35201.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-23 21:23:28 +09:00
Yingyi Bu 9af338cd68 [SPARK-35078][SQL] Add tree traversal pruning in expression rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- AND_OR
- BINARY_ARITHMETIC
- BINARY_COMPARISON
- CASE_WHEN
- CAST
- CONCAT
- COUNT
- IF
- LIKE_FAMLIY
- NOT
- NULL_CHECK
- UNARY_POSITIVE
- UPPER_OR_LOWER

Used them in the following rules:
- ConstantPropagation
- ReorderAssociativeOperator
- BooleanSimplification
- SimplifyBinaryComparison
- SimplifyCaseConversionExpressions
- SimplifyConditionals
- PushFoldableIntoBranches
- LikeSimplification
- NullPropagation
- SimplifyCasts
- RemoveDispensableExpressions
- CombineConcats

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32280 from sigmod/expression.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-23 16:33:58 +08:00
Wenchen Fan fdccd88c2a Revert "[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function"
This reverts commit c8d78a70b4.
2021-04-23 15:55:30 +08:00
Angerszhuuuu 20d68dc2f4 [SPARK-35159][SQL][DOCS] Extract hive format doc
### What changes were proposed in this pull request?
Extract common doc about hive format for `sql-ref-syntax-ddl-create-table-hiveformat.md` and `sql-ref-syntax-qry-select-transform.md` to refer.

![image](https://user-images.githubusercontent.com/46485123/115802193-04641800-a411-11eb-827d-d92544881842.png)

### Why are the changes needed?
Improve doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32264 from AngersZhuuuu/SPARK-35159.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-23 05:47:48 +00:00
hongdongdong 7582dc86bc [SPARK-35143][SQL][SHELL] Add default log level config for spark-sql
### What changes were proposed in this pull request?
Add default log config for spark-sql

### Why are the changes needed?
The default log level for spark-sql is `WARN`. How to change the log level is confusing, we need a default config.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Change config `log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=INFO` in log4j.properties, then spark-sql's default log level changed.

Closes #32248 from hddong/spark-35413.

Lead-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-23 14:26:19 +09:00
Cheng Su cab205e9e4 [SPARK-35141][SQL] Support two level of hash maps for final hash aggregation
### What changes were proposed in this pull request?

For partial hash aggregation (code-gen path), we have two level of hash map for aggregation. First level is from `RowBasedHashMapGenerator`, which is computation faster compared to the second level from `UnsafeFixedWidthAggregationMap`. The introducing of two level hash map can help improve CPU performance of query as the first level hash map normally fits in hardware cache and has cheaper hash function for key lookup.

For final hash aggregation, we can also support two level of hash map, to improve query performance further.
The original two level of hash map code works for final aggregation mostly out of box. The major change here is to support testing fall back of final aggregation (see change related to `bitMaxCapacity` and `checkFallbackForGeneratedHashMap`).

Example:

An aggregation query:

```
spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)
```

The generated code for final aggregation is [here](https://gist.github.com/c21/20c10cc8e2c7e561aafbe9b8da055242).

An aggregation query with testing fallback:
```
withSQLConf("spark.sql.TungstenAggregate.testFallbackStartsAt" -> "2, 3") {
  spark.sql(
    """
      |SELECT key, avg(value)
      |FROM agg1
      |GROUP BY key
    """.stripMargin)
}
```
The generated code for final aggregation is [here](https://gist.github.com/c21/dabf176cbc18a5e2138bc0a29e81c878). Note the no more counter condition for first level fast map.

### Why are the changes needed?

Improve the CPU performance of hash aggregation query in general.

For `AggregateBenchmark."Aggregate w multiple keys"`, seeing query performance improved by 10%.
`codegen = T` means whole stage code-gen is enabled.
`hashmap = T` means two level maps is enabled for partial aggregation.
`finalhashmap = T` means two level maps is enabled for final aggregation.

```
Running benchmark: Aggregate w multiple keys
  Running case: codegen = F
  Stopped after 2 iterations, 8284 ms
  Running case: codegen = T hashmap = F
  Stopped after 2 iterations, 5424 ms
  Running case: codegen = T hashmap = T finalhashmap = F
  Stopped after 2 iterations, 4753 ms
  Running case: codegen = T hashmap = T finalhashmap = T
  Stopped after 2 iterations, 4508 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Aggregate w multiple keys:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
codegen = F                                        3881           4142         370          5.4         185.1       1.0X
codegen = T hashmap = F                            2701           2712          16          7.8         128.8       1.4X
codegen = T hashmap = T finalhashmap = F           2363           2377          19          8.9         112.7       1.6X
codegen = T hashmap = T finalhashmap = T           2252           2254           3          9.3         107.4       1.7X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `HashAggregationQuerySuite` and `HashAggregationQueryWithControlledFallbackSuite` already cover the test.

Closes #32242 from c21/agg.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-23 05:24:19 +00:00
Chao Sun 86238d0e88 [SPARK-35195][SQL][TEST] Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
### What changes were proposed in this pull request?

Move the following classes:
- `InMemoryAtomicPartitionTable`
- `InMemoryPartitionTable`
- `InMemoryPartitionTableCatalog`
- `InMemoryTable`
- `InMemoryTableCatalog`
- `StagingInMemoryTableCatalog`

from `org.apache.spark.sql.connector` to `org.apache.spark.sql.connector.catalog`.

### Why are the changes needed?

These classes implement catalog related interfaces but reside in `org.apache.spark.sql.connector`. A more suitable place should be `org.apache.spark.sql.connector.catalog`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32302 from sunchao/SPARK-35195.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-22 21:56:45 -07:00
Yingyi Bu 47f86875f7 [SPARK-35075][SQL] Add traversal pruning for subquery related rules
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- DYNAMIC_PRUNING_SUBQUERY
- EXISTS_SUBQUERY
- IN_SUBQUERY
- LIST_SUBQUERY
- PLAN_EXPRESSION
- SCALAR_SUBQUERY
- FILTER

Used them in the following rules:
- ResolveSubquery
- UpdateOuterReferences
- OptimizeSubqueries
- RewritePredicateSubquery
- PullupCorrelatedPredicates
- RewriteCorrelatedScalarSubquery (not the rule itself but an internal transform call, the full support is in SPARK-35148)
- InsertAdaptiveSparkPlan
- PlanAdaptiveSubqueries

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32247 from sigmod/subquery.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-23 12:42:55 +08:00
Xinrong Meng 4fcbf59079 [SPARK-35040][PYTHON] Remove Spark-version related codes from test codes
### What changes were proposed in this pull request?

Removes PySpark version dependent codes from pyspark.pandas test codes.

### Why are the changes needed?

There are several places to check the PySpark version and switch the logic, but now those are not necessary.
We should remove them.

We will do the same thing after we finish porting tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-04-22 18:01:07 -07:00
Dongjoon Hyun 6ab00488d0 [SPARK-35182][K8S] Support driver-owned on-demand PVC
### What changes were proposed in this pull request?

This PR aims to support driver-owned on-demand PVC(Persistent Volume Claim)s. It means dynamically-created PVCs will have the `ownerReference` to `driver` pod instead of `executor` pod.

### Why are the changes needed?

This allows K8s backend scheduler can reuse this later.

**BEFORE**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc-exec-1
```

**AFTER**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc
```

### Does this PR introduce _any_ user-facing change?

No. (The default is `false`)

### How was this patch tested?

Manually check the above and pass K8s IT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 16 minutes, 40 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32288 from dongjoon-hyun/SPARK-35182.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-22 17:03:19 -07:00
Xinrong Meng 4d2b559d92 [SPARK-34999][PYTHON] Consolidate PySpark testing utils
### What changes were proposed in this pull request?
Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`.

### Why are the changes needed?

`python/pyspark/pandas/testing` hold test utilites for pandas-on-spark, and `python/pyspark/testing` contain test utilities for pyspark. Consolidating them makes code cleaner and easier to maintain.

Updated import statements are as shown below:
- from pyspark.testing.sqlutils import SQLTestUtils
- from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils
(PandasOnSparkTestCase is the original ReusedSQLTestCase in `python/pyspark/pandas/testing/utils.py`)

Minor improvements include:
- Usage of missing library's requirement_message
- `except ImportError` rather than `except`
- import pyspark.pandas alias as `ps` rather than `pp`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests under python/pyspark/pandas/tests.

Closes #32177 from xinrong-databricks/port.merge_utils.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-04-22 13:07:35 -07:00
Angerszhuuuu 04e2305a9b [SPARK-35187][SQL] Fix failure on the minimal interval literal
### What changes were proposed in this pull request?
If the sign '-' inside of interval string, everything is fine after bb5459fb26:
```
spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
-178956970-8
```
but the sign outside of interval string is not handled properly:
```
spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
Error in query:
Error parsing interval year-month string: integer overflow(line 1, pos 16)

== SQL ==
SELECT INTERVAL -'178956970-8' YEAR TO MONTH
----------------^^^
```
This pr fix this issue

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32296 from AngersZhuuuu/SPARK-35187.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 21:58:57 +03:00
beliefer 6c587d2627 [SPARK-35110][SQL] Handle ANSI intervals in WindowExecBase
### What changes were proposed in this pull request?
This PR makes window frame could support `YearMonthIntervalType` and `DayTimeIntervalType`.

### Why are the changes needed?
Extend the function of window frame

### Does this PR introduce _any_ user-facing change?
Yes. Users could use `YearMonthIntervalType` or `DayTimeIntervalType` as the sort expression for window frame.

### How was this patch tested?
New tests

Closes #32294 from beliefer/SPARK-35110.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 17:14:11 +03:00
Yingyi Bu 7f7a3d80a6 [SPARK-35183][SQL] Use transformAllExpressions in CombineConcats
### What changes were proposed in this pull request?

Use transformAllExpressions instead of transformExpressionsDown in CombineConcats. The latter only transforms the root plan node.

### Why are the changes needed?

It allows CombineConcats to cover more cases where `concat` are not in the root plan node.

### How was this patch tested?

Unit test. The updated tests would fail without the code change.

Closes #32290 from sigmod/concat.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:10:39 +00:00
Angerszhuuuu b22d54a58a [SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
### What changes were proposed in this pull request?
PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)

In this PR, we support it in Spark SQL too

### Why are the changes needed?
Keep consistent with PG and oracle

### Does this PR introduce _any_ user-facing change?
User can write grouping analytics like
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```

### How was this patch tested?
Added Test

Closes #32201 from AngersZhuuuu/SPARK-35026.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:08:22 +00:00
kyoty 7242d7f774 [SPARK-35127][UI] When we switch between different stage-detail pages, the entry item in the newly-opened page may be blank
### What changes were proposed in this pull request?

To make sure that pageSize shoud not be shared between different stage pages.
The screenshots of the problem are placed in the attachment of [JIRA](https://issues.apache.org/jira/browse/SPARK-35127)

### Why are the changes needed?
fix the bug.

according to reference:`https://datatables.net/reference/option/lengthMenu`
`-1` represents display all rows, but now we use `totalTasksToShow`, it will cause the select item show as empty when we swich between different stage-detail pages.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual test, it is a small io problem, and the modification does not affect the function, but just an adjustment of js configuration

the gif below shows how the problem can be reproduced:
![reproduce](https://user-images.githubusercontent.com/52202080/115204351-f7060f80-a12a-11eb-8900-a009ad0c8870.gif)

![微信截图_20210419162849](https://user-images.githubusercontent.com/52202080/115205675-629cac80-a12c-11eb-9cb8-1939c7450e99.png)

the gif below shows the result after modified:

![after_modified](https://user-images.githubusercontent.com/52202080/115204886-91fee980-a12b-11eb-9ccb-d5900a99095d.gif)

Closes #32223 from kyoty/stages-task-empty-pagesize.

Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-22 21:00:04 +09:00
Kousuke Saruta c0972dec1d [SPARK-35180][BUILD] Allow to build SparkR with SBT
### What changes were proposed in this pull request?

This PR proposes a change that allows us to build SparkR with SBT.

### Why are the changes needed?

In the current master, SparkR can be built only with Maven.
It's helpful if we can built it with SBT.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
```
build/sbt -Psparkr package
```

Closes #32285 from sarutak/sbt-sparkr.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-22 20:56:33 +09:00
Angerszhuuuu bb5459fb26 [SPARK-35177][SQL] Fix arithmetic overflow in parsing the minimal interval by IntervalUtils.fromYearMonthString
### What changes were proposed in this pull request?
IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly.
In current logic, just use `Math.addExact(Math.multiplyExact(years, 12), months)` to calculate  negative total months will overflow when actual total months is Int.MinValue, this pr fixes this bug.

### Why are the changes needed?
IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32281 from AngersZhuuuu/SPARK-35177.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-22 11:59:38 +03:00
skotlov b17a0e6931 [SPARK-34674][CORE][K8S] Close SparkContext after the Main method has finished
### What changes were proposed in this pull request?
Close SparkContext after the Main method has finished, to allow SparkApplication on K8S to complete.
This is fixed version of [merged and reverted PR](https://github.com/apache/spark/pull/32081).

### Why are the changes needed?
if I don't call the method sparkContext.stop() explicitly, then a Spark driver process doesn't terminate even after its Main method has been completed. This behaviour is different from spark on yarn, where the manual sparkContext stopping is not required. It looks like, the problem is in using non-daemon threads, which prevent the driver jvm process from terminating.
So I have inserted code that closes sparkContext automatically.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually on the production AWS EKS environment in my company.

Closes #32283 from kotlovs/close-spark-context-on-exit-2.

Authored-by: skotlov <skotlov@joom.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-21 22:54:16 -07:00
Liang-Chi Hsieh 548e66c98a [SPARK-34692][SQL][FOLLOWUP] Add INSET to ReplaceNullWithFalseInPredicate's pattern
### What changes were proposed in this pull request?

The test added by #31797 has the [failure](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137741/testReport/org.apache.spark.sql.catalyst.optimizer/ReplaceNullWithFalseInPredicateSuite/SPARK_34692__Support_Not_Int__and_Not_InSet__propagate_null/). This is a followup to fix it.

### Why are the changes needed?

Due to #32157, the rule `ReplaceNullWithFalseInPredicate` will check tree pattern before actually doing transformation. As `null` in `INSET` is not `NULL_LITERAL` pattern, we miss it and fail the newly added `not inset ...` check in `replaceNullWithFalse`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #32278 from viirya/SPARK-34692-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-21 19:06:16 -07:00
Sean Owen 6860efe468 [SPARK-35178][BUILD] Use new Apache 'closer.lua' syntax to obtain Maven
### What changes were proposed in this pull request?

Use new Apache 'closer.lua' syntax to obtain Maven

### Why are the changes needed?

The current closer.lua redirector, which redirects to download Maven from a local mirror, has a new syntax. build/mvn does not work properly otherwise now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual testing.

Closes #32277 from srowen/SPARK-35178.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-21 18:49:19 -07:00
Yuming Wang e609395913 [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
### What changes were proposed in this pull request?

It will remove `StructField` when [pruning nested columns](0f2c0b53e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala (L28-L42)). For example:
```scala
spark.sql(
  """
    |CREATE TABLE t1 (
    |  _col0 INT,
    |  _col1 STRING,
    |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
    |USING ORC
    |""".stripMargin)

spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))")

spark.sql("SELECT _col0, _col2.c1 FROM t1").show
```

Before this pr. The returned schema is: ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` add it will throw exception:
```
java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
```

After this pr. The returned schema is: ``` `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING> ```.

The finally schema is ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` after the complete column pruning:
7a5647a93a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (L208-L213)

e64eb75aed/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala (L96-L97)

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31993 from wangyum/SPARK-34897.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-04-21 10:23:38 -07:00
ulysses-you 81dbaedede [SPARK-34692][SQL] Support Not(Int) and Not(InSet) propagate null in predicate
### What changes were proposed in this pull request?

* Add `Not(In)` and `Not(InSet)` check in `NullPropagation` rule.
* Add more test for `In` and `Not(In)` in `Project` level.

### Why are the changes needed?

The semantics of `Not(In)` could be seen like `And(a != b, a != c)` that match the `NullIntolerant`.

As we already simplify the `NullIntolerant` expression to null if it's children have null. E.g. `a != null` => `null`. It's safe to do this with `Not(In)`/`Not(InSet)`.

Note that, we can only do the simplify in predicate which `ReplaceNullWithFalseInPredicate`  rule do.

Let's say we have two sqls:
```
select 1 not in (2, null);
select 1 where 1 not in (2, null);
```
The first sql we cannot optimize since it would return `NULL` instead of `false`. The second is postive.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31797 from ulysses-you/SPARK-34692.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-21 15:37:02 +00:00
Karen Feng 355c39939d [SPARK-35140][INFRA] Add error message guidelines to PR template
### What changes were proposed in this pull request?

Adds a link to the [error message guidelines](https://spark.apache.org/error-message-guidelines.html) to the PR template to increase visibility.

### Why are the changes needed?

Increases visibility of the error message guidelines, which are otherwise hidden in the [Contributing guidelines](https://spark.apache.org/contributing.html).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #32241 from karenfeng/spark-35140.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 21:34:49 +09:00
Yuanjian Li 8e9e70045b [SPARK-35171][R] Declare the markdown package as a dependency of the SparkR package
### What changes were proposed in this pull request?
Declare the markdown package as a dependency of the SparkR package

### Why are the changes needed?
If we didn't install pandoc locally, running make-distribution.sh will fail with the following message:
```
— re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
Warning in engine$weave(file, quiet = quiet, encoding = enc) :
Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
The 'markdown' package should be declared as a dependency of the 'SparkR' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see https://github.com/yihui/knitr/issues/1864 for more information.
— failed re-building ‘sparkr-vignettes.Rmd’
```

### Does this PR introduce _any_ user-facing change?
Yes. Workaround for R packaging.

### How was this patch tested?
Manually test. After the fix, the command `sh dev/make-distribution.sh -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` in the environment without pandoc will pass.

Closes #32270 from xuanyuanking/SPARK-35171.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 20:43:47 +09:00
harupy b6350f5bb0 [SPARK-35142][PYTHON][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel
### What changes were proposed in this pull request?

Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

### Why are the changes needed?
Bugfix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #32245 from harupy/SPARK-35142.

Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-04-21 16:29:10 +08:00
Gengliang Wang 43ad939a7e [SPARK-35152][SQL] ANSI mode: IntegralDivide throws exception on overflow
### What changes were proposed in this pull request?

IntegralDivide should throw an exception on overflow in ANSI mode.
There is only one case that can cause that:
```
Long.MinValue div -1
```

### Why are the changes needed?

ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, IntegralDivide throws an exception on overflow in ANSI mode

### How was this patch tested?

Unit test

Closes #32260 from gengliangwang/integralDiv.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-21 16:08:20 +08:00
sandeep.katta 4f309cec07 [SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config
### What changes were proposed in this pull request?

As a part of the SPARK-26837 pruning of nested fields from object serializers are supported. But it is missed to handle case insensitivity nature of spark

In this PR I have resolved the column names to be pruned based on `spark.sql.caseSensitive ` config
**Exception Before Fix**

```
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
  at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
  at
```

### Why are the changes needed?
After Upgrade to Spark 3 `foreachBatch` API throws` java.lang.ArrayIndexOutOfBoundsException`. This issue will be fixed using this PR

### Does this PR introduce _any_ user-facing change?
No, Infact fixes the regression

### How was this patch tested?
Added tests and also tested verified manually

Closes #32194 from sandeep-katta/SPARK-35096.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-21 15:16:17 +08:00
HyukjinKwon 97ec57e667 [SPARK-35120][INFRA][FOLLOW-UP] Try catch an error to show the correct guidance
### What changes were proposed in this pull request?

This PR proposes to handle 404 not found, see https://github.com/apache/spark/pull/32255/checks?check_run_id=2390446579 as an example.

If a fork does not have any previous workflow runs, it seems throwing 404 error instead of empty runs.

### Why are the changes needed?

To show the correct guidance to contributors.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at https://github.com/HyukjinKwon/spark/pull/48. See https://github.com/HyukjinKwon/spark/runs/2391469416 as an example.

Closes #32258 from HyukjinKwon/SPARK-35120-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-21 13:32:04 +08:00
Angerszhuuuu d259f9301a [SPARK-35113][SQL] Support ANSI intervals in the Hash expression
### What changes were proposed in this pull request?
Support ANSI interval in HashExpression and add UT

### Why are the changes needed?
Support ANSI interval in HashExpression

### Does this PR introduce _any_ user-facing change?
User can pass ANSI interval in HashExpression function

### How was this patch tested?
Added UT

Closes #32259 from AngersZhuuuu/SPARK-35113.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-21 07:44:44 +03:00
Kent Yao 81c3cc2312 [SPARK-35044][SQL][FOLLOWUP][TEST-HADOOP2.7] Fix hadoop 2.7 test due to diff between hadoop 2.7 and hadoop 3
### What changes were proposed in this pull request?

dfs.replication is inconsistent from hadoop 2.x to 3.x, so in this PR we use `dfs.hosts` to verify per https://github.com/apache/spark/pull/32144#discussion_r616833099

```
== Results ==
!== Correct Answer - 1 ==        == Spark Answer - 1 ==
!struct<>                        struct<key:string,value:string>
![dfs.replication,<undefined>]   [dfs.replication,3]
```

### Why are the changes needed?

fix Jenkins job with Hadoop 2.7

### Does this PR introduce _any_ user-facing change?

test only change
### How was this patch tested?

test only change

Closes #32263 from yaooqinn/SPARK-35044-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 09:10:05 +09:00
yangjie01 c7e18ad223 [SPARK-35132][BUILD][CORE] Upgrade netty-all to 4.1.63.Final
### What changes were proposed in this pull request?
There are 3 CVE problems were found after netty 4.1.51.Final as follows:

- [CVE-2021-21409](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21409)
- [CVE-2021-21295](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21295)
- [CVE-2021-21290](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21290)

So the main change of this pr is upgrade netty-all to 4.1.63.Final avoid these potential risks.

Another change is to clean up deprecated api usage: [Tiny caches have been merged into small caches](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L447-L455)(after [netty#10267](https://github.com/netty/netty/pull/10267)) and [should use  PooledByteBufAllocator(boolean, int, int, int, int, int, int, boolean, int)](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L227-L239) api to create `PooledByteBufAllocator`.

### Why are the changes needed?
Upgrade netty-all to 4.1.63.Final avoid CVE problems.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32227 from LuciferYang/SPARK-35132.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-20 18:28:43 -05:00
Max Gekk e8d6992b66 [SPARK-35153][SQL] Make textual representation of ANSI interval operators more readable
### What changes were proposed in this pull request?
In the PR, I propose to override the `sql` and `toString` methods of the expressions that implement operators over ANSI intervals (`YearMonthIntervalType`/`DayTimeIntervalType`), and replace internal expression class names by operators like `*`, `/` and `-`.

### Why are the changes needed?
Proposed methods should make the textual representation of such operators more readable, and potentially parsable by Spark SQL parser.

### Does this PR introduce _any_ user-facing change?
Yes. This can influence on column names.

### How was this patch tested?
By running existing test suites for interval and datetime expressions, and re-generating the `*.sql` tests:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
```

Closes #32262 from MaxGekk/interval-operator-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-20 23:13:40 +03:00