30645 commits

Author SHA1 Message Date
itholic 2537fe8cba [SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame
### What changes were proposed in this pull request?

Currently, nested structs are always inferred as `MapType`.

This causes a problem because the value type of the map is inferred from the first field of the struct, as shown below:

```python
data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show(truncate=False)
+--------------------------------+
|inside_struct                   |
+--------------------------------+
|{name -> null, payment -> 100.5}|
+--------------------------------+
```

The "name" became `null`, but it should've been `"Lee"`.

In this case, we need to be able to infer the schema with a `StructType` instead of a `MapType`.

Therefore, this PR proposes adding a new configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` to control which type is used when inferring nested structs.
- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `false` (the default), nested structs are inferred as `MapType`
- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `true`, nested structs are inferred as `StructType`
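
The difference between the two modes can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the actual PySpark inference code: the helper names (`infer_type`, `infer_as_map`, `infer_as_struct`, `infer_nested`) and the string form of the types are hypothetical.

```python
from typing import Any, Dict


def infer_type(value: Any) -> str:
    """Return a simplified type name for a scalar Python value (hypothetical helper)."""
    return {float: "double", int: "bigint", str: "string"}[type(value)]


def infer_as_map(d: Dict[str, Any]) -> str:
    # Map-style inference: a single value type for all entries,
    # taken from the first value only.
    first_value = next(iter(d.values()))
    return f"map<string,{infer_type(first_value)}>"


def infer_as_struct(d: Dict[str, Any]) -> str:
    # Struct-style inference: every key keeps its own type.
    fields = ",".join(f"{key}:{infer_type(value)}" for key, value in d.items())
    return f"struct<{fields}>"


def infer_nested(d: Dict[str, Any], infer_nested_dict_as_struct: bool = False) -> str:
    # The flag stands in for spark.sql.pyspark.inferNestedDictAsStruct.enabled.
    return infer_as_struct(d) if infer_nested_dict_as_struct else infer_as_map(d)


nested = {"payment": 100.5, "name": "Lee"}
print(infer_nested(nested))        # map<string,double>
print(infer_nested(nested, True))  # struct<payment:double,name:string>
```

With the map-style result, `"Lee"` does not fit the `double` value type, which is consistent with the `null` shown in the output above.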

### Why are the changes needed?

Because always inferring nested structs as `MapType` doesn't work properly in some cases.

### Does this PR introduce _any_ user-facing change?

New configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is added.

### How was this patch tested?

Added a unit test.

Closes #33214 from itholic/SPARK-35929.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 15:14:18 +09:00
Yuming Wang ddc5cb9051 [SPARK-35906][SQL][FOLLOWUP] Recursive remove sort if the maximum number of rows less than or equal to 1
### What changes were proposed in this pull request?

Recursively remove sorts if the maximum number of rows is less than or equal to 1. For example:
```sql
select a from (select a from values(0, 1) t(a, b) order by a) order by a
```
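
As a rough sketch of the idea (a toy plan model, not Catalyst's actual rule; all class and function names here are hypothetical), the removal can be written as a bottom-up recursion so that nested sorts are removed in a single pass and a second pass changes nothing:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Plan:
    child: Optional["Plan"] = None

    def max_rows(self) -> Optional[int]:
        # By default, a node produces at most as many rows as its child.
        return self.child.max_rows() if self.child is not None else None


@dataclass
class Values(Plan):
    rows: int = 0

    def max_rows(self) -> Optional[int]:
        return self.rows


@dataclass
class Project(Plan):
    pass


@dataclass
class Sort(Plan):
    pass


def remove_trivial_sorts(plan: Optional[Plan]) -> Optional[Plan]:
    """Drop every Sort whose input has at most one row, at any depth."""
    if plan is None:
        return None
    child = remove_trivial_sorts(plan.child)  # rewrite the subtree first
    if isinstance(plan, Sort) and child is not None:
        rows = child.max_rows()
        if rows is not None and rows <= 1:
            return child  # sorting at most one row is a no-op
    return replace(plan, child=child)


# Shape of the query above: Sort(Project(Sort(Project(Values)))) with one row.
query = Sort(Project(Sort(Project(Values(rows=1)))))
once = remove_trivial_sorts(query)
print(once)
print(remove_trivial_sorts(once) == once)  # True: a second pass is a no-op
```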

### Why are the changes needed?

Fix the issue that the idempotence of the `Once` strategy is broken for the batch Eliminate Sorts.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33240 from wangyum/SPARK-35906-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 14:27:00 +09:00
Hyukjin Kwon 16c195ccfb [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to bump up the mypy version to 0.910 which is the latest.

### Why are the changes needed?

To catch the type hint mistakes better in PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GitHub Actions should test it out.

Closes #33223 from HyukjinKwon/SPARK-35684.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:26:28 +09:00
allisonwang-db ca348e50a4 [SPARK-36028][SQL] Allow Project to host outer references in scalar subqueries
### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note that currently, by default, all correlated subqueries are decorrelated, which is not necessarily the optimal approach. Consider `SELECT (SELECT c1) FROM t`. This should be optimized as `SELECT c1 FROM t` instead of being rewritten as a left outer join. This will be done in a separate PR to optimize correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

Closes #33235 from allisonwang-db/spark-36028-outer-in-project.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-07 04:25:54 +00:00
attilapiros bad6f89ae2 [SPARK-36026][BUILD][K8S] Upgrade kubernetes-client to 5.5.0
### What changes were proposed in this pull request?

Upgrading the kubernetes-client to 5.5.0

### Why are the changes needed?

There are [several bugfixes](https://github.com/fabric8io/kubernetes-client/releases/tag/v5.5.0) but the main reason is that version 5.5.0 contains [Support HTTP operation retry with exponential backoff (for status code >= 500)](https://github.com/fabric8io/kubernetes-client/issues/3087).
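
For readers unfamiliar with the technique, retry with exponential backoff on status codes >= 500 can be sketched generically as follows. This is not the kubernetes-client implementation; the function name, parameters, and defaults are assumptions made for illustration only.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    request: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` and retry on status >= 500, doubling the delay each time."""
    attempt = 0
    while True:
        status, body = request()
        if status < 500 or attempt >= max_retries:
            return status, body
        sleep(base_delay * (2 ** attempt))  # base, 2*base, 4*base, ...
        attempt += 1


# Usage with a fake endpoint that fails twice before succeeding.
responses = iter([(503, "unavailable"), (500, "error"), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting.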

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the integration tests including `persistentVolume` tests:

```
./resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \
    --spark-tgz $TARBALL_TO_TEST --hadoop-profile $HADOOP_PROFILE --exclude-tags r --include-tags persistentVolume
...
[INFO] --- scalatest-maven-plugin:2.0.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 413 milliseconds.
Run starting. Expected test count is: 26
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
Run completed in 18 minutes, 34 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Checked the compatibility matrix: the same k8s versions are supported as were supported by version 5.4.1.

Closes #33233 from attilapiros/SPARK-36026.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:02:37 +09:00
Linhong Liu 7566db6033 [SPARK-35984][SQL][TEST] Config to force applying shuffled hash join
### What changes were proposed in this pull request?
Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join
during the join selection.

### Why are the changes needed?
In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even
if the `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed.
Thus, we need another config to force the selection.

### Does this PR introduce _any_ user-facing change?
No, only for testing

### How was this patch tested?
newly added tests
Verified all queries in join.sql will use `ShuffledHashJoin` when the config is set to `true`

Closes #33182 from linhongliu-db/SPARK-35984-hash-join-config.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-07 00:58:14 +08:00
Wenchen Fan 6b3ab8262f [SPARK-36020][SQL] Check logical link in remove redundant projects
### What changes were proposed in this pull request?

The `RemoveRedundantProjects` feature can sometimes conflict with the AQE broadcast threshold ([PR](https://github.com/apache/spark/pull/32391)). After removing the project, the physical plan to logical plan link can be changed and we may have a `Project` above `LogicalQueryStage`. This breaks the AQE broadcast threshold, because the stats of `Project` do not have the `isRuntime = true` flag, and thus the normal broadcast threshold is still used.

This PR updates `RemoveRedundantProjects` to not remove `ProjectExec` that has a different logical plan link than its child.

### Why are the changes needed?

Make AQE broadcast threshold work in more cases.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #33222 from cloud-fan/aqe2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-06 21:17:33 +08:00
Matthew Jones 2fff060b6d [SPARK-36009][GRAPHX] Add missing GraphX classes to registerKryoClasses util method
### What changes were proposed in this pull request?
Add `ShippableVertexPartition` and `RoutingTablePartition` to the classes registered with Kryo in `GraphXUtils.registerKryoClasses`.

### Why are the changes needed?
`VertexRDDImpl` uses an `RDD[ShippableVertexPartition[VD]]`; however, `GraphXUtils.registerKryoClasses` does not register `ShippableVertexPartition`. This means that when running with `spark.kryo.registrationRequired` set to `true`, we get a "Class is not registered" exception. This is an issue as it prevents other unregistered classes from being discovered using `spark.kryo.registrationRequired`, because the first unregistered class found halts the whole job. Registering the classes also potentially decreases the serialised size of the RDD when using Kryo.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested manually by running with `spark.kryo.registrationRequired` set to `true` and verifying no "Class is not registered" exception was thrown.

Closes #32973 from matthewrj/bug/register-graphx-classes.

Lead-authored-by: Matthew Jones <matthew@vitaler.com>
Co-authored-by: Matthew Jones <mlety2@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-06 07:25:22 -05:00
Gengliang Wang d5d1222686 [SPARK-36025][SQL][TESTS] Reduce the run time of DateExpressionsSuite
### What changes were proposed in this pull request?

Some of the test cases in `DateExpressionsSuite` are quite slow:

- `Hour`: 24s
- `Minute`: 26s
- `Day / DayOfMonth`: 8s
- `Year`: 4s

Each test case has a large loop. We should improve them.

### Why are the changes needed?

Save test running time

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified the run times on local:
- `Hour`: 2s
- `Minute`: 3.2s
- `Day / DayOfMonth`: 0.5s
- `Year`: 2s

Total reduced time: 54.3s

Closes #33229 from gengliangwang/improveTest.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-06 20:17:02 +08:00
Dongjoon Hyun d8e91eb2f6 [SPARK-36004][INFRA] Update MiMa and audit API changes
### What changes were proposed in this pull request?

This PR aims to update MiMa based on Apache Spark 3.1.1 (the first release on 3.1 line) for Apache Spark 3.2.0 release.

### Why are the changes needed?

Old MiMa rules hide the breaking changes in Apache Spark 3.2.0. We need to audit and document them correctly in the MiMa exclusion file. This issue was originally discussed here:
- https://github.com/apache/spark/pull/33196#issuecomment-873249068

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CIs

Closes #33199 from dongjoon-hyun/SPARK-36004.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-07-06 07:16:07 -05:00
Angerszhuuuu 26d1bb16bc [SPARK-36023][SPARK-35735][SPARK-35768][SQL] Refactor code about parse string to DT/YM
### What changes were proposed in this pull request?
Refactor the code that parses strings to DT/YM intervals.

### Why are the changes needed?
Extracting the common code for parsing strings to DT/YM intervals should make the code easier to maintain.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #33217 from AngersZhuuuu/SPARK-35735-35768.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-06 13:51:06 +03:00
Kousuke Saruta def8bc5c96 [SPARK-35999][SQL] Make from_csv/to_csv to handle day-time intervals properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_csv/to_csv` don't handle day-time intervals properly.
`from_csv` throws an exception if day-time interval types are given.
```
spark-sql> select from_csv("interval '1 2:3:4' day to second", "a interval day to second");
21/07/03 04:39:13 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1 2:3:4' day to second", "a interval day to second")]
java.lang.Exception: Unsupported type: interval day to second
 at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
 at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
 at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```

Also, `to_csv` doesn't handle day-time interval types properly, though no exception is thrown.
The result of `to_csv` for day-time interval types is not in ANSI interval compliant form.

```
spark-sql> select to_csv(named_struct("a", interval '1 2:3:4' day to second));
93784000000
```
The result above should be `INTERVAL '1 02:03:04' DAY TO SECOND`.
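
The relationship between the raw value and the expected form can be checked with a small sketch. It assumes the raw number is the interval expressed in microseconds; the function name is hypothetical and this is not Spark's formatter.

```python
def format_day_time_interval(micros: int) -> str:
    """Render a microsecond count as an ANSI-style DAY TO SECOND interval string."""
    sign = "-" if micros < 0 else ""
    total_seconds, fraction = divmod(abs(micros), 1_000_000)
    days, rest = divmod(total_seconds, 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, seconds = divmod(rest, 60)
    # Append fractional seconds only when present, without trailing zeros.
    sec = f"{seconds:02d}" + (f".{fraction:06d}".rstrip("0") if fraction else "")
    return f"INTERVAL '{sign}{days} {hours:02d}:{minutes:02d}:{sec}' DAY TO SECOND"


print(format_day_time_interval(93784000000))  # INTERVAL '1 02:03:04' DAY TO SECOND
```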

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33226 from sarutak/csv-dtinterval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-06 17:37:38 +08:00
Kousuke Saruta c8ff613c3c [SPARK-35983][SQL] Allow from_json/to_json for map types where value types are day-time intervals
### What changes were proposed in this pull request?

This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `day-time` interval types like:
```
spark-sql> select to_json(map('a', interval '1 2:3:4' day to second));
21/07/06 14:53:58 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1 2:3:4' day to second))]
java.lang.RuntimeException: Failed to convert value 93784000000 (class of class java.lang.Long) with the type of DayTimeIntervalType(0,3) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support parsing `day-time` interval strings from JSON. So the result of the following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1 2:3:4' day to second)), 'a interval day to second');
{"a":null}
```

### Why are the changes needed?

There should be no reason why day-time intervals cannot be used as map value types.
`CalendarIntervalTypes` can do it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33225 from sarutak/json-dtinterval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-06 11:06:56 +03:00
Angerszhuuuu 87282f04bf [SPARK-35972][SQL] When replace ExtractValue in NestedColumnAliasing we should use semanticEquals
### What changes were proposed in this pull request?
Ideally, in a SQL query, nested columns should resolve to `GetStructField` with a non-None name. But there are places that can create `GetStructField` with a None name, such as `UnresolvedStar.expand`, the Dataset encoders, etc.
The current `nestedFieldToAlias` cannot match such expressions, which causes the job to fail.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Closes #33183 from AngersZhuuuu/SPARK-35972.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-06 00:09:34 -07:00
RoryQi e0c6b2e965 [SPARK-36011][SQL] Disallow altering permanent views based on temporary views or UDFs
### What changes were proposed in this pull request?
PR #15764 disabled creating permanent views based on temporary views or UDFs.  But AlterViewCommand didn't block temporary objects.

### Why are the changes needed?
More robust view canonicalization.

### Does this PR introduce _any_ user-facing change?
Yes, now if you alter a permanent view based on temporary views or UDFs, the operation will fail.

### How was this patch tested?
Add new unit tests.

Closes #33204 from jerqi/alter_view.

Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-06 14:56:12 +08:00
Gengliang Wang b0b9643cd7 [SPARK-35978][SQL] Support non-reserved keyword TIMESTAMP_LTZ
### What changes were proposed in this pull request?

Support new keyword `TIMESTAMP_LTZ`, which can be used for:

- timestamp with local time zone data type in DDL
- timestamp with local time zone data type in Cast clause.
- timestamp with local time zone data type literal

### Why are the changes needed?

Users can use `TIMESTAMP_LTZ` in DDL/Cast/Literals for the timestamp with local time zone type directly. The new keyword is independent of the SQL configuration `spark.sql.timestampType`.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #33224 from gengliangwang/TIMESTAMP_LTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-06 14:33:22 +08:00
Yuanjian Li 9544277b0a [SPARK-35788][SS] Metrics support for RocksDB instance
### What changes were proposed in this pull request?
Add more metrics for the RocksDB instance. We transform the native states from RocksDB.

### Why are the changes needed?
Improve the usability with more metrics for RocksDB instance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #32934 from xuanyuanking/SPARK-35788.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-06 11:12:21 +09:00
Wenchen Fan 8b46e26fc6 [SPARK-34302][SQL][FOLLOWUP] More code cleanup
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33113, to do some code cleanup:
1. `UnresolvedFieldPosition` doesn't need to include the field name. We can get it through "context" (`AlterTableAlterColumn.column.name`).
2. Run `ResolveAlterTableCommands` in the main resolution batch, so that the column/field resolution is also unified between v1 and v2 commands (same error message).
3. Fail immediately in `ResolveAlterTableCommands` if we can't resolve the field, instead of waiting until `CheckAnalysis`. We don't expect other rules to resolve fields in ALTER  TABLE commands, so failing immediately is simpler and we can remove duplicated code in `CheckAnalysis`.

### Why are the changes needed?

code simplification.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33213 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-06 03:43:42 +08:00
Gengliang Wang 5f44acff3d [SPARK-35977][SQL] Support non-reserved keyword TIMESTAMP_NTZ
### What changes were proposed in this pull request?

Support new keyword TIMESTAMP_NTZ, which can be used for:

- timestamp without time zone data type in DDL
- timestamp without time zone data type in Cast clause.
- timestamp without time zone data type literal

### Why are the changes needed?

Users can use `TIMESTAMP_NTZ` in DDL/Cast/Literals for the timestamp without time zone type directly.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #33221 from gengliangwang/timstamp_ntz.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-05 22:30:44 +03:00
Gengliang Wang 2fffec7de8 [SPARK-35979][SQL] Return different timestamp literals based on the default timestamp type
### What changes were proposed in this pull request?

For the timestamp literal, it should have the following behavior.
1. When `spark.sql.timestampType` is TIMESTAMP_NTZ: if there is no time zone part, return timestamp without time zone literal; otherwise, return timestamp with local time zone literal

2. When `spark.sql.timestampType` is TIMESTAMP_LTZ: return timestamp with local time zone literal
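
The two rules above can be sketched as a small decision function. This is an illustration only: the zone detection is a simplified regular expression chosen for this sketch (it recognizes `Z`, `UTC`, and numeric offsets after a time part), not Spark's parser, and the function names are hypothetical.

```python
import re

# Simplified assumption: a zone is a trailing "Z", "UTC", or "+hh[:mm]" / "-hh[:mm]"
# that directly follows a time-of-day part.
_ZONE_SUFFIX = re.compile(
    r"\d{2}:\d{2}(:\d{2}(\.\d+)?)?\s*(Z|UTC|[+-]\d{1,2}(:\d{2})?)$"
)


def has_time_zone(text: str) -> bool:
    return _ZONE_SUFFIX.search(text.strip()) is not None


def timestamp_literal_type(text: str, default_type: str) -> str:
    """Pick the literal type from the default timestamp type and the literal text."""
    if default_type == "TIMESTAMP_NTZ" and not has_time_zone(text):
        return "TIMESTAMP_NTZ"
    # Either the default is TIMESTAMP_LTZ, or a time zone part is present.
    return "TIMESTAMP_LTZ"


for literal in ("2021-07-05 10:00:00", "2021-07-05 10:00:00+08:00"):
    for default in ("TIMESTAMP_NTZ", "TIMESTAMP_LTZ"):
        print(default, repr(literal), "->", timestamp_literal_type(literal, default))
```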

### Why are the changes needed?

When the default timestamp type is TIMESTAMP_NTZ, the timestamp literal should be of type TIMESTAMP_NTZ when there is no time zone part in the string.

From section 5.3 "literal" of ANSI SQL standard 2011:
```
27) The declared type of a <timestamp literal> that does not specify <time zone interval> is TIMESTAMP(P) WITHOUT TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise. The declared type of a <timestamp literal> that specifies <time zone interval> is TIMESTAMP(P) WITH TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise.
```
Since we don't have "timestamp with time zone", we use timestamp with local time zone instead.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type and the default timestamp configuration are not released yet.

### How was this patch tested?

Unit test

Closes #33215 from gengliangwang/tsLiteral.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-06 00:54:58 +08:00
gengjiaan c605ba2d46 [SPARK-35664][SQL][FOLLOWUP] Fix incorrect comment for TimestampNTZType
### What changes were proposed in this pull request?
This PR fixes the incorrect comment for `TimestampNTZType`.

### Why are the changes needed?
Fix the incorrect comment

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
No need.

Closes #33218 from beliefer/SPARK-35664-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-05 18:48:00 +08:00
Kousuke Saruta f4237aff7e [SPARK-35998][SQL] Make from_csv/to_csv to handle year-month intervals properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_csv/to_csv` don't handle year-month intervals properly.
`from_csv` throws an exception if year-month interval types are given.
```
spark-sql> select from_csv("interval '1-2' year to month", "a interval year to month");
21/07/03 04:32:24 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1-2' year to month", "a interval year to month")]
java.lang.Exception: Unsupported type: interval year to month
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```

Also, `to_csv` doesn't handle year-month interval types properly, though no exception is thrown.
The result of `to_csv` for year-month interval types is not in ANSI interval compliant form.

```
spark-sql> select to_csv(named_struct("a", interval '1-2' year to month));
14
```
The result above should be `INTERVAL '1-2' YEAR TO MONTH`.
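
As a quick check of the mapping, assuming the raw number is the interval expressed as a total number of months (the function name is hypothetical and this is not Spark's formatter):

```python
def format_year_month_interval(months: int) -> str:
    """Render a month count as an ANSI-style YEAR TO MONTH interval string."""
    sign = "-" if months < 0 else ""
    years, remaining_months = divmod(abs(months), 12)
    return f"INTERVAL '{sign}{years}-{remaining_months}' YEAR TO MONTH"


print(format_year_month_interval(14))  # INTERVAL '1-2' YEAR TO MONTH
```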

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33210 from sarutak/csv-yminterval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-05 13:10:50 +03:00
Hyukjin Kwon 2fe6c94544 [SPARK-33996][BUILD][FOLLOW-UP] Match SBT's plugin checkstyle version to Maven's
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/31019, which did not update SBT's checkstyle plugin version to match.

### Why are the changes needed?

To use the same version in both Maven and SBT.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI should test them.

Closes #33207 from HyukjinKwon/SPARK-33996.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-05 18:55:45 +09:00
ulysses-you 7fe4c4a9ad [SPARK-35989][SQL] Only remove redundant shuffle if shuffle origin is REPARTITION_BY_COL in AQE
### What changes were proposed in this pull request?

Skip removing a shuffle if its shuffle origin is not `REPARTITION_BY_COL` in AQE.

### Why are the changes needed?

`REPARTITION_BY_COL` doesn't guarantee the output partitioning number so we can remove it safely in AQE.

For `REPARTITION_BY_NUM`, we should retain the shuffle, whose partition number is specified by the user.
For `REBALANCE_PARTITIONS_BY_COL`, it is a special shuffle used to rebalance partitions so we should not remove it.
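
The decision described above amounts to a simple predicate on the shuffle origin. A minimal sketch, with a hypothetical enum and function name that only cover the three origins mentioned here:

```python
from enum import Enum, auto


class ShuffleOrigin(Enum):
    REPARTITION_BY_COL = auto()
    REPARTITION_BY_NUM = auto()
    REBALANCE_PARTITIONS_BY_COL = auto()


def can_remove_redundant_shuffle(origin: ShuffleOrigin) -> bool:
    # Only a column-based repartition makes no promise about the partition number,
    # so it is the only origin that is safe to drop.
    return origin is ShuffleOrigin.REPARTITION_BY_COL


for origin in ShuffleOrigin:
    print(origin.name, can_remove_redundant_shuffle(origin))
```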

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #33188 from ulysses-you/SPARK-35989.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-05 17:10:42 +08:00
Cheng Su 044dddf288 [SPARK-35794][SQL] Allow custom plugin for AQE cost evaluator
### What changes were proposed in this pull request?

Currently, AQE has a cost evaluator to decide whether to use the new plan after replanning. The evaluator currently used is `SimpleCostEvaluator`, which makes the decision based on the number of shuffles in the query plan. This is not a perfect cost evaluator, and different production environments might want to use different custom evaluators. E.g., sometimes we might still want to do a skew join even though it might introduce an extra shuffle (trading resources for better latency), and sometimes we might want to take sorts into consideration for the cost as well. Taking our own setting as an example, we are using a custom remote shuffle service (Cosco), and the cost model is more complicated. So we want to make the cost evaluator pluggable, so that developers can implement their own `CostEvaluator` subclass and plug it in dynamically based on configuration.

The approach is to introduce a new config, `spark.sql.adaptive.customCostEvaluatorClass`, that specifies the name of a `CostEvaluator` subclass, and to add `CostEvaluator.instantiate` to instantiate the cost evaluator class in `AdaptiveSparkPlanExec.costEvaluator`.
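
The plug-in pattern can be sketched in Python as follows. This is a toy model, not Spark's Scala code: the plan is a plain list of operator names, the method name `evaluate_cost`, the `SortAwareCostEvaluator` class, and the `EVALUATORS` registry (standing in for loading a class by name) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

CONF_KEY = "spark.sql.adaptive.customCostEvaluatorClass"


class CostEvaluator(ABC):
    @abstractmethod
    def evaluate_cost(self, plan: List[str]) -> int:
        """Return a cost for the plan; lower is better."""


class SimpleCostEvaluator(CostEvaluator):
    def evaluate_cost(self, plan: List[str]) -> int:
        # Default behavior: count the shuffles in the plan.
        return sum(1 for op in plan if op == "Shuffle")


class SortAwareCostEvaluator(CostEvaluator):
    def evaluate_cost(self, plan: List[str]) -> int:
        # Hypothetical custom evaluator that also charges for sorts.
        return sum(1 for op in plan if op in ("Shuffle", "Sort"))


# Stand-in for class loading by name.
EVALUATORS: Dict[str, Type[CostEvaluator]] = {
    "SimpleCostEvaluator": SimpleCostEvaluator,
    "SortAwareCostEvaluator": SortAwareCostEvaluator,
}


def instantiate(conf: Dict[str, str]) -> CostEvaluator:
    """Build the configured evaluator, falling back to the simple one."""
    name = conf.get(CONF_KEY)
    if name is None:
        return SimpleCostEvaluator()
    return EVALUATORS[name]()


plan = ["Scan", "Shuffle", "Sort", "Join"]
print(instantiate({}).evaluate_cost(plan))
print(instantiate({CONF_KEY: "SortAwareCostEvaluator"}).evaluate_cost(plan))
```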

### Why are the changes needed?

Make AQE cost evaluation more flexible.

### Does this PR introduce _any_ user-facing change?

No, but an internal config, `spark.sql.adaptive.customCostEvaluatorClass`, is introduced to allow a custom implementation of `CostEvaluator`.

### How was this patch tested?

Added unit test in `AdaptiveQueryExecSuite.scala`.

Closes #32944 from c21/aqe-cost.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-05 09:06:38 +00:00
Kousuke Saruta 7f70350929 [SPARK-36013][BUILD] Upgrade Dropwizard Metrics to 4.2.2
### What changes were proposed in this pull request?

This PR aims to upgrade Dropwizard Metrics from `4.2.0` to `4.2.2`.

### Why are the changes needed?

Dropwizard `4.2.1` fixes a bug related to `JMXReporter`, but `4.2.1` also contains a bug, so upgrading to `4.2.2` seems better.
https://github.com/dropwizard/metrics/releases/tag/v4.2.1
https://github.com/dropwizard/metrics/releases/tag/v4.2.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33209 from sarutak/upgrade-metrics-4.2.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-05 17:49:50 +09:00
Hyukjin Kwon 554d5fef13 [SPARK-36010][BUILD] Upgrade sbt-antlr4 from 0.8.2 to 0.8.3
### What changes were proposed in this pull request?

This PR proposes to upgrade sbt-antlr4 from 0.8.2 to 0.8.3 per the guides at https://github.com/ihji/sbt-antlr4
I can't find proper official docs for this.

### Why are the changes needed?

To stick to the guides in https://github.com/ihji/sbt-antlr4, and leverage the fixes included.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should test it out.

Closes #33208 from HyukjinKwon/SPARK-36010.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-07-05 16:54:07 +09:00
Kousuke Saruta 6474226852 [SPARK-35982][SQL] Allow from_json/to_json for map types where value types are year-month intervals
### What changes were proposed in this pull request?

This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `year-month` interval types like:
```
spark-sql> select to_json(map('a', interval '1-2' year to  month));
21/07/02 11:38:15 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1-2' year to  month))]
java.lang.RuntimeException: Failed to convert value 14 (class of class java.lang.Integer) with the type of YearMonthIntervalType(0,1) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support parsing `year-month` interval strings from JSON. So the result of the following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1-2' year to month)), 'a interval year to month');
{"a":null}
```

### Why are the changes needed?

There should be no reason why year-month intervals cannot be used as map value types.
`CalendarIntervalTypes` can do it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33181 from sarutak/map-json-yminterval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-05 10:35:50 +03:00
Gengliang Wang 51103cdcdd [SPARK-35987][SQL] The ANSI flags of Sum and Avg should be kept after being copied
### What changes were proposed in this pull request?

Make the ANSI flag part of the parameter lists of the expressions `Sum` and `Average`, instead of fetching it from the session's SQLConf.

### Why are the changes needed?

For Views, it is important to show consistent results even if the ANSI configuration is different in the running session. This is why many expressions like 'Add'/'Divide' make the ANSI flag part of their case class parameter lists.

We should make it consistent for the expressions `Sum` and `Average`.

### Does this PR introduce _any_ user-facing change?

Yes, the `Sum` and `Average` inside a View always behave the same, independent of the ANSI mode SQL configuration in the current session.

### How was this patch tested?

Existing UT

Closes #33186 from gengliangwang/sumAndAvg.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-05 12:34:21 +08:00
Kevin Su 11fcbc73cb [SPARK-36007][INFRA] Failed to run benchmark in GA
### What changes were proposed in this pull request?

When running the benchmark in GitHub Actions (GA), I hit the error below.

https://github.com/pingsutw/spark/runs/2867617238?check_suite_focus=true
```
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
21/06/20 07:40:02 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: spark.test.home is not set!
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:148)
	at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:954)
	at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2(LocalSparkCluster.scala:68)
	at org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2$adapted(LocalSparkCluster.scala:65)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at org.apache.spark.deploy.LocalSparkCluster.start(LocalSparkCluster.scala:65)
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2954)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:559)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:137)
	at org.apache.spark.serializer.KryoSerializerBenchmark$.createSparkContext(KryoSerializerBenchmark.scala:86)
	at org.apache.spark.serializer.KryoSerializerBenchmark$.sc$lzycompute$1(KryoSerializerBenchmark.scala:58)
	at org.apache.spark.serializer.KryoSerializerBenchmark$.sc$1(KryoSerializerBenchmark.scala:58)
	at org.apache.spark.serializer.KryoSerializerBenchmark$.$anonfun$run$3(KryoSerializerBenchmark.scala:63)
```

### Why are the changes needed?

Set `spark.test.home` in the benchmark workflow.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Rerun the benchmark in my fork.
https://github.com/pingsutw/spark/actions/runs/996067851

Closes #33203 from pingsutw/SPARK-36007.

Lead-authored-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: Kevin Su <pingsutw@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-05 09:17:06 +09:00
Tim Armstrong e4273f7098 [SPARK-35980][CORE] ThreadAudit logs whether thread is daemon
### What changes were proposed in this pull request?
Add `daemon={true|false}` to the POSSIBLE THREAD LEAK IN SUITE warning printed by the test framework.
### Why are the changes needed?
This is to slightly accelerate interpretation of that warning, since non-daemon threads can block the process from exiting and are likely to be problematic.

Only affects test code.
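
As a rough analogy in Python (the Spark change itself is in the Scala test framework), the sketch below reports each live thread together with its daemon flag. The `describe_threads` helper and the `leak-demo` thread name are hypothetical.

```python
import threading


def describe_threads(threads):
    """Render one 'name daemon=true|false' entry per thread."""
    return [f"{t.name} daemon={str(t.daemon).lower()}" for t in threads]


stop = threading.Event()
# A daemon thread does not keep the process alive on exit.
worker = threading.Thread(target=stop.wait, name="leak-demo", daemon=True)
worker.start()

report = describe_threads(threading.enumerate())

stop.set()      # let the worker finish
worker.join()
print(report)
```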

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually ran some tests, inspected the output log line.

Closes #33178 from timarmstrong/thread-leak.

Authored-by: Tim Armstrong <tim.armstrong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-04 10:44:00 +09:00
Tomas Pereira de Vasconcelos 495d234c6e [SPARK-35986][PYSPARK] Fix type hint for RDD.histogram's buckets
### What changes were proposed in this pull request?
Fix the type hint for `pyspark.rdd.RDD.histogram`'s `buckets` argument.

### Why are the changes needed?
The current type hint is incomplete.
![image](https://user-images.githubusercontent.com/17701527/124248180-df7fd580-db22-11eb-8391-ba0bb51d689b.png)
From `pyspark.rdd.RDD.histogram`'s source:
```python
if isinstance(buckets, int):
    ...
elif isinstance(buckets, (list, tuple)):
    ...
else:
    raise TypeError("buckets should be a list or tuple or number(int or long)")
```
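
A type hint that covers all three accepted shapes could look like the sketch below. The `Buckets` alias and the `bucket_kind` function are hypothetical and only mirror the `isinstance` branches quoted above; the exact hint used in PySpark may differ.

```python
from typing import List, Tuple, Union

# Hypothetical alias: an int count, or a list/tuple of bucket boundaries.
Buckets = Union[int, List[float], Tuple[float, ...]]


def bucket_kind(buckets: Buckets) -> str:
    """Classify the argument the same way the quoted source branches on it."""
    if isinstance(buckets, int):
        return "count"
    elif isinstance(buckets, (list, tuple)):
        return "boundaries"
    else:
        raise TypeError("buckets should be a list or tuple or number(int or long)")


print(bucket_kind(3))
print(bucket_kind((0.0, 0.5, 1.0)))
```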

### Does this PR introduce _any_ user-facing change?
Fixed the warning displayed above.

### How was this patch tested?
Fixed warning above with this change.

Closes #33185 from tpvasconcelos/master.

Authored-by: Tomas Pereira de Vasconcelos <tomasvasconcelos1@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-04 10:22:57 +09:00
Wenchen Fan e6ce220690 [SPARK-35940][SQL] Refactor EquivalentExpressions to make it more efficient
### What changes were proposed in this pull request?

This PR uses 2 ideas to make `EquivalentExpressions` more efficient:
1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs`
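
The two ideas can be illustrated with a small sketch. This is not the Scala implementation of `EquivalentExpressions`; expressions are modelled as nested tuples `(op, child, ...)`, and all function names are hypothetical.

```python
from collections import Counter


def height(expr):
    """Leaves have height 0; a node is one more than its tallest child."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max((height(child) for child in expr[1:]), default=0)


def count_subexpressions(expr, counts):
    """Idea 1: keep only a use count per expression, not every occurrence."""
    if isinstance(expr, tuple):
        counts[expr] += 1
        for child in expr[1:]:
            count_subexpressions(child, counts)


def common_subexpressions(exprs):
    """Idea 2: order repeated expressions by height, so children come first."""
    counts = Counter()
    for expr in exprs:
        count_subexpressions(expr, counts)
    return sorted((e for e, n in counts.items() if n > 1), key=height)


mul = ("mul", "a", "b")
exprs = [("add", mul, "c"), ("sub", ("neg", mul), ("neg", mul))]
print(common_subexpressions(exprs))
```

Sorting by height is a cheap way to obtain a child-before-parent order, because a subexpression is always strictly shorter than any expression that contains it.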

This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.

### Why are the changes needed?

code cleanup and small perf improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33142 from cloud-fan/codegen.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-03 08:28:44 -07:00
Dongjoon Hyun f9f95686cb [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.3.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.3.0 and the published snapshot version should not conflict with `branch-3.2`.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #33196 from dongjoon-hyun/SPARK-35996.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 13:47:36 -07:00
Kousuke Saruta 6c4616b2ac [SPARK-35990][BUILD] Remove avro-sbt plugin dependency
### What changes were proposed in this pull request?

This PR removes sbt-avro plugin dependency.
On the current master, the SBT build depends on the plugin, but it never seems to be used.
Originally, the plugin was introduced for `flume-sink` in SPARK-1729 (#807) but `flume-sink` is no longer in Spark repository.

After SBT was upgraded to 1.x in SPARK-21708 (#29286), `avroGenerate` part was introduced in `object SQL` in `SparkBuild.scala`.
It's confusing, but as I understand it, `Test / avroGenerate := (Compile / avroGenerate).value` is there to suppress sbt-avro for the `sql` sub-module.
In fact, Test/compile will fail if `Test / avroGenerate := (Compile / avroGenerate).value` is commented out.

`sql` sub-module contains `parquet-compat.avpr` and `parquet-compat.avdl` but according to `sql/core/src/test/README.md`, they are intended to be handled by `gen-avro.sh`.

Also, in terms of Maven build, there seems to be no definition to handle `*.avpr` or `*.avdl`.

Based on the above, I think we can remove `sbt-avro`.

### Why are the changes needed?

If `sbt-avro` is really no longer used, it's confusing that `sbt-avro` related configurations remain in `SparkBuild.scala`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #33190 from sarutak/remove-avro-from-sbt.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 11:00:59 -07:00
Liang-Chi Hsieh a6e00ee9d7 [SPARK-35785][SS][FOLLOWUP] Ignore concurrent update and cleanup test
### What changes were proposed in this pull request?

This patch ignores the test "ensure that concurrent update and cleanup consistent versions" in #32933. The test is currently flaky and we will address it later.

### Why are the changes needed?

Unblock other developments.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33195 from viirya/ignore-rocksdb-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 10:58:48 -07:00
Dongjoon Hyun dcc405743e [SPARK-35994][INFRA] Publish snapshot from branch-3.2
### What changes were proposed in this pull request?

This PR aims to publish snapshot artifacts from branch-3.2 additionally.

### Why are the changes needed?

`GitHub Action`'s cronjob feature is only supported in the default branch. So, to have a daily job, we should add it here.

Currently, it's publishing master and 3.1.
- https://github.com/apache/spark/actions/workflows/publish_snapshot.yml

<img width="273" alt="Screen Shot 2021-07-02 at 10 22 41 AM" src="https://user-images.githubusercontent.com/9700541/124309380-7c407400-db1f-11eb-9aa4-30db61a72b80.png">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33192 from dongjoon-hyun/SPARK-35994.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 10:54:30 -07:00
Dongjoon Hyun c55b9fd1e0 [SPARK-35992][BUILD] Upgrade ORC to 1.6.9
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.9.

### Why are the changes needed?

This is required to bring in ORC-804, which fixes an ORC encryption masking bug.

### Does this PR introduce _any_ user-facing change?

No. This is not released yet.

### How was this patch tested?

Pass the newly added test case.

Closes #33189 from dongjoon-hyun/SPARK-35992.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 09:49:49 -07:00
Takuya UESHIN 77696448db [SPARK-35981][PYTHON][TEST] Use check_exact=False to loosen the check precision
### What changes were proposed in this pull request?

We should use `check_exact=False` because the value check in `StatsTest.test_cov_corr_meta` is too strict.

### Why are the changes needed?

In some environments, the precision of pandas' `DataFrame.corr` function can differ, and the test `StatsTest.test_cov_corr_meta` fails.

```
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
DataFrame.iloc[:, 0] (column name="a") values are different (14.28571 %)
[index]: [a, b, c, d, e, f, g]
[left]:  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[right]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]
```
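
The effect of a tolerance-based check on the values above can be reproduced with the standard library. This sketch uses `math.isclose` as a stand-in for the comparison that `check_exact=False` enables in pandas; the tolerances are illustrative assumptions.

```python
import math

left = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
right = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]

# Exact comparison fails because of the tiny floating-point residue.
exact_match = left == right

# Tolerance-based comparison; rel_tol and abs_tol are assumed values.
close_match = all(
    math.isclose(a, b, rel_tol=1e-5, abs_tol=1e-8) for a, b in zip(left, right)
)

print(exact_match, close_match)
```

An absolute tolerance is needed here because one side is exactly `0.0`, where a purely relative tolerance never matches.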

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified tests should still pass.

Closes #33179 from ueshin/issuse/SPARK-35981/corr.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-02 17:58:10 +09:00
Wenchen Fan 0c9c8ff569 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.

However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.

This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
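
The idea of a minimum partition size can be illustrated with a simplified sketch. It is not Spark's coalescing algorithm: the `coalesce` function, its two-pass strategy, and the parameter names are assumptions made for illustration only.

```python
def coalesce(sizes, target_size, min_size):
    """Group adjacent partition sizes.

    Pass 1 greedily packs neighbours without exceeding target_size.
    Pass 2 folds a group into the previous one when either of the two
    is smaller than min_size.
    """
    groups, current, current_size = [], [], 0
    for size in sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)

    merged = []
    for group in groups:
        if merged and (sum(group) < min_size or sum(merged[-1]) < min_size):
            merged[-1] = merged[-1] + group
        else:
            merged.append(group)
    return merged


print(coalesce([10, 1, 1, 60, 70, 2], target_size=64, min_size=5))
```

In this sketch the trailing partition of size 2 is merged into its neighbour even though the result exceeds the target, which reflects the stated priority of avoiding partitions that are too small.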

### Why are the changes needed?

AQE is on by default now, so we should make the perf better in the default case.

### Does this PR introduce _any_ user-facing change?

yes, a new config.

### How was this patch tested?

new tests

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:31 +08:00
Yuanjian Li ca6acf0839 [SPARK-35785][SS] Cleanup support for RocksDB instance
### What changes were proposed in this pull request?
Add the functionality of cleaning up files of old versions for the RocksDB instance and RocksDBFileManager.

### Why are the changes needed?
Part of the implementation of RocksDB state store.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #32933 from xuanyuanking/SPARK-35785.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-02 00:47:55 -07:00
Dongjoon Hyun 79a6e00b76 [SPARK-35825][INFRA][FOLLOWUP] Increase it in build/mvn script
### What changes were proposed in this pull request?

This is a follow up of https://github.com/apache/spark/pull/32961.

This PR additionally sets the stack size in `build/mvn`.

### Why are the changes needed?

We are still hitting `StackOverflowError` in Jenkins.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/3064/console
```
[INFO] compiling 166 Scala sources and 19 Java sources to /home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/sql/catalyst/target/scala-2.12/classes ...
[ERROR] ## Exception when compiling 480 sources to /home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
```

This PR increases the JVM stack size of `mvn` itself instead of the plugin's.

```
$ MAVEN_OPTS="-XX:+PrintFlagsFinal" build/mvn clean | grep 'intx ThreadStackSize'
     intx ThreadStackSize                           = 2048                                {pd product}

$ MAVEN_OPTS="-Xss128m -XX:+PrintFlagsFinal" build/mvn clean | grep 'intx ThreadStackSize'
     intx ThreadStackSize                          := 131072                              {pd product}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33180 from dongjoon-hyun/SPARK-35825.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-01 22:24:48 -07:00
Karen Feng 1fda011d71 [SPARK-35955][SQL] Check for overflow in Average in ANSI mode
### What changes were proposed in this pull request?

Fixes decimal overflow issues for decimal average in ANSI mode, so that overflows throw an exception rather than returning null.
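
The behavioural difference can be modelled with a small sketch. This is a simplified model using Python's `decimal` module, not Spark's aggregation code; `checked_sum` and its parameters are hypothetical, and only the error message text is taken from the stack trace in this commit.

```python
from decimal import Decimal


def checked_sum(values, precision, scale, ansi_enabled):
    """Sum decimals; on overflow raise in ANSI mode, otherwise return None."""
    # Smallest magnitude that no longer fits in decimal(precision, scale).
    limit = Decimal(10) ** (precision - scale)
    total = sum(values, Decimal(0))
    if abs(total) >= limit:
        if ansi_enabled:
            raise ArithmeticError("Overflow in sum of decimals.")
        return None
    return total


values = [Decimal("10000000000000000000")] * 12
print(checked_sum(values, precision=38, scale=18, ansi_enabled=False))
```

With `ansi_enabled=True` the same call raises instead of silently producing a null-like result.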

### Why are the changes needed?

Query:

```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> spark.conf.set("spark.sql.ansi.enabled", true)

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(mean("decNum"))
df2: org.apache.spark.sql.DataFrame = [avg(decNum): decimal(38,22)]

scala> df2.show(40,false)
```

Before:
```
+-----------+
|avg(decNum)|
+-----------+
|null       |
+-----------+
```

After:
```
21/07/01 19:48:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.lang.ArithmeticException: Overflow in sum of decimals.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInSumOfDecimalError(QueryExecutionErrors.scala:162)
	at org.apache.spark.sql.errors.QueryExecutionErrors.overflowInSumOfDecimalError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:499)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33177 from karenfeng/SPARK-35955.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-02 12:41:24 +08:00
Rahul Mahadev 47485a3c2d [SPARK-35897][SS] Support user defined initial state with flatMapGroupsWithState in Structured Streaming
### What changes were proposed in this pull request?
This PR aims to add support for specifying a user defined initial state for arbitrary structured streaming stateful processing using the `[flat]MapGroupsWithState` operator.

### Why are the changes needed?
Users can load previous state of their stateful processing as an initial state instead of redoing the entire processing once again.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces new APIs:
```
  def mapGroupsWithState[S: Encoder, U: Encoder](
      timeoutConf: GroupStateTimeout,
      initialState: KeyValueGroupedDataset[K, S])(
      func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

  def flatMapGroupsWithState[S: Encoder, U: Encoder](
      outputMode: OutputMode,
      timeoutConf: GroupStateTimeout,
      initialState: KeyValueGroupedDataset[K, S])(
      func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]
```

### How was this patch tested?

Through unit tests in FlatMapGroupsWithStateSuite

Closes #33093 from rahulsmahadev/flatMapGroupsWithState.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-02 11:53:17 +08:00
Xinrong Meng 95d94948c5 [SPARK-35339][PYTHON] Improve unit tests for data-type-based basic operations
### What changes were proposed in this pull request?

Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes

### Why are the changes needed?

Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #33095 from xinrong-databricks/datatypeops_test.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-01 17:37:32 -07:00
Anton Okolnychyi fceabe2372 [SPARK-35779][SQL] Dynamic filtering for Data Source V2
### What changes were proposed in this pull request?

This PR implemented the proposal per [design doc](https://docs.google.com/document/d/1RfFn2e9o_1uHJ8jFGsSakp-BZMizX1uRrJSybMe2a6M) for SPARK-35779.

### Why are the changes needed?

Spark supports dynamic partition filtering that enables reusing parts of the query to skip unnecessary partitions in the larger table during joins. This optimization has proven to be beneficial for star-schema queries which are common in the industry. Unfortunately, dynamic pruning is currently limited to partition pruning during joins and is only supported for built-in v1 sources. As more and more Spark users migrate to Data Source V2, it is important to generalize dynamic filtering and expose it to all v2 connectors.

Please see the design doc for more information on this effort.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new optional mix-in interface for `Scan` in Data Source V2.

### How was this patch tested?

This PR comes with tests.

Closes #32921 from aokolnychyi/dynamic-filtering-wip.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-01 17:00:12 -07:00
Gengliang Wang a643076d4e [SPARK-35975][SQL] New configuration spark.sql.timestampType for the default timestamp type
### What changes were proposed in this pull request?

Add a new configuration `spark.sql.timestampType`, which configures the default timestamp type of Spark SQL, including SQL DDL and Cast clauses. Setting the configuration to `TIMESTAMP_NTZ` will use `TIMESTAMP WITHOUT TIME ZONE` as the default type, while setting it to `TIMESTAMP_LTZ` will use `TIMESTAMP WITH LOCAL TIME ZONE`.

The default value of the new configuration is TIMESTAMP_LTZ, which is consistent with previous Spark releases.

### Why are the changes needed?

A new configuration is needed for switching the default timestamp type to timestamp without time zone.

### Does this PR introduce _any_ user-facing change?

No, it's a new feature.

### How was this patch tested?

Unit test

Closes #33176 from gengliangwang/newTsTypeConf.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-01 23:25:18 +03:00
SaurabhChawla ca1217667c [SPARK-35756][SQL] unionByName supports struct having same col names but different sequence
### What changes were proposed in this pull request?

`unionByName` does not support structs that have the same column names in a different order:
```
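// Field order inferred from the error message below.
case class Struct1(c1: Int, c2: Int)
case class Struct2(c2: Int, c1: Int)
val df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
val df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
val unionDF = df1.unionByName(df2)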
```
it gives the exception

`org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table; 'Union false, false :- LocalRelation [_1#38, _2#39] +- LocalRelation _1#45, _2#46`

In this case the column names are the same, so `unionByName` should compare the fields inside the struct by name and, if the names match, perform the union instead of throwing this exception.

After the fix, we get the following result:

```
val unionDF = df1.unionByName(df2)
scala>  unionDF.show
+---+------+
|  a|     b|
+---+------+
|  1|{1, 2}|
|  1|{2, 1}|
+---+------+

```
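
The by-name alignment behind this result can be sketched in plain Python. This is not Spark code: structs are modelled as dictionaries, and `align_struct` is a hypothetical helper.

```python
def align_struct(value, target_fields):
    """Reorder a struct's fields to match target_fields, matching by name."""
    if set(value) != set(target_fields):
        raise ValueError("structs have different field names")
    return {name: value[name] for name in target_fields}


left = {"c1": 1, "c2": 2}    # struct<c1:int,c2:int>
right = {"c2": 1, "c1": 2}   # struct<c2:int,c1:int>

aligned = align_struct(right, list(left))
print(tuple(left.values()), tuple(aligned.values()))
```

The second tuple corresponds to the `{2, 1}` row in the output above: the values follow their field names, not their original positions.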

### Why are the changes needed?
`unionByName` performs the union based on column names. For structs, the case where all the field names are the same but their order differs was not handled, so this PR adds that functionality.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test and also tested through the Spark shell.

Closes #32972 from SaurabhChawla100/SPARK-35756.

Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-01 17:37:09 +00:00
Kent Yao 94c1e3c38c [SPARK-35969][K8S] Make the pod prefix more readable and tallied with K8S DNS Label Names
### What changes were proposed in this pull request?

By default, the executor pod prefix is generated from the app name. Characters that match `[^a-z0-9\\-]` are handled inconsistently: '.' and all whitespace are converted to '-', but the others are removed. This is a problem in particular because characters like '_' and '|' are commonly used as word separators in many languages.

According to the K8S DNS label name rules (see https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names), we can convert all special characters to `-`.

 
For example,

```
scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-")
res11: String = xyz-abc-i-am-a-app-name-w-some-abbrs

scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-")
res12: String = xyzabciamaappnamewsomeabbrs
```

```scala
scala> "time.is%the¥most$valuable_——————thing,it's about time.".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-")
res9: String = time-is-the-most-valuable-thing-it-s-about-time-

scala> "time.is%the¥most$valuable_——————thing,it's about time.".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-")
res10: String = time-isthemostvaluablethingits-about-time-

```

### Why are the changes needed?

For better UX

### Does this PR introduce _any_ user-facing change?

yes, the executor pod name might look better
### How was this patch tested?

Added new tests.

Closes #33171 from yaooqinn/SPARK-35969.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-01 08:15:00 -07:00
Gengliang Wang 3acc4b973b [SPARK-35971][SQL] Rename the type name of TimestampNTZType as "timestamp_ntz"
### What changes were proposed in this pull request?

Rename the type name string of TimestampNTZType from "timestamp without time zone" to "timestamp_ntz".

### Why are the changes needed?

This is to make the column header shorter and simpler.
Snowflake and Flink use a similar approach:
https://docs.snowflake.com/en/sql-reference/data-types-datetime.html
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/concepts/timezone/

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit tests

Closes #33173 from gengliangwang/reviseTypeName.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-01 20:50:19 +08:00