spark-instrumented-optimizer/python/pyspark/sql/tests
Wenchen Fan 0c9c8ff569 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE sets `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the Spark default parallelism, which is usually quite large. This keeps the parallelism on par with non-AQE execution and avoids perf regressions.

However, this usually leads to many small or empty partitions, which hurts performance (although it is no worse than non-AQE). As a workaround, users often set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1 without further tuning, which makes the config largely useless.

This PR adds a new config that sets the min partition size, to avoid overly small partitions after coalescing. By default, Spark does not respect the target size and only respects this min partition size, to maximize parallelism and avoid perf regressions in AQE. This PR also adds a bool config that makes Spark respect the target size when coalescing partitions; setting it is recommended for better overall performance. Finally, this PR deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
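
To make the interplay between a target size and a min partition size concrete, here is a minimal sketch in plain Python. It is an illustrative model only, not Spark's actual coalescing code: the function name `coalesce_partitions` and its parameters are hypothetical, and the greedy two-pass strategy is an assumption chosen for readability.

```python
from typing import List


def coalesce_partitions(
    sizes: List[int], target_size: int, min_size: int
) -> List[List[int]]:
    """Group adjacent partition indices (hypothetical helper, not a Spark API).

    Pass 1 packs adjacent partitions greedily up to `target_size`.
    Pass 2 merges any group smaller than `min_size` into its neighbor,
    so that no tiny partition survives when a neighbor exists.
    """
    # Pass 1: greedy packing of adjacent partitions up to the target size.
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)

    def group_size(group: List[int]) -> int:
        return sum(sizes[i] for i in group)

    # Pass 2: fold undersized groups into the previous group.
    merged: List[List[int]] = []
    for group in groups:
        too_small = merged and (
            group_size(merged[-1]) < min_size or group_size(group) < min_size
        )
        if too_small:
            merged[-1] = merged[-1] + group
        else:
            merged.append(group)
    return merged


if __name__ == "__main__":
    # Sizes are in arbitrary units; partitions 1 and 2 are empty.
    print(coalesce_partitions([10, 0, 0, 5, 100, 3], target_size=64, min_size=8))
    # prints [[0, 1, 2, 3], [4, 5]]
```

In this model the trailing 3-unit partition is folded into its neighbor even though the merged group exceeds the target size, which reflects the idea that the min size is enforced independently of the target size.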

### Why are the changes needed?

AQE is now on by default, so we should improve performance in the default case.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

New tests.

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:31 +08:00
| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files | 2018-11-14 14:51:11 +08:00 |
| `test_arrow.py` | [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception | 2021-05-26 11:54:40 +09:00 |
| `test_catalog.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_column.py` | [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs | 2021-02-02 09:29:40 +09:00 |
| `test_conf.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_context.py` | [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py | 2020-09-28 21:54:00 -07:00 |
| `test_dataframe.py` | [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing | 2021-07-02 16:07:31 +08:00 |
| `test_datasources.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_functions.py` | [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs | 2021-05-13 14:58:01 +09:00 |
| `test_group.py` | [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs | 2021-02-02 09:29:40 +09:00 |
| `test_pandas_cogrouped_map.py` | [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas | 2021-02-02 16:25:32 +09:00 |
| `test_pandas_grouped_map.py` | [SPARK-33489][PYSPARK] Add NullType support for Arrow executions | 2021-01-25 11:34:47 +09:00 |
| `test_pandas_map.py` | [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas | 2021-02-02 16:25:32 +09:00 |
| `test_pandas_udf.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_pandas_udf_grouped_agg.py` | [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests | 2021-03-04 10:03:54 +09:00 |
| `test_pandas_udf_scalar.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_pandas_udf_typehints.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_pandas_udf_window.py` | [SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests | 2021-03-04 10:03:54 +09:00 |
| `test_readwriter.py` | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| `test_serde.py` | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| `test_session.py` | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| `test_streaming.py` | [SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception | 2021-05-26 11:54:40 +09:00 |
| `test_types.py` | [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests | 2020-12-01 10:34:40 +09:00 |
| `test_udf.py` | [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite | 2021-03-07 19:12:42 -06:00 |
| `test_utils.py` | [SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters | 2021-03-29 09:31:24 +00:00 |