spark-instrumented-optimizer

itholic 2537fe8cba [SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame	2021-07-07 15:14:18 +09:00

### What changes were proposed in this pull request?

Currently, nested structs are always inferred as `MapType`. This causes a problem because the value type of the map is taken from the first field of the struct, as shown below:

```python
data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}]
df = spark.createDataFrame(data)
df.show(truncate=False)
+--------------------------------+
|inside_struct                   |
+--------------------------------+
|{name -> null, payment -> 100.5}|
+--------------------------------+
```

The "name" became `null`, but it should have been `"Lee"`.

In this case, we need to be able to infer the schema with a `StructType` instead of a `MapType`.

Therefore, this PR proposes adding a new configuration, `spark.sql.pyspark.inferNestedDictAsStruct.enabled`, to control which type is used when inferring nested structs.

- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `false` (the default), nested structs are inferred as `MapType`
- When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `true`, nested structs are inferred as `StructType`

### Why are the changes needed?

Because always inferring nested structs as `MapType` doesn't work properly in some cases.

### Does this PR introduce _any_ user-facing change?

A new configuration, `spark.sql.pyspark.inferNestedDictAsStruct.enabled`, is added.

### How was this patch tested?

Added a unit test.

Closes #33214 from itholic/SPARK-35929.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
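The difference between the two inference modes described in the commit message can be illustrated with a toy model in plain Python. This is only a sketch, not PySpark's implementation: the helper names `infer_as_map`, `read_as_map`, and `infer_as_struct` are hypothetical, and the rule that mismatched values become `None` is an assumption chosen to mirror the `null` in the output shown in the commit message.

```python
from typing import Any, Dict


def infer_as_map(nested: Dict[str, Any]) -> type:
    """Hypothetical map-style inference: a single value type for the whole
    dict, taken from the first non-None value encountered."""
    for value in nested.values():
        if value is not None:
            return type(value)
    return type(None)


def read_as_map(nested: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the single inferred value type to every entry. Values of any
    other type are replaced by None (assumed here to mimic the null)."""
    value_type = infer_as_map(nested)
    return {k: (v if isinstance(v, value_type) else None) for k, v in nested.items()}


def infer_as_struct(nested: Dict[str, Any]) -> Dict[str, type]:
    """Hypothetical struct-style inference: each field keeps its own type."""
    return {k: type(v) for k, v in nested.items()}


# Same nested dict as in the commit message.
nested = {"payment": 100.5, "name": "Lee"}

as_map = read_as_map(nested)             # "name" is lost: float was inferred for all values
struct_schema = infer_as_struct(nested)  # one type per field, so nothing is lost
```

In this model the map-style reading loses `"Lee"` because only one value type is allowed, while the struct-style schema records a separate type for each field. In a live session, the flag named in the commit message would presumably be enabled through `spark.conf.set` before calling `spark.createDataFrame`.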
__init__.py	[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files	2018-11-14 14:51:11 +08:00
test_arrow.py	[SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception	2021-05-26 11:54:40 +09:00
test_catalog.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_column.py	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs	2021-02-02 09:29:40 +09:00
test_conf.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_context.py	[SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py	2020-09-28 21:54:00 -07:00
test_dataframe.py	[SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing	2021-07-02 16:07:31 +08:00
test_datasources.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_functions.py	[SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs	2021-05-13 14:58:01 +09:00
test_group.py	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs	2021-02-02 09:29:40 +09:00
test_pandas_cogrouped_map.py	[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas	2021-02-02 16:25:32 +09:00
test_pandas_grouped_map.py	[SPARK-33489][PYSPARK] Add NullType support for Arrow executions	2021-01-25 11:34:47 +09:00
test_pandas_map.py	[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas	2021-02-02 16:25:32 +09:00
test_pandas_udf.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_pandas_udf_grouped_agg.py	[SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests	2021-03-04 10:03:54 +09:00
test_pandas_udf_scalar.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_pandas_udf_typehints.py	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests	2020-12-01 10:34:40 +09:00
test_pandas_udf_window.py	[SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests	2021-03-04 10:03:54 +09:00
test_readwriter.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_serde.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_session.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_streaming.py	[SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception	2021-05-26 11:54:40 +09:00
test_types.py	[SPARK-35929][PYTHON] Support to infer nested dict as a struct when creating a DataFrame	2021-07-07 15:14:18 +09:00
test_udf.py	[SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite	2021-03-07 19:12:42 -06:00
test_utils.py	[SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters	2021-03-29 09:31:24 +00:00