2537fe8cba
### What changes were proposed in this pull request? Currently, inferring nested structs is always using `MapType`. This behavior causes an issue because it infers the schema with a value type of the first field of the struct as below: ```python data = [{"inside_struct": {"payment": 100.5, "name": "Lee"}}] df = spark.createDataFrame(data) df.show(truncate=False) +--------------------------------+ |inside_struct | +--------------------------------+ |{name -> null, payment -> 100.5}| +--------------------------------+ ``` The "name" became `null`, but it should've been `"Lee"`. In this case, we need to be able to infer the schema with a `StructType` instead of a `MapType`. Therefore, this PR proposes adding an new configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` to handle which type is used for inferring nested structs. - When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `false` (by default), inferring nested structs by `MapType` - When `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is `true`, inferring nested structs by `StructType` ### Why are the changes needed? Because always inferring the nested structs by `MapType` doesn't work properly for some cases. ### Does this PR introduce _any_ user-facing change? New configuration `spark.sql.pyspark.inferNestedDictAsStruct.enabled` is added. ### How was this patch tested? Added an unit test Closes #33214 from itholic/SPARK-35929. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> |
||
---|---|---|
.. | ||
__init__.py | ||
test_arrow.py | ||
test_catalog.py | ||
test_column.py | ||
test_conf.py | ||
test_context.py | ||
test_dataframe.py | ||
test_datasources.py | ||
test_functions.py | ||
test_group.py | ||
test_pandas_cogrouped_map.py | ||
test_pandas_grouped_map.py | ||
test_pandas_map.py | ||
test_pandas_udf.py | ||
test_pandas_udf_grouped_agg.py | ||
test_pandas_udf_scalar.py | ||
test_pandas_udf_typehints.py | ||
test_pandas_udf_window.py | ||
test_readwriter.py | ||
test_serde.py | ||
test_session.py | ||
test_streaming.py | ||
test_types.py | ||
test_udf.py | ||
test_utils.py |