[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark
### What changes were proposed in this pull request?

This PR proposes to add a new argument, `allowMissingColumns`, to `unionByName` so that users can specify whether missing columns are allowed.

### Why are the changes needed?

To expose the `allowMissingColumns` argument in the Python API as well. Currently it is only exposed in the Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new example with the new argument to the docstring.

### How was this patch tested?

A doctest was added, and the change was tested manually:

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
parent c43460cf82
commit 8bd3770552
@@ -1548,7 +1548,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self.union(other)
 
     @since(2.3)
-    def unionByName(self, other):
+    def unionByName(self, other, allowMissingColumns=False):
         """ Returns a new :class:`DataFrame` containing union of rows in this and another
         :class:`DataFrame`.
 
@@ -1567,8 +1567,28 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         |   1|   2|   3|
         |   6|   4|   5|
         +----+----+----+
+
+        When the parameter `allowMissingColumns` is ``True``,
+        this function allows different set of column names between two :class:`DataFrame`\\s.
+        Missing columns at each side, will be filled with null values.
+        The missing columns at left :class:`DataFrame` will be added at the end in the schema
+        of the union result:
+
+        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
+        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
+        >>> df1.unionByName(df2, allowMissingColumns=True).show()
+        +----+----+----+----+
+        |col0|col1|col2|col3|
+        +----+----+----+----+
+        |   1|   2|   3|null|
+        |null|   4|   5|   6|
+        +----+----+----+----+
+
+        .. versionchanged:: 3.1.0
+           Added optional argument `allowMissingColumns` to specify whether to allow
+           missing columns.
         """
-        return DataFrame(self._jdf.unionByName(other._jdf), self.sql_ctx)
+        return DataFrame(self._jdf.unionByName(other._jdf, allowMissingColumns), self.sql_ctx)
 
     @since(1.3)
     def intersect(self, other):
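The column-alignment rule described in the new docstring can be modeled without a Spark session. The following is a minimal pure-Python sketch, not the Spark implementation: the function `union_by_name`, its list-based inputs, and the use of `ValueError` and `None` (standing in for Spark's analysis error and SQL null) are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[int]


def union_by_name(
    left_cols: Sequence[str],
    left_rows: Sequence[Sequence[Cell]],
    right_cols: Sequence[str],
    right_rows: Sequence[Sequence[Cell]],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[List[Cell]]]:
    """Hypothetical model of unionByName semantics (illustration only)."""
    # Columns present on one side but absent on the other.
    missing_in_left = [c for c in right_cols if c not in left_cols]
    missing_in_right = [c for c in left_cols if c not in right_cols]

    if (missing_in_left or missing_in_right) and not allow_missing_columns:
        # Stand-in for the error Spark raises when column sets differ.
        raise ValueError("column sets differ and missing columns are not allowed")

    # Result schema: the left columns, then the columns missing from the left.
    out_cols = list(left_cols) + missing_in_left

    out_rows: List[List[Cell]] = []
    for cols, rows in ((left_cols, left_rows), (right_cols, right_rows)):
        for row in rows:
            by_name = dict(zip(cols, row))
            # Resolve each output column by name; absent ones become None.
            out_rows.append([by_name.get(c) for c in out_cols])
    return out_cols, out_rows


# Same data as the doctest in the diff above.
cols, rows = union_by_name(
    ["col0", "col1", "col2"], [[1, 2, 3]],
    ["col1", "col2", "col3"], [[4, 5, 6]],
    allow_missing_columns=True,
)
print(cols)  # ['col0', 'col1', 'col2', 'col3']
print(rows)  # [[1, 2, 3, None], [None, 4, 5, 6]]
```

With `allow_missing_columns` left at its default of `False`, the same call raises in this sketch, mirroring the fact that the new argument defaults to `False` and existing callers keep the stricter behavior.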