[SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark

### What changes were proposed in this pull request?

This PR adds a new `allowMissingColumns` argument to PySpark's `unionByName`, letting users specify whether columns missing from either `DataFrame` are allowed. It defaults to `False`, which keeps the existing behavior.
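
The column handling can be illustrated without a Spark session. The sketch below is a pure-Python model, not Spark code: `union_by_name`, its parameters, and the `ValueError` (a stand-in for the analysis error Spark reports when column names do not match) are hypothetical names introduced only for illustration. It assumes the behavior described in the docstring: the left schema order is kept, and columns found only on the right are appended at the end.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def union_by_name(
    left_cols: List[str],
    left_rows: List[Row],
    right_cols: List[str],
    right_rows: List[Row],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[Row]]:
    """Hypothetical pure-Python model of the column handling; not Spark code."""
    if allow_missing_columns:
        # Keep the left schema order, then append columns found only on the right.
        out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]
    else:
        # Stand-in for Spark's analysis error when the column names differ.
        if set(left_cols) != set(right_cols):
            raise ValueError("column names differ between the two inputs")
        out_cols = list(left_cols)
    # Resolve every value by name; None plays the role of SQL null.
    rows = [{c: row.get(c) for c in out_cols} for row in left_rows + right_rows]
    return out_cols, rows


left_cols, left_rows = ["col0", "col1", "col2"], [{"col0": 1, "col1": 2, "col2": 3}]
right_cols, right_rows = ["col1", "col2", "col3"], [{"col1": 4, "col2": 5, "col3": 6}]

cols, rows = union_by_name(left_cols, left_rows, right_cols, right_rows,
                           allow_missing_columns=True)
print(cols)
for row in rows:
    print([row[c] for c in cols])
```

With the same inputs as the new doctest, the model yields the columns `col0, col1, col2, col3` and fills `col3` of the first row and `col0` of the second row with `None`.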

### Why are the changes needed?

To expose the `allowMissingColumns` argument in the Python API as well. Currently it is exposed only in the Scala/Java APIs.

### Does this PR introduce _any_ user-facing change?

Yes. `unionByName` in PySpark accepts a new optional `allowMissingColumns` argument, and the docstring gains an example that uses it.

### How was this patch tested?

A doctest was added, and the change was tested manually:

```
$ python/run-tests --testnames pyspark.sql.dataframe
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['/.../python3', 'python3.8']
Will test the following Python tests: ['pyspark.sql.dataframe']
/.../python3 python_implementation is CPython
/.../python3 version is: Python 3.8.5
python3.8 python_implementation is CPython
python3.8 version is: Python 3.8.5
Starting test(/.../python3): pyspark.sql.dataframe
Starting test(python3.8): pyspark.sql.dataframe
Finished test(python3.8): pyspark.sql.dataframe (35s)
Finished test(/.../python3): pyspark.sql.dataframe (35s)
Tests passed in 35 seconds
```

Closes #29657 from itholic/SPARK-32798.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
itholic 2020-09-08 09:41:02 +09:00 committed by HyukjinKwon
parent c43460cf82
commit 8bd3770552


@@ -1548,7 +1548,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         return self.union(other)
 
     @since(2.3)
-    def unionByName(self, other):
+    def unionByName(self, other, allowMissingColumns=False):
         """ Returns a new :class:`DataFrame` containing union of rows in this and another
         :class:`DataFrame`.
 
@@ -1567,8 +1567,28 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         |   1|   2|   3|
         |   6|   4|   5|
         +----+----+----+
+
+        When the parameter `allowMissingColumns` is ``True``,
+        this function allows different set of column names between two :class:`DataFrame`\\s.
+        Missing columns at each side, will be filled with null values.
+        The missing columns at left :class:`DataFrame` will be added at the end in the schema
+        of the union result:
+
+        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
+        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
+        >>> df1.unionByName(df2, allowMissingColumns=True).show()
+        +----+----+----+----+
+        |col0|col1|col2|col3|
+        +----+----+----+----+
+        |   1|   2|   3|null|
+        |null|   4|   5|   6|
+        +----+----+----+----+
+
+        .. versionchanged:: 3.1.0
+            Added optional argument `allowMissingColumns` to specify whether to allow
+            missing columns.
         """
-        return DataFrame(self._jdf.unionByName(other._jdf), self.sql_ctx)
+        return DataFrame(self._jdf.unionByName(other._jdf, allowMissingColumns), self.sql_ctx)
 
     @since(1.3)
     def intersect(self, other):