f2f09e4cdb
This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.
Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```
After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```
This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.
We should follow the behavior of pandas as much as possible.
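The "After" output above is consistent with a multiset union, in which each value appears as many times as its larger count on either side. The following is a minimal plain-Python sketch of that rule, not the pandas or pandas-on-Spark implementation; the helper name `multiset_union` is hypothetical and used only for illustration.

```python
from collections import Counter


def multiset_union(left, right):
    """Hypothetical helper: keep each value max(count in left, count in right) times."""
    # Counter's `|` operator takes the element-wise maximum of the counts.
    merged = Counter(left) | Counter(right)
    # elements() repeats each value by its count; sort for a stable order.
    return sorted(merged.elements())


print(multiset_union([1, 1, 1, 1, 1, 2, 2], [1, 1, 2, 2, 2, 2, 2]))
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

Here `1` occurs five times on the left and `2` occurs five times on the right, so the result holds five of each, matching the ten-element index shown in the "After" example.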
Yes, the result will change in some cases where the indexes contain duplicate values.
Unit test.
Closes #33634 from itholic/SPARK-36369.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit
| Name |
| --- |
| cloudpickle |
| ml |
| mllib |
| pandas |
| resource |
| sql |
| streaming |
| testing |
| tests |
| `__init__.py` |
| `__init__.pyi` |
| `_globals.py` |
| `_typing.pyi` |
| `accumulators.py` |
| `accumulators.pyi` |
| `broadcast.py` |
| `broadcast.pyi` |
| `conf.py` |
| `conf.pyi` |
| `context.py` |
| `context.pyi` |
| `daemon.py` |
| `files.py` |
| `files.pyi` |
| `find_spark_home.py` |
| `install.py` |
| `java_gateway.py` |
| `join.py` |
| `profiler.py` |
| `profiler.pyi` |
| `py.typed` |
| `rdd.py` |
| `rdd.pyi` |
| `rddsampler.py` |
| `resultiterable.py` |
| `resultiterable.pyi` |
| `serializers.py` |
| `shell.py` |
| `shuffle.py` |
| `statcounter.py` |
| `statcounter.pyi` |
| `status.py` |
| `status.pyi` |
| `storagelevel.py` |
| `storagelevel.pyi` |
| `taskcontext.py` |
| `taskcontext.pyi` |
| `traceback_utils.py` |
| `util.py` |
| `util.pyi` |
| `version.py` |
| `version.pyi` |
| `worker.py` |