spark-instrumented-optimizer/python/docs/source/user_guide/ps_pandas_pyspark.rst
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports the documentation almost as-is, except for these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation

Other than that,

- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installation

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentation.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00

===============================
Working with pandas and PySpark
===============================

.. currentmodule:: pyspark.pandas

Users familiar with pandas and/or PySpark sometimes face API compatibility issues when
they work with the pandas API on Spark. Since the pandas API on Spark does not target
100% compatibility with either pandas or PySpark, users need to apply some workarounds
to port their pandas and/or PySpark code, or get familiar with the pandas API on Spark.
This page aims to describe these workarounds.

pandas
------

pandas users can access the full pandas API by calling :func:`DataFrame.to_pandas`.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is
distributed across multiple machines while the latter resides on a single machine. When
converting between the two, the data is transferred between the cluster's machines and
the single client machine.

For example, if you need to call ``pandas_df.values`` on a pandas DataFrame, you can do
it as below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>>
    >>> kdf = ks.range(10)
    >>> pdf = kdf.to_pandas()
    >>> pdf.values
    array([[0],
           [1],
           [2],
           [3],
           [4],
           [5],
           [6],
           [7],
           [8],
           [9]])

A pandas DataFrame can also be converted to a pandas-on-Spark DataFrame easily, as below:

.. code-block:: python

    >>> ks.from_pandas(pdf)
       id
    0   0
    1   1
    2   2
    3   3
    4   4
    5   5
    6   6
    7   7
    8   8
    9   9
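
``from_pandas`` also accepts pandas ``Series`` and ``Index`` objects, not only
DataFrames. Below is a minimal sketch converting a small, hypothetical pandas Series:

.. code-block:: python

    >>> import pandas as pd
    >>>
    >>> # A pandas Series becomes a distributed pandas-on-Spark Series.
    >>> ks.from_pandas(pd.Series([1, 2, 3]))
    0    1
    1    2
    2    3
    dtype: int64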

Note that converting a pandas-on-Spark DataFrame to pandas requires collecting all the
data into the client machine; therefore, if possible, it is recommended to use
pandas-on-Spark or PySpark APIs instead.
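
For instance, an aggregation can often be computed with the distributed API directly
instead of converting to pandas first. Below is a minimal sketch; the ``sum`` call is
just one arbitrary example:

.. code-block:: python

    >>> kdf = ks.range(10)
    >>> # Computed in a distributed manner; only the scalar result is
    >>> # returned, so the data is never collected to the client machine.
    >>> kdf['id'].sum()
    45
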
PySpark
-------

PySpark users can access the full PySpark API by calling :func:`DataFrame.to_spark`.
pandas-on-Spark DataFrames and Spark DataFrames are virtually interchangeable.

For example, if you need to call ``spark_df.filter(...)`` on a Spark DataFrame, you can
do it as below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>>
    >>> kdf = ks.range(10)
    >>> sdf = kdf.to_spark().filter("id > 5")
    >>> sdf.show()
    +---+
    | id|
    +---+
    |  6|
    |  7|
    |  8|
    |  9|
    +---+

A Spark DataFrame can also be converted to a pandas-on-Spark DataFrame easily, as below:

.. code-block:: python

    >>> sdf.to_pandas_on_spark()
       id
    0   6
    1   7
    2   8
    3   9

However, note that a new default index is created when a pandas-on-Spark DataFrame is
created from a Spark DataFrame. See `Default Index Type <options.rst#default-index-type>`_.
In order to avoid this overhead, specify the column to use as an index when possible.

.. code-block:: python

    >>> # Create a pandas-on-Spark DataFrame with an explicit index.
    ... kdf = ks.DataFrame({'id': range(10)}, index=range(10))
    >>> # Keep the explicit index.
    ... sdf = kdf.to_spark(index_col='index')
    >>> # Call Spark APIs.
    ... sdf = sdf.filter("id > 5")
    >>> # Use the explicit index to avoid creating a default index.
    ... sdf.to_pandas_on_spark(index_col='index')
           id
    index
    6       6
    7       7
    8       8
    9       9
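
``index_col`` can also be given a list of column names to keep a multi-level index.
Below is a minimal, hypothetical sketch under the assumption that ``index_col`` accepts
a list on both sides of the conversion:

.. code-block:: python

    >>> kdf = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'val': [5, 6]})
    >>> kdf = kdf.set_index(['x', 'y'])
    >>> # Both index levels survive the round trip through Spark.
    >>> kdf.to_spark(index_col=['x', 'y']).to_pandas_on_spark(index_col=['x', 'y'])
         val
    x y
    1 3    5
    2 4    6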