====================
Options and settings
====================

.. currentmodule:: pyspark.pandas

Koalas has an options system that lets you customize some aspects of its behaviour,
display-related options being those the user is most likely to adjust.

Options have a full "dotted-style", case-insensitive name (e.g. ``display.max_rows``).
You can get/set options directly as attributes of the top-level ``options`` attribute:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.options.display.max_rows
    1000
    >>> ks.options.display.max_rows = 10
    >>> ks.options.display.max_rows
    10

The API is composed of 3 relevant functions, available directly from the ``pyspark.pandas``
namespace:

* :func:`get_option` / :func:`set_option` - get/set the value of a single option.
* :func:`reset_option` - reset one or more options to their default value.

**Note:** Developers can check out `pyspark.pandas/config.py <https://github.com/databricks/koalas/blob/master/databricks/koalas/config.py>`_ for more information.

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.get_option("display.max_rows")
    1000
    >>> ks.set_option("display.max_rows", 101)
    >>> ks.get_option("display.max_rows")
    101

Getting and setting options
---------------------------

As described above, :func:`get_option` and :func:`set_option`
are available from the ``pyspark.pandas`` namespace. To change an option, call
``set_option('option name', new_value)``.

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.get_option('compute.max_rows')
    1000
    >>> ks.set_option('compute.max_rows', 2000)
    >>> ks.get_option('compute.max_rows')
    2000

All options also have a default value, and you can use ``reset_option`` to revert a
changed option back to its default:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.reset_option("display.max_rows")

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.get_option("display.max_rows")
    1000
    >>> ks.set_option("display.max_rows", 999)
    >>> ks.get_option("display.max_rows")
    999
    >>> ks.reset_option("display.max_rows")
    >>> ks.get_option("display.max_rows")
    1000

The ``option_context`` context manager is exposed through
the top-level API, allowing you to execute code with given option values. Option values
are restored automatically when you exit the ``with`` block:

.. code-block:: python

    >>> with ks.option_context("display.max_rows", 10, "compute.max_rows", 5):
    ...     print(ks.get_option("display.max_rows"))
    ...     print(ks.get_option("compute.max_rows"))
    10
    5
    >>> print(ks.get_option("display.max_rows"))
    1000
    >>> print(ks.get_option("compute.max_rows"))
    1000

Operations on different DataFrames
----------------------------------

Koalas disallows operations on different DataFrames (or Series) by default to prevent expensive
operations: it internally performs a join operation, which can be expensive in general.

Such operations can be allowed by setting `compute.ops_on_diff_frames` to `True`.
See the examples below.

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.ops_on_diff_frames', True)
    >>> kdf1 = ks.range(5)
    >>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
    >>> (kdf1 - kdf2).sort_index()
        id
    0 -5.0
    1 -3.0
    2 -1.0
    3  NaN
    4  NaN
    >>> ks.reset_option('compute.ops_on_diff_frames')

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.ops_on_diff_frames', True)
    >>> kdf = ks.range(5)
    >>> kser_a = ks.Series([1, 2, 3, 4])
    >>> # 'kser_a' is not from the 'kdf' DataFrame, so it is considered a Series not from 'kdf'.
    >>> kdf['new_col'] = kser_a
    >>> kdf
       id  new_col
    0   0      1.0
    1   1      2.0
    3   3      4.0
    2   2      3.0
    4   4      NaN
    >>> ks.reset_option('compute.ops_on_diff_frames')

Default Index type
------------------

In Koalas, the default index is used in several cases, for instance,
when a Spark DataFrame is converted into a Koalas DataFrame. In this case, Koalas internally
attaches a default index to the Koalas DataFrame.

There are several types of default index that can be configured by
`compute.default_index_type`, as described below:

**sequence**: implements a sequence that increases one by one, using PySpark's Window function
without specifying a partition. Therefore, it can end up with the whole dataset in a single
partition on a single node. This index type should be avoided when the data is large.
This is the default. See the example below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.default_index_type', 'sequence')
    >>> kdf = ks.range(3)
    >>> ks.reset_option('compute.default_index_type')
    >>> kdf.index
    Int64Index([0, 1, 2], dtype='int64')

This is conceptually equivalent to the PySpark example below:

.. code-block:: python

    >>> from pyspark.sql import functions as F, Window
    >>> import pyspark.pandas as ks
    >>> spark_df = ks.range(3).to_spark()
    >>> sequential_index = F.row_number().over(
    ...     Window.orderBy(F.monotonically_increasing_id().asc())) - 1
    >>> spark_df.select(sequential_index).rdd.map(lambda r: r[0]).collect()
    [0, 1, 2]

**distributed-sequence**: implements a sequence that increases one by one, using a group-by and
group-map approach in a distributed manner. It still generates the sequential index globally.
If the default index must be a sequence in a large dataset, this index has to be used.
Note that if more data are added to the data source after this index is created,
the index is not guaranteed to remain sequential. See the example below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.default_index_type', 'distributed-sequence')
    >>> kdf = ks.range(3)
    >>> ks.reset_option('compute.default_index_type')
    >>> kdf.index
    Int64Index([0, 1, 2], dtype='int64')

This is conceptually equivalent to the PySpark example below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> spark_df = ks.range(3).to_spark()
    >>> spark_df.rdd.zipWithIndex().map(lambda p: p[1]).collect()
    [0, 1, 2]

**distributed**: implements a monotonically increasing sequence simply by using
PySpark's `monotonically_increasing_id` function in a fully distributed manner. The
values are non-deterministic. If the index does not have to be a sequence that increases
one by one, this index should be used. Performance-wise, this index has almost no
penalty compared to the other index types. See the example below:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.default_index_type', 'distributed')
    >>> kdf = ks.range(3)
    >>> ks.reset_option('compute.default_index_type')
    >>> kdf.index
    Int64Index([25769803776, 60129542144, 94489280512], dtype='int64')

This is conceptually equivalent to the PySpark example below:

.. code-block:: python

    >>> from pyspark.sql import functions as F
    >>> import pyspark.pandas as ks
    >>> spark_df = ks.range(3).to_spark()
    >>> spark_df.select(F.monotonically_increasing_id()) \
    ...     .rdd.map(lambda r: r[0]).collect()
    [25769803776, 60129542144, 94489280512]

.. warning::
    This type of index is very unlikely to be suitable for computing on two
    different dataframes because it is not guaranteed to produce the same indexes in both.
    If you use this default index and turn on `compute.ops_on_diff_frames`, operations
    between two different DataFrames will likely produce unexpected output due to the
    non-deterministic index values.
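
As a minimal sketch of this pitfall: the index values are non-deterministic, so which rows
align (and therefore the output) can differ from run to run, and no output is shown here:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.default_index_type', 'distributed')
    >>> ks.set_option('compute.ops_on_diff_frames', True)
    >>> kdf1 = ks.range(3)
    >>> kdf2 = ks.range(3)
    >>> # Each frame gets its own non-deterministic default index, so this
    >>> # subtraction may align few or even none of the rows, yielding NaNs.
    >>> (kdf1 - kdf2).sort_index()  # non-deterministic result
    >>> ks.reset_option('compute.default_index_type')
    >>> ks.reset_option('compute.ops_on_diff_frames')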

Available options
-----------------

=============================== ============== =====================================================
Option                          Default        Description
=============================== ============== =====================================================
display.max_rows                1000           This sets the maximum number of rows Koalas should
                                               output when printing various output. For example,
                                               this value determines the number of rows shown in
                                               the repr() of a dataframe. Set `None` for an
                                               unlimited input length. Default is 1000.
compute.max_rows                1000           'compute.max_rows' sets the limit of the current
                                               Koalas DataFrame. Set `None` for an unlimited input
                                               length. When the limit is set, the operation is
                                               executed by a shortcut that collects the data into
                                               the driver and then uses the pandas API. If the
                                               limit is unset, the operation is executed by
                                               PySpark. Default is 1000.
compute.shortcut_limit          1000           'compute.shortcut_limit' sets the limit for a
                                               shortcut. It computes the specified number of rows
                                               and uses its schema. When the dataframe length is
                                               larger than this limit, Koalas uses PySpark to
                                               compute.
compute.ops_on_diff_frames      False          This determines whether or not to allow operations
                                               between two different dataframes. For example, the
                                               'combine_frames' function internally performs a
                                               join operation, which can be expensive in general.
                                               So, if `compute.ops_on_diff_frames` is not True,
                                               that method throws an exception.
compute.default_index_type      'sequence'     This sets the default index type: sequence,
                                               distributed and distributed-sequence.
compute.ordered_head            False          'compute.ordered_head' sets whether or not to
                                               perform head with natural ordering. Koalas does
                                               not guarantee the row ordering, so `head` could
                                               return rows from distributed partitions. If
                                               'compute.ordered_head' is set to True, Koalas
                                               performs natural ordering beforehand, but it
                                               causes a performance overhead.
plotting.max_rows               1000           'plotting.max_rows' sets the visual limit for
                                               top-n-based plots such as `plot.bar` and
                                               `plot.pie`. If it is set to 1000, the first 1000
                                               data points will be used for plotting. Default is
                                               1000.
plotting.sample_ratio           None           'plotting.sample_ratio' sets the proportion of
                                               data that will be plotted for sample-based plots
                                               such as `plot.line` and `plot.area`. This option
                                               defaults to the 'plotting.max_rows' option.
plotting.backend                'plotly'       Backend to use for plotting. Default is plotly.
                                               Supports any package that has a top-level `.plot`
                                               method. Known options are: [matplotlib, plotly].
=============================== ============== =====================================================
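
All of these options are managed with the same ``get_option``/``set_option``/``reset_option``
API shown above. As a quick sketch, switching the plotting backend listed in the table:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.get_option('plotting.backend')
    'plotly'
    >>> # Use matplotlib instead of the default plotly backend.
    >>> ks.set_option('plotting.backend', 'matplotlib')
    >>> ks.get_option('plotting.backend')
    'matplotlib'
    >>> ks.reset_option('plotting.backend')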