===
FAQ
===

What's the project's status?
----------------------------

Koalas 1.0.0 has been released, and it is much more stable now.
You might still encounter the following differences:

- Most pandas-equivalent APIs are implemented, but some may still be missing.
  Please create a GitHub issue if your favorite function is not yet supported.
  We also document all APIs that are not yet supported in the `missing directory <https://github.com/pyspark.pandas/tree/master/databricks/koalas/missing>`_.

- Some behaviors may be different, in particular in the treatment of nulls: pandas uses
  the special Not a Number (NaN) constant to indicate missing values, while Spark has a
  special flag on each value to indicate missing values. We would love to hear from you
  if you come across any discrepancies.

- Because Spark is lazy in nature, some operations, such as creating new columns, only get
  performed when Spark needs to print or write the dataframe.

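To make the last point concrete, here is a minimal, stdlib-only Python sketch of deferred (lazy) evaluation. This is not the Koalas or Spark implementation; ``LazyFrame``, ``assign``, and ``collect`` are made-up names for illustration only. Adding a column merely records the operation, and the work happens when output is requested.

```python
# Illustrative sketch of lazy evaluation (NOT the Koalas/Spark
# implementation). "LazyFrame", "assign", and "collect" are
# hypothetical names.

class LazyFrame:
    def __init__(self, rows):
        self._rows = rows          # list of dicts, one per row
        self._pending = []         # queued column assignments

    def assign(self, name, value):
        # Record the new column; do not compute anything yet.
        self._pending.append((name, value))
        return self

    def collect(self):
        # Only when output is needed are the queued operations applied.
        rows = [dict(r) for r in self._rows]
        for name, value in self._pending:
            for row in rows:
                row[name] = value
        return rows


df = LazyFrame([{"x": 1}, {"x": 2}])
df.assign("a", 1)         # nothing is computed here
print(df.collect())       # the queued work is performed only now
```

In the same spirit, assigning a column on a Koalas dataframe queues a Spark transformation, and nothing runs until the dataframe is printed or written.
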

Is it Koalas or koalas?
-----------------------

It's Koalas. Unlike pandas, we use upper case here.

Should I use PySpark's DataFrame API or Koalas?
-----------------------------------------------

If you are already familiar with pandas and want to leverage Spark for big data, we recommend
using Koalas. If you are learning Spark from the ground up, we recommend you start with PySpark's API.

Does Koalas support Structured Streaming?
-----------------------------------------

No, Koalas does not officially support Structured Streaming.

As a workaround, you can use Koalas APIs with ``foreachBatch`` in Structured Streaming, which allows batch APIs:

.. code-block:: python

    >>> import databricks.koalas as ks

    >>> def func(batch_df, batch_id):
    ...     koalas_df = ks.DataFrame(batch_df)
    ...     koalas_df['a'] = 1
    ...     print(koalas_df)

    >>> spark.readStream.format("rate").load().writeStream.foreachBatch(func).start()
                     timestamp  value  a
    0 2020-02-21 09:49:37.574      4  1
                     timestamp  value  a
    0 2020-02-21 09:49:38.574      5  1
    ...

How can I request support for a method?
---------------------------------------

File a GitHub issue: https://github.com/pyspark.pandas/issues

Databricks customers are also welcome to file a support ticket to request a new feature.

How is Koalas different from Dask?
----------------------------------

Different projects have different focuses. Spark is already deployed in virtually every
organization, and often is the primary interface to the massive amount of data stored in data lakes.
Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data
scientists.

How can I contribute to Koalas?
-------------------------------

See `Contributing Guide <https://koalas.readthedocs.io/en/latest/development/contributing.html>`_.

Why a new project (instead of putting this in Apache Spark itself)?
-------------------------------------------------------------------

Two reasons:

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of making a
   release as a separate project is minuscule (on the order of minutes). A release of Spark takes
   much longer (on the order of days).

2. Koalas takes a different approach that might contradict Spark's API design principles, and those
   principles cannot be changed lightly given the large user base of Spark. A new, separate project
   provides an opportunity for us to experiment with new design principles.