spark-instrumented-optimizer/python/docs/source/user_guide/ps_faq.rst
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation

===
FAQ
===

What's the project's status?
----------------------------

Koalas 1.0.0 was released, and it is much more stable now.
You might still face the following differences:

- Most pandas-equivalent APIs are implemented, but some may still be missing.
  Please create a GitHub issue if your favorite function is not yet supported.
  We also document all APIs that are not yet supported in the `missing directory <https://github.com/databricks/koalas/tree/master/databricks/koalas/missing>`_.
- Some behaviors may differ, in particular in the treatment of nulls: pandas uses
  the special constant Not a Number (NaN) to indicate missing values, while Spark keeps
  a special flag on each value to indicate whether it is missing. We would love to hear
  from you if you come across any discrepancies.
- Because Spark is lazy by nature, some operations, such as creating new columns, are
  only performed when Spark needs to print or write the dataframe.
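
To illustrate the null-handling difference described above, here is a minimal pandas-only sketch (no Spark required): pandas coerces a missing value in a numeric column to the floating-point NaN constant, whereas Spark tracks nulls with a per-value flag that is independent of the column type.

```python
import math

import pandas as pd

# pandas stores a missing value in a numeric column as the float NaN,
# so inserting None silently coerces it to NaN.
s = pd.Series([1.0, None, 3.0])

print(s.isna().tolist())   # [False, True, False]
print(math.isnan(s[1]))    # True: None became float('nan')

# In Spark, by contrast, a null in a numeric column remains a distinct
# null marker rather than being folded into NaN, which is one source of
# the behavioral discrepancies mentioned above.
```
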

Is it Koalas or koalas?
-----------------------

It's Koalas. Unlike pandas, we use upper case here.

Should I use PySpark's DataFrame API or Koalas?
-----------------------------------------------

If you are already familiar with pandas and want to leverage Spark for big data, we recommend
using Koalas. If you are learning Spark from the ground up, we recommend you start with PySpark's API.
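
The recommendation above rests on API familiarity: Koalas deliberately mirrors pandas idioms, so pandas-style code like the following sketch is intended to carry over when ``df`` is a Koalas DataFrame instead (with the Koalas import in place of pandas; the snippet itself uses plain pandas so it runs without Spark).

```python
import pandas as pd

# Typical pandas-style transformations; Koalas exposes the same methods,
# so this workflow is meant to look identical on a Koalas DataFrame.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
df["z"] = df["x"] + df["y"]

print(df["z"].tolist())                 # [11, 22, 33]
print(df.describe().loc["mean", "x"])   # 2.0
```
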

Does Koalas support Structured Streaming?
-----------------------------------------

No, Koalas does not officially support Structured Streaming.
As a workaround, you can use Koalas APIs with ``foreachBatch`` in Structured Streaming, which allows batch APIs:

.. code-block:: python

   >>> import databricks.koalas as ks  # conventional Koalas import alias
   >>> def func(batch_df, batch_id):
   ...     koalas_df = ks.DataFrame(batch_df)
   ...     koalas_df['a'] = 1
   ...     print(koalas_df)
   >>> spark.readStream.format("rate").load().writeStream.foreachBatch(func).start()
                    timestamp  value  a
   0  2020-02-21 09:49:37.574      4  1
                    timestamp  value  a
   0  2020-02-21 09:49:38.574      5  1
   ...

How can I request support for a method?
---------------------------------------

File a GitHub issue: https://github.com/databricks/koalas/issues
Databricks customers are also welcome to file a support ticket to request a new feature.

How is Koalas different from Dask?
----------------------------------

Different projects have different focuses. Spark is already deployed in virtually every
organization, and often is the primary interface to the massive amount of data stored in data lakes.
Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data
scientists.

How can I contribute to Koalas?
-------------------------------

See `Contributing Guide <https://koalas.readthedocs.io/en/latest/development/contributing.html>`_.

Why a new project (instead of putting this in Apache Spark itself)?
-------------------------------------------------------------------

Two reasons:

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of
   making a release as a separate project is minuscule (on the order of minutes). A release
   of Spark takes a lot longer (on the order of days).
2. Koalas takes a different approach that might contradict Spark's API design principles, and
   those principles cannot be changed lightly given the large user base of Spark. A new,
   separate project provides an opportunity for us to experiment with new design principles.