===
FAQ
===

What's the project's status?
----------------------------

Koalas 1.0.0 was released, and it is much more stable now.
You might still face the following differences:

- Most pandas-equivalent APIs are implemented, but some may still be missing.
  Please create a GitHub issue if your favorite function is not yet supported.
  We also document all APIs that are not yet supported in the `missing directory <https://github.com/databricks/koalas/tree/master/databricks/koalas/missing>`_.

- Some behaviors may be different, in particular in the treatment of nulls: pandas uses
  Not a Number (NaN) special constants to indicate missing values, while Spark has a
  special flag on each value to indicate missing values. We would love to hear from you
  if you come across any discrepancies.

- Because Spark is lazy in nature, some operations like creating new columns only get
  performed when Spark needs to print or write the dataframe.
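
The null-handling difference above comes down to the fact that pandas represents
missing values with the floating-point constant NaN, while Spark tracks nullness
as a flag separate from the value itself. A minimal plain-Python sketch of why
NaN is tricky (no Spark or Koalas required):

.. code-block:: python

    >>> import math
    >>> nan = float("nan")  # the special constant pandas uses for missing values
    >>> nan == nan  # NaN compares unequal to everything, even itself
    False
    >>> math.isnan(nan)  # the reliable way to test for NaN
    True

Because Spark's null is a separate flag rather than a sentinel value, comparisons
and aggregations over missing data can behave differently between pandas and Koalas.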

Is it Koalas or koalas?
-----------------------

It's Koalas. Unlike pandas, we use upper case here.

Should I use PySpark's DataFrame API or Koalas?
-----------------------------------------------

If you are already familiar with pandas and want to leverage Spark for big data, we recommend
using Koalas. If you are learning Spark from the ground up, we recommend you start with PySpark's API.

Does Koalas support Structured Streaming?
-----------------------------------------

No, Koalas does not support Structured Streaming officially.

As a workaround, you can use Koalas APIs with ``foreachBatch`` in Structured Streaming, which allows batch APIs:

.. code-block:: python

    >>> import databricks.koalas as ks

    >>> def func(batch_df, batch_id):
    ...     koalas_df = ks.DataFrame(batch_df)
    ...     koalas_df['a'] = 1
    ...     print(koalas_df)

    >>> spark.readStream.format("rate").load().writeStream.foreachBatch(func).start()
                     timestamp  value  a
    0  2020-02-21 09:49:37.574      4  1
                     timestamp  value  a
    0  2020-02-21 09:49:38.574      5  1
    ...

How can I request support for a method?
---------------------------------------

File a GitHub issue: https://github.com/databricks/koalas/issues

Databricks customers are also welcome to file a support ticket to request a new feature.

How is Koalas different from Dask?
----------------------------------

Different projects have different focuses. Spark is already deployed in virtually every
organization, and often is the primary interface to the massive amount of data stored in data lakes.
Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data
scientists.

How can I contribute to Koalas?
-------------------------------

See `Contributing Guide <https://koalas.readthedocs.io/en/latest/development/contributing.html>`_.

Why a new project (instead of putting this in Apache Spark itself)?
-------------------------------------------------------------------

Two reasons:

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of making a
   release as a separate project is minuscule (on the order of minutes). A release on Spark takes a
   lot longer (on the order of days).

2. Koalas takes a different approach that might contradict Spark's API design principles, and those
   principles cannot be changed lightly given the large user base of Spark. A new, separate project
   provides an opportunity for us to experiment with new design principles.