The largest amount of work consists simply of implementing the pandas API using Spark's built-in functions, which is usually straightforward (a sketch of this pattern follows the list below). But there are many other forms of contribution in addition to writing code:
1. Use the project and provide feedback, by creating new tickets or commenting on existing relevant tickets.
2. Read and understand the `Design Principles <design.rst>`_ for the project. Contributions should follow these principles.
3. Signal your work: if you are working on something, comment on the relevant ticket so that multiple people do not take on the same work at the same time. It is also good practice to signal when your work has stalled or you have moved on and want somebody else to take over.
4. Understand how the functionality works in pandas or in Spark.
5. Implement the functionality, with test cases providing close to 100% statement coverage, and document it.
6. Run existing and new test cases to make sure they still pass. Also run the `dev/reformat` script to reformat Python files using `Black <https://github.com/psf/black>`_, and run the linter `dev/lint-python`.
7. Build the docs (`make html` in the `docs` directory) and verify that the docs related to your change look OK.
8. Submit a pull request, and be responsive to code review feedback from other community members.
That's it. Your contribution, once merged, will be available in the next release.
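To give a feel for the implementation step, below is a minimal, hypothetical sketch of the common pattern: a pandas-style method implemented by delegating to a Spark built-in function. The `SimpleSeries` class and all names in it are illustrative only, not the real pandas-on-Spark internals.

.. code-block:: python

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    class SimpleSeries:
        """Illustrative wrapper of a single-column Spark DataFrame."""

        def __init__(self, sdf, column):
            self._sdf = sdf
            self._column = column

        def abs(self):
            # pandas' Series.abs maps directly onto Spark's built-in abs().
            new_sdf = self._sdf.select(
                F.abs(F.col(self._column)).alias(self._column))
            return SimpleSeries(new_sdf, self._column)

        def to_list(self):
            return [row[self._column] for row in self._sdf.collect()]

    sdf = spark.createDataFrame([(-1,), (2,), (-3,)], ["x"])
    assert SimpleSeries(sdf, "x").abs().to_list() == [1, 2, 3]

Many pandas APIs follow this shape: translate the pandas semantics into an expression over Spark columns, then wrap the result back in the pandas-style object.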
Note that `-k` is used for simplicity, although it actually takes an expression. You can use `--verbose` to check what to filter. See `pytest --help` for more details.
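As a hedged illustration of what a `-k` expression can look like, the snippet below invokes pytest programmatically; the test names in the expression are hypothetical.

.. code-block:: python

    import pytest

    # Equivalent to `pytest -k "rsub and not slow" --verbose` on the
    # command line: runs only tests whose names match the boolean
    # expression. The names used here are hypothetical.
    pytest.main(["-k", "rsub and not slow", "--verbose"])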
Building Documentation
======================
To build documentation via Sphinx:
.. code-block:: bash

    cd docs && make clean html
This generates HTML files under the `docs/build/html` directory. Open `docs/build/html/index.html` to check whether the documentation was built properly.
Coding Conventions
==================
We follow `PEP 8 <https://www.python.org/dev/peps/pep-0008/>`_ with one exception: lines can be up to 100 characters in length, not 79.
In general, doctests should be grouped logically by separating them with blank lines.
For instance, the first block contains the preparation statements, the second block uses the function with a specific argument,
and the third block uses another argument. As an example, please refer to `DataFrame.rsub <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rsub.html#pandas.DataFrame.rsub>`_ in pandas.
These blocks should be consistently separated in pandas-on-Spark doctests, and more doctests should be added if doctest coverage or the number of examples shown is insufficient, even if they differ from pandas'.
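For illustration, a hypothetical docstring following this grouping convention might look like the one below; the function is a stand-in, and the blocks mirror the preparation/argument structure described above.

.. code-block:: python

    def rsub_doc_example():
        """
        Hypothetical docstring showing the grouping convention:
        a preparation block, then one block per argument variant.

        >>> import pandas as pd
        >>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

        >>> df.rsub(10)
           a  b
        0  9  7
        1  8  6

        >>> df.rsub([10, 20])
           a   b
        0  9  17
        1  8  16
        """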
5. Verify that the uploaded package can be installed and executed. One unofficial tip is to run the doctests of pandas API on Spark within a Python interpreter after installing it.
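For instance, one hypothetical way to run such doctests from an interpreter is sketched below; the chosen module is illustrative, and any installed module containing doctest examples would do.

.. code-block:: python

    import doctest
    import pyspark.pandas as ps  # illustrative choice of module

    # Execute the >>> examples embedded in the module's docstrings.
    # `attempted` should be greater than zero for a meaningful check.
    results = doctest.testmod(ps, verbose=False)
    print(results)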