spark-instrumented-optimizer


itholic 3a18864c5f [SPARK-35809][PYTHON] Add `index_col` argument for ps.sql

### What changes were proposed in this pull request?

This PR proposes adding an `index_col` argument to the `ps.sql` function, so that users can preserve the index when they want to.

Note that `reset_index()` must be called before using `ps.sql` with `index_col`.

```python
>>> psdf
   A  B
a  1  4
b  2  5
c  3  6
>>> psdf_reset_index = psdf.reset_index()
>>> ps.sql("SELECT * from {psdf_reset_index} WHERE A > 1", index_col="index")
       A  B
index
b      2  5
c      3  6
```

Otherwise, the index is always lost.

```python
>>> ps.sql("SELECT * from {psdf} WHERE A > 1")
   A  B
0  2  5
1  3  6
```

### Why are the changes needed?

The index is one of the key objects for existing pandas users, so we should provide a way to keep the index after running `ps.sql`.

### Does this PR introduce _any_ user-facing change?

Yes, the new argument is added.

### How was this patch tested?

Added a unit test and manually checked that the build passes.

Closes #33450 from itholic/SPARK-35809.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit `6578f0b135`)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2021-07-22 17:08:42 +09:00
..
data_type_ops	[SPARK-36167][PYTHON][3.2] Revisit more InternalField managements	2021-07-20 09:30:35 +09:00
indexes	[SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex	2021-07-22 17:06:25 +09:00
missing	[SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex	2021-07-22 17:06:25 +09:00
plot	[SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark	2021-06-28 19:03:42 -07:00
spark	[SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark	2021-06-29 10:52:24 -07:00
tests	[SPARK-35809][PYTHON] Add `index_col` argument for ps.sql	2021-07-22 17:08:42 +09:00
typedef	[SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs	2021-07-16 11:41:53 +09:00
usage_logging	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
__init__.py	[SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package	2021-07-22 14:21:53 +09:00
_typing.py	[SPARK-35944][PYTHON] Introduce Name and Label type aliases	2021-07-01 09:40:07 +09:00
accessors.py	[SPARK-35944][PYTHON] Introduce Name and Label type aliases	2021-07-01 09:40:07 +09:00
base.py	[SPARK-35615][PYTHON] Make unary and comparison operators data-type-based	2021-07-07 13:47:04 -07:00
categorical.py	[SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex	2021-07-22 17:06:25 +09:00
config.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
datetimes.py	[SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor	2021-06-01 10:33:10 +09:00
exceptions.py	[SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module	2021-05-21 11:03:35 -07:00
extensions.py	[SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark	2021-06-29 10:52:24 -07:00
frame.py	[SPARK-36167][PYTHON][3.2] Revisit more InternalField managements	2021-07-20 09:30:35 +09:00
generic.py	[SPARK-35806][PYTHON] Mapping the `mode` argument to pandas in DataFrame.to_csv	2021-07-19 19:58:19 +09:00
groupby.py	[SPARK-35944][PYTHON] Introduce Name and Label type aliases	2021-07-01 09:40:07 +09:00
indexing.py	[SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs	2021-07-16 11:41:53 +09:00
internal.py	[SPARK-36167][PYTHON][3.2] Revisit more InternalField managements	2021-07-20 09:30:35 +09:00
ml.py	[SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs	2021-07-16 11:41:53 +09:00
mlflow.py	[SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs	2021-07-16 11:41:53 +09:00
namespace.py	[SPARK-35810][PYTHON] Deprecate ps.broadcast API	2021-07-19 10:45:16 +09:00
numpy_compat.py	[SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark	2021-06-28 19:03:42 -07:00
series.py	[SPARK-36167][PYTHON][3.2] Revisit more InternalField managements	2021-07-20 09:30:35 +09:00
sql_processor.py	[SPARK-35809][PYTHON] Add `index_col` argument for ps.sql	2021-07-22 17:08:42 +09:00
strings.py	[SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings	2021-06-15 11:17:56 +09:00
utils.py	[SPARK-35806][PYTHON] Mapping the `mode` argument to pandas in DataFrame.to_csv	2021-07-19 19:58:19 +09:00
window.py	[SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark	2021-06-29 10:52:24 -07:00