Implement `CategoricalIndex.map` and `DatetimeIndex.map`
`MultiIndex.map` cannot be implemented in the same way as the `map` of the other indexes; it should be handled separately if needed.
Mapping values using an input correspondence (a dict, a `Series`, or a callable) is a common operation supported by pandas, so pandas API on Spark should support it as well.
Yes, this introduces a user-facing change: `CategoricalIndex.map` and `DatetimeIndex.map` can now be used.
- CategoricalIndex.map
```py
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, dtype='category')
>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```
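For readers without a Spark session at hand, the same semantics can be checked against plain pandas, whose behavior `pyspark.pandas` mirrors (a minimal sketch using only pandas, not the new pandas-on-Spark code path):

```python
import pandas as pd

# Plain-pandas counterpart of the CategoricalIndex.map examples above.
idx = pd.CategoricalIndex(['a', 'b', 'c'])

# A callable is applied to each category.
upper = idx.map(lambda x: x.upper())
print(list(upper))  # ['A', 'B', 'C']

# A dict maps each category through key lookup.
named = idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
print(list(named))  # ['first', 'second', 'third']
```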
- DatetimeIndex.map
```py
>>> import datetime
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)
>>> mapper_dict = {
... datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
... datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)
>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)
>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
'August 10, 2020, 12:00:00 AM'],
dtype='object')
```
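As above, the dict and callable cases can be verified against plain pandas, including the `NaT` result for a date with no entry in the mapper (a sketch using only pandas):

```python
import datetime
import pandas as pd

# Plain-pandas counterpart of the DatetimeIndex.map examples above.
pidx = pd.date_range(start="2020-08-08", end="2020-08-10")

# Keys missing from the dict map to NaT, matching the output shown above.
mapper_dict = {
    datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
    datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
}
shifted = pidx.map(mapper_dict)
print(pd.isna(shifted[2]))  # True: 2020-08-10 has no mapping

# A callable is applied to each timestamp.
labels = pidx.map(lambda x: x.strftime("%B %d, %Y"))
print(list(labels))  # ['August 08, 2020', 'August 09, 2020', 'August 10, 2020']
```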
Tested with unit tests.
Closes #33756 from xinrong-databricks/other_indexes_map.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 56c211bd6a)