266608d50e
### What changes were proposed in this pull request?

This PR introduces ArrayOps, MapOps, and StructOps to handle data-type-based operations for ArrayType, MapType, and StructType separately.

### Why are the changes needed?

StructType, ArrayType, and MapType are not accepted by DataTypeOps now. We should handle these complex types. Among them:

- ArrayType supports concatenation: for example, `ps.Series([[1, 2, 3]]) + ps.Series([[4, 5, 6]])` should work the same as `pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])`, i.e., as concatenation.
- StructOps will be helpful to make to/from pandas conversion data-type-based.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:

```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```

After the change:

```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0    [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0    [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```

### How was this patch tested?

Unit tests.

Closes #32626 from xinrong-databricks/datatypeop_complex.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
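The concatenation rules shown above (same-typed arrays concatenate, int/float arrays coerce to float, otherwise a `TypeError`) can be sketched in plain Python without a Spark session. This is a minimal illustration of the idea behind an array-type op, not the actual `ArrayOps` implementation; the class and method names here are hypothetical.

```python
# Hypothetical sketch of '+' dispatch for array-typed values, mirroring the
# behavior described in this PR: concatenation is only valid when element
# types match, with int/float treated as compatible (coerced to float).

class SketchArrayOps:
    @staticmethod
    def _element_type(values):
        # Infer a single element type; None for an empty or mixed array.
        types = {type(v) for v in values}
        return types.pop() if len(types) == 1 else None

    @classmethod
    def add(cls, left, right):
        lt, rt = cls._element_type(left), cls._element_type(right)
        if lt is not None and rt is not None and lt is not rt:
            if {lt, rt} <= {int, float}:
                # Numeric coercion, as in the [1, 2, 3] + [0.4, 0.5] example.
                return [float(v) for v in left + right]
            raise TypeError(
                "Concatenation can only be applied to arrays of the same type"
            )
        # Same element type: plain list concatenation.
        return left + right


print(SketchArrayOps.add([1, 2, 3], [4, 5]))      # [1, 2, 3, 4, 5]
print(SketchArrayOps.add([1, 2, 3], [0.4, 0.5]))  # [1.0, 2.0, 3.0, 0.4, 0.5]
```

Calling `SketchArrayOps.add([1, 2, 3], ['x'])` raises the same `TypeError` message shown in the "After the change" example.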