spark-instrumented-optimizer/python/pyspark/sql
Bryan Cutler 0812d6c17c [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures
### What changes were proposed in this pull request?

This improves error handling when a failure occurs during conversion from Pandas to Arrow, and fixes tests to be compatible with the upcoming Arrow 2.0.0 release.

### Why are the changes needed?

Current tests will fail with Arrow 2.0.0 because of a change in the error message produced when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to match a message that is shown by both past and upcoming Arrow versions.
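
For reference, a minimal standalone pyarrow sketch (illustrative only, not code from this PR) of the kind of lossy conversion the safe-conversion config guards against; the error text in the comment is an example and varies by pyarrow version:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series([1.5, 2.0])

# With safe=True (the default), pyarrow rejects lossy conversions such as
# float-to-int truncation by raising ArrowInvalid.
try:
    pa.Array.from_pandas(s, type=pa.int32(), safe=True)
except pa.lib.ArrowInvalid as e:
    print(e)  # e.g. "Float value 1.5 was truncated converting to int32"

# With safe=False the truncation is allowed and the values are cast.
print(pa.Array.from_pandas(s, type=pa.int32(), safe=False))  # -> [1, 2]
```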

If the user supplies an invalid schema, the error produced by pyarrow is not consistent: it is either a `TypeError` or an `ArrowInvalid`, with the latter previously being caught and re-raised as a `RuntimeError` with the extra info.
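
For context, a small standalone snippet (not part of this PR) illustrating the two exception families pyarrow may raise for a data/schema mismatch; which one you get depends on the pyarrow version and the kind of mismatch:

```python
import pandas as pd
import pyarrow as pa

# ArrowInvalid is a ValueError subclass, ArrowTypeError a TypeError subclass.
print(issubclass(pa.lib.ArrowInvalid, ValueError))   # True
print(issubclass(pa.lib.ArrowTypeError, TypeError))  # True

# A string series declared as int64 fails; depending on the pyarrow version
# this surfaces as one of the two exception types above.
try:
    pa.Array.from_pandas(pd.Series(["a", "b"]), type=pa.int64())
except (ValueError, TypeError) as e:
    print(type(e).__name__, e)
```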

The error handling is improved by:

- Narrowing the caught exception type to `ValueError`, of which `ArrowInvalid` is a subclass and which is what is raised on safe conversion failures.
- The additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" is only added if that config is enabled in the first place.
- The original exception is chained so it is still shown to the user (see the sketch after this list).
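
A minimal sketch of the resulting pattern, assuming a helper like the serializer's array conversion; names such as `create_array` and `safecheck`, and the message text, are illustrative rather than the exact Spark code:

```python
import pyarrow as pa

def create_array(series, arrow_type, safecheck):
    # Convert a pandas Series to a pyarrow Array; `safecheck` stands in for
    # "spark.sql.execution.pandas.convertToArrowArraySafely".
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
    except ValueError as e:
        if safecheck:
            # Only mention the config when it could actually be the cause,
            # and chain the original error so its message stays visible.
            raise ValueError(
                "Conversion failed with safe checks enabled; disabling "
                "spark.sql.execution.pandas.convertToArrowArraySafely may "
                "allow lossy conversions: %s" % e
            ) from e
        # Safe checks were off, so the hint would be misleading; re-raise as is.
        raise
```

Chaining with `raise ... from e` keeps the original pyarrow traceback attached, so the user still sees the underlying Arrow message alongside the configuration hint.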

### Does this PR introduce _any_ user-facing change?

Yes, the re-raised error changes from a `RuntimeError` to a `ValueError`, which better categorizes this type of error and is in line with the original Arrow error.

### How was this patch tested?

Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot

Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 18:11:24 +09:00
avro [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
pandas [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures 2020-10-06 18:11:24 +09:00
tests [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures 2020-10-06 18:11:24 +09:00
__init__.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
__init__.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
_typing.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
catalog.py [SPARK-31000][PYTHON][SQL] Add ability to set table description via Catalog.createTable() 2020-08-25 13:42:31 +09:00
catalog.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
column.py [SPARK-32835][PYTHON] Add withField method to the pyspark Column class 2020-09-16 20:18:36 +09:00
column.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
conf.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
conf.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
context.py [SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate 2020-09-16 10:13:47 -07:00
context.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
dataframe.py [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName 2020-09-21 09:39:34 +09:00
dataframe.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
functions.py [MINOR][DOCS] Document when current_date and current_timestamp are evaluated 2020-09-29 05:20:12 +00:00
functions.pyi [SPARK-33020][PYTHON] Add nth_value as a PySpark function 2020-09-28 22:14:28 -07:00
group.py [SPARK-32719][PYTHON] Add Flake8 check missing imports 2020-08-31 11:23:31 +09:00
group.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
readwriter.py [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV 2020-09-16 20:16:15 +09:00
readwriter.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
session.py [MINOR][PYTHON] Fix typo in a docsting of RDD.toDF 2020-08-26 10:34:49 -07:00
session.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
streaming.py [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods 2020-09-23 09:28:33 +09:00
streaming.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
types.py [SPARK-32814][PYTHON] Replace __metaclass__ field with metaclass keyword 2020-09-16 20:22:11 +09:00
types.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
udf.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
udf.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
utils.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
utils.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
window.py [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE 2020-01-13 22:55:19 +08:00
window.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00