spark-instrumented-optimizer

Liang-Chi Hsieh 1f02871489 [SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate		2020-04-06 09:36:20 +09:00

### What changes were proposed in this pull request?

This patch stops predicates on PythonUDFs from being pushed down through Aggregate.

### Why are the changes needed?

Predicates on PythonUDFs cannot be pushed down through Aggregate. A pushed-down predicate cannot be evaluated, because PythonUDFs cannot be evaluated in a Filter, and it causes an error like:

```
Caused by: java.lang.UnsupportedOperationException: Cannot generate code for expression: mean(input[1, struct<bar:bigint>, true].bar)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:304)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:303)
  at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:52)
  at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141)
  at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:821)
  at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
  at scala.Option.getOrElse(Option.scala:189)
```

### Does this PR introduce any user-facing change?

Yes. Previously, predicates on PythonUDFs were pushed down through Aggregate, which could cause an error. After this change, such queries work.

### How was this patch tested?

Unit test.

Closes #28089 from viirya/SPARK-30921.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
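
The rule described in the commit message above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's Catalyst code: the `Predicate` class, the `split_pushable` function, and the `uses_python_udf` flag are hypothetical names invented for illustration, and the real optimizer applies further conditions (for example, determinism) that are left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for a filter condition above an Aggregate."""
    text: str                      # human-readable form of the condition
    references: frozenset          # column names the condition reads
    uses_python_udf: bool = False  # True if the condition calls a Python UDF


def split_pushable(predicates, grouping_columns):
    """Split predicates into (pushed below the Aggregate, kept above it).

    In this toy model a predicate is pushed down only if it reads nothing
    but grouping columns AND does not contain a Python UDF.
    """
    grouping = frozenset(grouping_columns)
    pushed, kept = [], []
    for p in predicates:
        if p.references <= grouping and not p.uses_python_udf:
            pushed.append(p)
        else:
            kept.append(p)
    return pushed, kept


preds = [
    Predicate("a > 1", frozenset({"a"})),
    Predicate("py_udf(a) > 1", frozenset({"a"}), uses_python_udf=True),
    Predicate("total > 10", frozenset({"total"})),
]
pushed, kept = split_pushable(preds, grouping_columns=["a"])
print([p.text for p in pushed])
print([p.text for p in kept])
```

In this sketch, the predicate that calls a Python UDF stays above the Aggregate even though it reads only a grouping column, which mirrors the behavior the commit describes.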
..
avro	[SPARK-27506][SQL][FOLLOWUP] Use option `avroSchema` to specify an evolved schema in `from_avro`	2019-12-30 18:14:21 +09:00
pandas	[SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas	2020-03-29 13:59:18 +09:00
tests	[SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate	2020-04-06 09:36:20 +09:00
__init__.py	[SPARK-31088][SQL] Add back HiveContext and createExternalTable	2020-03-26 23:51:15 -07:00
catalog.py	[SPARK-31088][SQL] Add back HiveContext and createExternalTable	2020-03-26 23:51:15 -07:00
column.py	[SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation	2020-02-18 16:46:45 +09:00
conf.py	[SPARK-23698][PYTHON] Resolve undefined names in Python 3	2018-08-22 10:06:59 -07:00
context.py	[SPARK-31088][SQL] Add back HiveContext and createExternalTable	2020-03-26 23:51:15 -07:00
dataframe.py	[SPARK-31087] [SQL] Add Back Multiple Removed APIs	2020-03-28 22:05:16 -07:00
functions.py	[SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound	2020-03-31 15:16:17 +09:00
group.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
readwriter.py	[SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp	2020-03-30 12:20:11 +08:00
session.py	[SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted	2020-02-20 12:21:24 +09:00
streaming.py	[SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp	2020-03-30 12:20:11 +08:00
types.py	[SPARK-30941][PYSPARK] Add a note to asDict to document its behavior when there are duplicate fields	2020-03-09 11:06:45 -07:00
udf.py	[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints	2020-02-12 10:49:46 +09:00
utils.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
window.py	[SPARK-30188][SQL] Resolve the failed unit tests when enable AQE	2020-01-13 22:55:19 +08:00