spark-instrumented-optimizer

Marco Gaido ff48b1b338 [SPARK-22901][PYTHON] Add deterministic flag to pyspark UDF	2017-12-26 06:39:40 -08:00

## What changes were proposed in this pull request?

In SPARK-20586 the flag `deterministic` was added to Scala UDF, but it is not available for python UDF. This flag is useful for cases when the UDF's code can return different result with the same input. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query. This can lead to unexpected behavior.

This PR adds the deterministic flag, via the `asNondeterministic` method, to let the user mark the function as non-deterministic and therefore avoid the optimizations which might lead to strange behaviors.

## How was this patch tested?

Manual tests:

```
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> df_br = spark.createDataFrame([{'name': 'hello'}])
>>> import random
>>> udf_random_col = udf(lambda: int(100*random.random()), IntegerType()).asNondeterministic()
>>> df_br = df_br.withColumn('RAND', udf_random_col())
>>> random.seed(1234)
>>> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
>>> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
+-----+----+-------------+
| name|RAND|RAND_PLUS_TEN|
+-----+----+-------------+
|hello|   3|           13|
+-----+----+-------------+
```

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19929 from mgaido91/SPARK-22629.
ml	[SPARK-22810][ML][PYSPARK] Expose Python API for LinearRegression with huber loss.	2017-12-20 17:51:42 -08:00
mllib	[SPARK-22399][ML] update the location of reference paper	2017-10-31 08:20:23 +00:00
sql	[SPARK-22901][PYTHON] Add deterministic flag to pyspark UDF	2017-12-26 06:39:40 -08:00
streaming	[SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs	2017-10-24 12:44:47 +09:00
__init__.py	[MINOR] Fix some typo of the document	2017-06-19 20:35:58 +01:00
accumulators.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
broadcast.py	[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry	2017-08-02 07:12:23 +09:00
cloudpickle.py	[SPARK-21070][PYSPARK] Attempt to update cloudpickle again	2017-08-22 11:17:53 +09:00
conf.py	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation	2016-11-22 11:40:18 +00:00
context.py	[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas	2017-11-13 13:16:01 +09:00
daemon.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
heapq3.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
java_gateway.py	[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas	2017-11-13 13:16:01 +09:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
rdd.py	[SPARK-22409] Introduce function type argument in pandas_udf	2017-11-17 16:43:08 +01:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0	2017-12-21 20:43:56 +09:00
shell.py	[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell	2017-04-12 10:54:50 -07:00
shuffle.py	[SPARK-10710] Remove ability to disable spilling in core and SQL	2015-09-19 21:40:21 -07:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api	2016-04-12 23:06:55 -07:00
taskcontext.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
tests.py	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles	2017-09-18 13:20:11 +09:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-19505][PYTHON] AttributeError on Exception.message in Python3	2017-04-11 12:18:31 -07:00
version.py	[MINOR] Bump SparkR and PySpark version to 2.3.0.	2017-06-19 11:13:03 +01:00
worker.py	[SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone	2017-11-28 16:45:22 +08:00