spark-instrumented-optimizer

History

Anton Yanchenko bda5b51576 [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` ## What changes were proposed in this pull request? Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`. ## How was this patch tested? Doctests / unittests / manual tests. Unpatched version: ``` In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() Out[23]: [Row(x=None)] ``` Patched: ``` In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-5-c1740fcadbf9> in <module> ----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 689 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 690 else: --> 691 rdd, schema = self._createFromLocal(map(prepare, data), schema) 692 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 693 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 405 # make sure data could consumed multiple times 406 if not isinstance(data, list): --> 407 data = list(data) 408 409 if schema is None or isinstance(schema, (list, tuple)): /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj) 671 672 def prepare(obj): --> 673 verify_func(obj) 674 return obj 675 elif isinstance(schema, DataType): /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj) 1397 if isinstance(obj, dict): 1398 for f, verifier in verifiers: -> 1399 verifier(obj.get(f)) 1400 elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): 1401 # the order in obj could be different than dataType.fields /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj) 1356 if obj < -9223372036854775808 or obj > 9223372036854775807: 1357 raise ValueError( -> 1358 new_msg("object of LongType out of range, got: %s" % obj)) 1359 1360 verify_value = verify_long ValueError: field x: object of LongType out of range, got: 18446744073709551616 ``` Closes #25117 from simplylizz/master. Authored-by: Anton Yanchenko <simplylizz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>		2019-08-08 11:47:25 +09:00
..
ml	[SPARK-23469][ML] HashingTF should use corrected MurmurHash3 implementation	2019-08-02 10:53:36 -05:00
mllib	[SPARK-28140][MLLIB][PYTHON] Accept DataFrames in RowMatrix and IndexedRowMatrix constructors	2019-07-09 16:39:21 -05:00
sql	[SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)`	2019-08-08 11:47:25 +09:00
streaming	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
testing	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
tests	[SPARK-28486][CORE][PYTHON] Map PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC	2019-08-05 20:18:53 +09:00
__init__.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
broadcast.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
cloudpickle.py	[SPARK-27000][PYTHON] Upgrades cloudpickle to v0.8.0	2019-02-28 02:33:10 +09:00
conf.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
context.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
daemon.py	[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon	2019-07-31 09:10:24 +09:00
files.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
find_spark_home.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
heapq3.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
java_gateway.py	[SPARK-27870][SQL][PYTHON] Add a runtime buffer size configuration for Pandas UDFs	2019-06-15 20:56:22 +09:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
rdd.py	[MINOR][PYTHON] Use `_memory_limit` to get worker memory conf in `rdd.py`	2019-07-27 11:58:50 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resourceinformation.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
shell.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
shuffle.py	[SPARK-25696] The storage memory displayed on spark Application UI is…	2018-12-10 18:27:01 -06:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
taskcontext.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs	2019-03-11 10:15:07 +09:00
version.py	[SPARK-25592] Setting version to 3.0.0-SNAPSHOT	2018-10-02 08:48:24 -07:00
worker.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00