spark-instrumented-optimizer

hyukjinkwon ebc124d4c4 [SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition	2017-07-11 22:03:10 +08:00

## What changes were proposed in this pull request?

This PR deals with four points as below:

- Reuse existing DDL parser APIs rather than reimplementing within PySpark
- Support DDL formatted string, `field type, field type`.
- Support case-insensitivity for parsing.
- Support nested data types as below:

Before

```
>>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
...
ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
```

```
>>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
...
ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
```

```
>>> spark.createDataFrame([[1]], "a int").show()
...
ValueError: Could not parse datatype: a int
```

After

```
>>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
+---+
|  a|
+---+
|[1]|
+---+
```

```
>>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
+---+
|  a|
+---+
|[1]|
+---+
```

```
>>> spark.createDataFrame([[1]], "a int").show()
+---+
|  a|
+---+
|  1|
+---+
```

## How was this patch tested?

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18590 from HyukjinKwon/deduplicate-python-ddl.
..
__init__.py	[SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes"	2016-08-06 05:02:59 +01:00
catalog.py	[SPARK-18777][PYTHON][SQL] Return UDF from udf.register	2017-05-06 22:28:42 -07:00
column.py	[SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe	2017-05-01 09:43:32 -07:00
conf.py	[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code	2016-05-23 18:14:48 -07:00
context.py	[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs	2017-07-05 10:59:10 -07:00
dataframe.py	[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas	2017-07-10 15:21:03 -07:00
functions.py	[SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition	2017-07-11 22:03:10 +08:00
group.py	[MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation	2016-07-06 10:45:51 -07:00
readwriter.py	[SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader	2017-06-24 11:39:41 +08:00
session.py	[SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema verification and improve exception message	2017-07-04 20:45:58 +08:00
streaming.py	[SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader	2017-06-24 11:39:41 +08:00
tests.py	[SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition	2017-07-11 22:03:10 +08:00
types.py	[SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition	2017-07-11 22:03:10 +08:00
utils.py	[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo	2017-01-04 15:07:29 +00:00
window.py	[SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames	2016-12-02 17:39:28 -08:00