spark-instrumented-optimizer/python/pyspark/sql
Xiang Gao b7a40f64e6 [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python
## What changes were proposed in this pull request?
This is a reopening of https://github.com/apache/spark/pull/14198, with merge conflicts resolved.

ueshin, could you please take a look at my code?

Fix bugs for types that result in an array of nulls when creating a DataFrame using Python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f', [1, 2, 3])` and `array('d', [1, 2, 3])`. The code in spark-sql and pyspark did not take this into consideration, which could produce an array of null values when a row contains an `array('f')`.

A simple snippet to reproduce this bug:

```
from array import array

from pyspark import SparkContext
from pyspark.sql import Row, SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Both columns hold the same values; only the array typecode differs.
row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which produces the following output:

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test cases were added (see `python/pyspark/sql/tests.py`).
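For reference, a hedged sketch of what such a round-trip check could look like, using `SparkSession`; this is not the exact test added to `python/pyspark/sql/tests.py`:

```
from array import array

from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("array-typecode-roundtrip")
         .getOrCreate())

# With the fix, both the float ('f') and double ('d') arrays should survive
# a round trip through createDataFrame instead of collapsing to nulls.
row = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
df = spark.createDataFrame([row])

collected = df.first()
assert list(collected.floatarray) == [1.0, 2.0, 3.0]
assert list(collected.doublearray) == [1.0, 2.0, 3.0]

spark.stop()
```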

Author: Xiang Gao <qasdfgtyuiop@gmail.com>
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18444 from zasdfgbnm/fix_array_infer.
2017-07-20 12:46:06 +09:00
__init__.py [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes" 2016-08-06 05:02:59 +01:00
catalog.py [SPARK-18777][PYTHON][SQL] Return UDF from udf.register 2017-05-06 22:28:42 -07:00
column.py [SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe 2017-05-01 09:43:32 -07:00
conf.py [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code 2016-05-23 18:14:48 -07:00
context.py [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs 2017-07-05 10:59:10 -07:00
dataframe.py [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas 2017-07-10 15:21:03 -07:00
functions.py [SPARK-21394][SPARK-21432][PYTHON] Reviving callable object/partial function support in UDF in PySpark 2017-07-17 00:37:36 -07:00
group.py [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation 2016-07-06 10:45:51 -07:00
readwriter.py [SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader 2017-06-24 11:39:41 +08:00
session.py [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema verification and improve exception message 2017-07-04 20:45:58 +08:00
streaming.py [SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader 2017-06-24 11:39:41 +08:00
tests.py [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python 2017-07-20 12:46:06 +09:00
types.py [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python 2017-07-20 12:46:06 +09:00
utils.py [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo 2017-01-04 15:07:29 +00:00
window.py [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames 2016-12-02 17:39:28 -08:00