spark-instrumented-optimizer

Kan Zhang 4fdb491775 [SPARK-2010] Support for nested data in PySpark SQL

JIRA issue https://issues.apache.org/jira/browse/SPARK-2010

This PR adds support for nested collection types in PySpark SQL, including array, dict, list, set, and tuple.

Example,

```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...     {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...     {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
... {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:

1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL

2014-06-16 11:11:29 -07:00
..
lib	SPARK-1004. PySpark on YARN	2014-04-29 23:24:34 -07:00
pyspark	[SPARK-2010] Support for nested data in PySpark SQL	2014-06-16 11:11:29 -07:00
test_support	License headers	2013-12-09 16:41:01 -08:00
.gitignore	SPARK-1004. PySpark on YARN	2014-04-29 23:24:34 -07:00
epydoc.conf	[SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs	2014-04-21 21:57:40 -07:00
run-tests	HOTFIX: A few PySpark tests were not actually run	2014-06-11 12:11:46 -07:00