Ahir Reddy c99bcb7fea SPARK-1374: PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports SQL queries.

```
from pyspark.sql import SQLContext  # classes moved into sql.py in this commit

sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1": 1, "field2": "row1"},
                      {"field1": 2, "field2": "row2"},
                      {"field1": 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)             # infer a schema from the dictionaries
sqlCtx.registerRDDAsTable(srdd, "table1")  # make the SchemaRDD queryable by name
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
srdd2.collect()
```
The last line yields:
```
[{"f1": 1, "f2": "row1"}, {"f1": 2, "f2": "row2"}, {"f1": 3, "f2": "row3"}]
```
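Since a SchemaRDD supports all the regular RDD operations (see commit c608947 below), the result can also be transformed directly, and the commit adds a HiveContext for running HiveQL queries when Spark is built with Hive support. A minimal sketch under those assumptions; the Hive table `src` is hypothetical:
```
# Rows come back as dictionaries, so ordinary RDD transformations apply
srdd2.map(lambda row: row["f1"] * 2).collect()  # [2, 4, 6]

# HiveQL requires an assembly built with SPARK_HIVE=true
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
hiveCtx.hql("SELECT key, value FROM src").collect()
```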

Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #363 from ahirreddy/pysql and squashes the following commits:

0294497 [Ahir Reddy] Updated log4j properties to suppress Hive warnings
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix for MIMA checker. Since we now assemble the Spark jar with Hive, we don't want to check the interfaces of all of our Hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
2014-04-15 00:07:55 -07:00
Directory listing of `python/pyspark`:

| File | Last commit | Date |
|------|-------------|------|
| `mllib` | SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining | 2014-04-10 11:17:41 -07:00 |
| `__init__.py` | SPARK-1374: PySpark API for SparkSQL | 2014-04-15 00:07:55 -07:00 |
| `accumulators.py` | Add custom serializer support to PySpark. | 2013-11-10 16:45:38 -08:00 |
| `broadcast.py` | Fix some Python docs and make sure to unset SPARK_TESTING in Python | 2013-12-29 20:15:07 -05:00 |
| `cloudpickle.py` | Rename top-level 'pyspark' directory to 'python' | 2013-01-01 15:05:00 -08:00 |
| `conf.py` | SPARK-1114: Allow PySpark to use existing JVM and Gateway | 2014-02-20 21:20:39 -08:00 |
| `context.py` | SPARK-1305: Support persisting RDD's directly to Tachyon | 2014-04-04 20:38:20 -07:00 |
| `daemon.py` | Add Apache license headers and LICENSE and NOTICE files | 2013-07-16 17:21:33 -07:00 |
| `files.py` | Initial work to rename package to org.apache.spark | 2013-09-01 14:13:13 -07:00 |
| `java_gateway.py` | SPARK-1374: PySpark API for SparkSQL | 2014-04-15 00:07:55 -07:00 |
| `join.py` | Spark 1271: Co-Group and Group-By should pass Iterable[X] | 2014-04-08 18:15:59 -07:00 |
| `rdd.py` | Spark 1271: Co-Group and Group-By should pass Iterable[X] | 2014-04-08 18:15:59 -07:00 |
| `rddsampler.py` | RDD sample() and takeSample() prototypes for PySpark | 2013-08-28 16:46:13 -07:00 |
| `resultiterable.py` | Spark 1271: Co-Group and Group-By should pass Iterable[X] | 2014-04-08 18:15:59 -07:00 |
| `serializers.py` | SPARK-1421. Make MLlib work on Python 2.6 | 2014-04-05 20:52:05 -07:00 |
| `shell.py` | Set spark.executor.uri from environment variable (needed by Mesos) | 2014-04-10 17:49:30 -07:00 |
| `sql.py` | SPARK-1374: PySpark API for SparkSQL | 2014-04-15 00:07:55 -07:00 |
| `statcounter.py` | Spark 1246 add min max to stat counter | 2014-03-18 00:45:47 -07:00 |
| `storagelevel.py` | SPARK-1305: Support persisting RDD's directly to Tachyon | 2014-04-04 20:38:20 -07:00 |
| `tests.py` | Fix for SPARK-1025: PySpark hang on missing files. | 2014-01-23 18:24:51 -08:00 |
| `worker.py` | SPARK-1115: Catch depickling errors | 2014-02-26 14:51:21 -08:00 |