spark-instrumented-optimizer


Michael Armbrust 158ad0bba9	[SPARK-2097][SQL] UDF Support	2014-08-02 16:33:48 -07:00

This patch adds the ability to register lambda functions written in Python, Java, or Scala as UDFs for use in SQL or HiveQL.

Scala:

```scala
registerFunction("strLenScala", (_: String).length)
sql("SELECT strLenScala('test')")
```

Python:

```python
sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
sqlCtx.sql("SELECT strLenPython('test')")
```

Java:

```java
sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
  @Override
  public Integer call(String str) throws Exception {
    return str.length();
  }
}, DataType.IntegerType);

sqlContext.sql("SELECT stringLengthJava('test')");
```

Author: Michael Armbrust <michael@databricks.com>

Closes #1063 from marmbrus/udfs and squashes the following commits:

9eda0fe [Michael Armbrust] newline
747c05e [Michael Armbrust] Add some scala UDF tests.
d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
005d684 [Michael Armbrust] Fix naming and formatting.
d14dac8 [Michael Armbrust] Fix last line of autogened java files.
8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
7a83101 [Michael Armbrust] Drop toString
795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
e54fb45 [Michael Armbrust] Docs and tests.
437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
8e6c932 [Michael Armbrust] WIP
3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
6237c8d [Michael Armbrust] WIP
2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
mllib	[SPARK-2478] [mllib] DecisionTree Python API	2014-08-02 13:07:17 -07:00
__init__.py	[SPARK-2724] Python version of RandomRDDGenerators	2014-07-31 20:32:57 -07:00
accumulators.py	SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark	2014-07-31 15:31:53 -07:00
broadcast.py	Fix some Python docs and make sure to unset SPARK_TESTING in Python	2013-12-29 20:15:07 -05:00
cloudpickle.py	[SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle	2014-07-29 01:02:18 -07:00
conf.py	[SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default	2014-07-24 18:15:37 -07:00
context.py	[SPARK-2454] Do not ship spark home to Workers	2014-08-02 00:45:38 -07:00
daemon.py	[SPARK-2764] Simplify daemon.py process structure	2014-08-01 19:38:21 -07:00
files.py	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
java_gateway.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
join.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
rdd.py	[SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD	2014-08-01 18:47:41 -07:00
rddsampler.py	[SPARK-2656] Python version of stratified sampling	2014-07-24 23:42:08 -07:00
resultiterable.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
serializers.py	[SPARK-2538] [PySpark] Hash based disk spilling aggregation	2014-07-24 22:53:47 -07:00
shell.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
shuffle.py	[SPARK-2538] [PySpark] Hash based disk spilling aggregation	2014-07-24 22:53:47 -07:00
sql.py	[SPARK-2097][SQL] UDF Support	2014-08-02 16:33:48 -07:00
statcounter.py	StatCounter on NumPy arrays [PYSPARK][SPARK-2012]	2014-08-01 22:33:25 -07:00
storagelevel.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
tests.py	StatCounter on NumPy arrays [PYSPARK][SPARK-2012]	2014-08-01 22:33:25 -07:00
worker.py	[SPARK-2580] [PySpark] keep silent in worker if JVM close the socket	2014-07-29 00:15:45 -07:00