spark-instrumented-optimizer/python/pyspark
Michael Armbrust 158ad0bba9 [SPARK-2097][SQL] UDF Support
This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.

Scala:
```scala
registerFunction("strLenScala", (_: String).length)
sql("SELECT strLenScala('test')")
```
Python:
```python
sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
sqlCtx.sql("SELECT strLenPython('test')")
```
Java:
```java
sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
  Override
  public Integer call(String str) throws Exception {
    return str.length();
  }
}, DataType.IntegerType);

sqlContext.sql("SELECT stringLengthJava('test')");
```

Author: Michael Armbrust <michael@databricks.com>

Closes #1063 from marmbrus/udfs and squashes the following commits:

9eda0fe [Michael Armbrust] newline
747c05e [Michael Armbrust] Add some scala UDF tests.
d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
005d684 [Michael Armbrust] Fix naming and formatting.
d14dac8 [Michael Armbrust] Fix last line of autogened java files.
8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
7a83101 [Michael Armbrust] Drop toString
795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
e54fb45 [Michael Armbrust] Docs and tests.
437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
8e6c932 [Michael Armbrust] WIP
3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
6237c8d [Michael Armbrust] WIP
2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
2014-08-02 16:33:48 -07:00
..
mllib [SPARK-2478] [mllib] DecisionTree Python API 2014-08-02 13:07:17 -07:00
__init__.py [SPARK-2724] Python version of RandomRDDGenerators 2014-07-31 20:32:57 -07:00
accumulators.py SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark 2014-07-31 15:31:53 -07:00
broadcast.py Fix some Python docs and make sure to unset SPARK_TESTING in Python 2013-12-29 20:15:07 -05:00
cloudpickle.py [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle 2014-07-29 01:02:18 -07:00
conf.py [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default 2014-07-24 18:15:37 -07:00
context.py [SPARK-2454] Do not ship spark home to Workers 2014-08-02 00:45:38 -07:00
daemon.py [SPARK-2764] Simplify daemon.py process structure 2014-08-01 19:38:21 -07:00
files.py Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
java_gateway.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
join.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
rdd.py [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD 2014-08-01 18:47:41 -07:00
rddsampler.py [SPARK-2656] Python version of stratified sampling 2014-07-24 23:42:08 -07:00
resultiterable.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
serializers.py [SPARK-2538] [PySpark] Hash based disk spilling aggregation 2014-07-24 22:53:47 -07:00
shell.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
shuffle.py [SPARK-2538] [PySpark] Hash based disk spilling aggregation 2014-07-24 22:53:47 -07:00
sql.py [SPARK-2097][SQL] UDF Support 2014-08-02 16:33:48 -07:00
statcounter.py StatCounter on NumPy arrays [PYSPARK][SPARK-2012] 2014-08-01 22:33:25 -07:00
storagelevel.py [SPARK-2470] PEP8 fixes to PySpark 2014-07-21 22:30:53 -07:00
tests.py StatCounter on NumPy arrays [PYSPARK][SPARK-2012] 2014-08-01 22:33:25 -07:00
worker.py [SPARK-2580] [PySpark] keep silent in worker if JVM close the socket 2014-07-29 00:15:45 -07:00